
[🐛BUG] Implausible metrics? #622

Closed
deklanw opened this issue Dec 25, 2020 · 7 comments
Labels
bug Something isn't working

Comments

@deklanw
Contributor

deklanw commented Dec 25, 2020

Trying out my implementation of SLIM with ElasticNet (#621), I'm noticing some implausible numbers. The dataset is ml-100k with all defaults, using the default hyperparameters of my method defined in its yaml file (not yet well chosen, because these results are so off): https://github.com/RUCAIBox/RecBole/blob/41a06e59ab26482dbfac641caac99876c167168c/recbole/properties/model/SLIMElastic.yaml
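(For reference, the yaml above only declares the model's default hyperparameters, and they can also be overridden in code through Config's config_dict argument. A rough sketch follows, using the hyperparameter names that appear in the HyperOpt log further down; the values are placeholders, not the repo's actual defaults.)

# Hypothetical override of the SLIMElastic hyperparameters seen in the
# HyperOpt log below; the values here are placeholders, not real defaults.
from recbole.config import Config

override = {
    'alpha': 0.2,
    'l1_ratio': 0.02,
    'positive_only': True,
    'hide_item': True,
}
config = Config(model='SLIMElastic', dataset='ml-100k', config_dict=override)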

Using this standard copy-pasted code

from logging import getLogger

from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.trainer import Trainer
from recbole.utils import init_logger, init_seed

# SLIMElastic is the model class from my PR #621

dataset_name = "ml-100k"

model = SLIMElastic

config = Config(model=model, dataset=dataset_name)
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

logger.info(config)

# dataset filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = model(config, train_data).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)

logger.info('best valid result: {}'.format(best_valid_result))
logger.info('test result: {}'.format(test_result))

Results:
INFO test result: {'recall@10': 0.8461, 'mrr@10': 0.5374, 'ndcg@10': 0.7102, 'hit@10': 1.0, 'precision@10': 0.6309}

Also, my HyperOpt log is highly suspicious

alpha:0.316482837679784, hide_item:False, l1_ratio:0.9890017268444972, positive_only:False
Valid result:
recall@10 : 0.8461    mrr@10 : 0.5368    ndcg@10 : 0.7099    hit@10 : 1.0000    precision@10 : 0.6309    
Test result:
recall@10 : 0.8461    mrr@10 : 0.5374    ndcg@10 : 0.7102    hit@10 : 1.0000    precision@10 : 0.6309

...

alpha:0.47984629320482386, hide_item:False, l1_ratio:0.9907136437218732, positive_only:True
Valid result:
recall@10 : 0.8461    mrr@10 : 0.5368    ndcg@10 : 0.7099    hit@10 : 1.0000    precision@10 : 0.6309    
Test result:
recall@10 : 0.8461    mrr@10 : 0.5374    ndcg@10 : 0.7102    hit@10 : 1.0000    precision@10 : 0.6309

...

alpha:0.9530393537754144, hide_item:True, l1_ratio:0.24064058250190196, positive_only:True
Valid result:
recall@10 : 0.6251    mrr@10 : 0.3611    ndcg@10 : 0.4954    hit@10 : 0.9650    precision@10 : 0.4709    
Test result:
recall@10 : 0.6535    mrr@10 : 0.4012    ndcg@10 : 0.5357    hit@10 : 0.9745    precision@10 : 0.5019    

Exact same results with different parameters?

I figure if there is a mistake in my implementation it would cause bad performance, not amazing performance.

Anyone know what could be causing this?

deklanw added the bug label on Dec 25, 2020
@ShanleiMu
Member

ShanleiMu commented Dec 26, 2020

I printed the prediction scores. It seems all the scores are 0, which is what causes the implausible metrics.

To speed up the full-sort prediction, we put the ground-truth items at the beginning of the list to be sorted, and the sort is stable. So if all the candidate items have the same score, the ground-truth items stay at the top and the model appears to achieve high performance.


I must admit that the current sorting evaluation method is not very good.
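To make the effect concrete, here is a minimal sketch (not RecBole's actual evaluator) of what a stable descending sort does when the ground-truth items have been re-organized to the front of the candidate list and every score is zero:

# Minimal sketch of the ground-truth-first + stable-sort effect; n_items and
# n_pos are made-up numbers, not taken from the dataset.
import torch

n_items = 1000
n_pos = 5                        # ground-truth items, placed at positions 0..4
scores = torch.zeros(n_items)    # a degenerate model: every item scores 0

# A stable descending sort keeps the original order among ties, so the
# ground-truth items stay at the very top of the ranking.
_, ranking = torch.sort(scores, descending=True, stable=True)

hits = (ranking[:10] < n_pos).sum().item()
print(f"hit@10 = {1.0 if hits > 0 else 0.0}, precision@10 = {hits / 10}")
# -> hit@10 = 1.0, precision@10 = 0.5, even though the model learned nothing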

@deklanw
Contributor Author

deklanw commented Dec 26, 2020

@ShanleiMu Ah! I see. That is tricky.

The output of all 0s is, I believe, caused by the hyperparameter controlling the L1 coefficient in the regression being too large. I was hoping to use HyperOpt to determine reasonable values for that coefficient (it's probably problem-specific), but with the evaluation working like this, I don't see how I could make that determination!

Is there a way around this?

@tsotfsk
Contributor

tsotfsk commented Dec 27, 2020


Thanks to Shan Lei for the quick reply; I'd like to add some supplementary explanation. In fact, none of the top-K metrics we implement can handle items that share the same score (GAUC is the exception, because it uses the average rank to resolve ties). Frankly, for deep-learning recommendation algorithms, items having exactly the same score is, in my understanding, a very low-probability event, and most open-source recommendation codebases, such as recommenders and NeuRec, do not handle this situation either. But for non-deep-learning algorithms it can be common for some items to share a score, especially in the early stages of training.

The sorting trick we use causes positive items to always appear at the top of any group of items that share a score, because we have a re-organizing stage to speed up the evaluation. For more information about this trick, see the evaluation documentation.

Maybe we should preserve randomness in the positions of positive items when scores are tied. We will discuss this problem and try to come up with a feasible solution to alleviate it!
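As a sketch of that idea (an illustration only, not RecBole's implementation): shuffle the items first and then apply a stable descending sort, so that tied items land in random order instead of always ground-truth-first.

# Random tie-breaking via a shuffle followed by a stable sort; not RecBole's code.
import torch

def rank_with_random_tiebreak(scores: torch.Tensor) -> torch.Tensor:
    perm = torch.randperm(scores.numel())                    # random item order
    _, order = torch.sort(scores[perm], descending=True, stable=True)
    return perm[order]                                       # ties broken randomly

print(rank_with_random_tiebreak(torch.zeros(1000))[:10])     # a random top-10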

@tsotfsk
Contributor

tsotfsk commented Dec 27, 2020

Hi @deklanw! I think randomly generating numbers in a small range and adding them to the scores may solve this problem, because it avoids items having exactly the same score. If the model hasn't learned anything, the results will be very poor; if the model has learned some clear signal, the small random numbers won't have much impact on the results. This may help you determine a reasonable range for the L1 parameter.

@deklanw
Contributor Author

deklanw commented Dec 27, 2020

@tsotfsk Thanks, that all makes sense.

The random-noise idea worked well. But it can't be just a temporary solution, because I believe the appropriate L1 hyperparameter depends on the problem. I've chosen hyperparameter defaults that work well on ml-100k, but that's as far as I can tell.

def add_noise(t, mag=1e-5):
    # Break score ties by adding a tiny amount of uniform random noise.
    return t + mag * torch.rand(t.shape)

...

    def predict(self, interaction):
        user = interaction[self.USER_ID].cpu().numpy()
        item = interaction[self.ITEM_ID].cpu().numpy()

        # Score each (user, item) pair from the learned item-item similarities.
        r = torch.from_numpy((self.interaction_matrix[user, :].multiply(
            self.item_similarity[:, item].T)).sum(axis=1).getA1())

        return add_noise(r)

    def full_sort_predict(self, interaction):
        user = interaction[self.USER_ID].cpu().numpy()

        # Score every item for the given users.
        r = self.interaction_matrix[user, :] @ self.item_similarity
        r = torch.from_numpy(r.todense().getA1())

        return add_noise(r)

I'm fine with leaving this in permanently.

Seem fine?

@tsotfsk
Contributor

tsotfsk commented Dec 28, 2020

Hi @deklanw! 😊 Your code looks fine, and I tested the HyperOpt module after adding noise; it also works well. The implausible metrics disappear, and the effect of the noise on the results is small enough to be ignored.

@deklanw
Contributor Author

deklanw commented Dec 28, 2020

Thanks for the help
