
[🐛BUG] Implausible metrics? #622

Closed
deklanw opened this issue Dec 25, 2020 · 7 comments
Labels
bug Something isn't working

Comments

@deklanw
Contributor

deklanw commented Dec 25, 2020

Trying out my implementation of SLIM with ElasticNet (#621), I'm noticing some implausible numbers. The dataset is ml-100k with all defaults, using the default hyperparameters of my method defined in its yaml file (not yet well chosen, because these results are so off): https://github.com/RUCAIBox/RecBole/blob/41a06e59ab26482dbfac641caac99876c167168c/recbole/properties/model/SLIMElastic.yaml
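(For reference, the yaml above only declares the model's default hyperparameters, and they can also be overridden in code through Config's config_dict argument. A rough sketch follows, using the hyperparameter names that appear in the HyperOpt log further down; the values are placeholders, not the repo's actual defaults.)

# Hypothetical override of the SLIMElastic hyperparameters seen in the
# HyperOpt log below; the values here are placeholders, not real defaults.
from recbole.config import Config

override = {
    'alpha': 0.2,
    'l1_ratio': 0.02,
    'positive_only': True,
    'hide_item': True,
}
config = Config(model='SLIMElastic', dataset='ml-100k', config_dict=override)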

Using this standard copy-pasted code

from logging import getLogger

from recbole.config import Config
from recbole.data import create_dataset, data_preparation
from recbole.trainer import Trainer
from recbole.utils import init_logger, init_seed

# SLIMElastic is the model class from my PR #621

dataset_name = "ml-100k"

model = SLIMElastic

config = Config(model=model, dataset=dataset_name)
init_seed(config['seed'], config['reproducibility'])

# logger initialization
init_logger(config)
logger = getLogger()

logger.info(config)

# dataset filtering
dataset = create_dataset(config)
logger.info(dataset)

# dataset splitting
train_data, valid_data, test_data = data_preparation(config, dataset)

# model loading and initialization
model = model(config, train_data).to(config['device'])
logger.info(model)

# trainer loading and initialization
trainer = Trainer(config, model)

# model training
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)

# model evaluation
test_result = trainer.evaluate(test_data)

logger.info('best valid result: {}'.format(best_valid_result))
logger.info('test result: {}'.format(test_result))

Results:
INFO test result: {'recall@10': 0.8461, 'mrr@10': 0.5374, 'ndcg@10': 0.7102, 'hit@10': 1.0, 'precision@10': 0.6309}

Also, my HyperOpt log is highly suspicious

alpha:0.316482837679784, hide_item:False, l1_ratio:0.9890017268444972, positive_only:False
Valid result:
recall@10 : 0.8461    mrr@10 : 0.5368    ndcg@10 : 0.7099    hit@10 : 1.0000    precision@10 : 0.6309    
Test result:
recall@10 : 0.8461    mrr@10 : 0.5374    ndcg@10 : 0.7102    hit@10 : 1.0000    precision@10 : 0.6309

...

alpha:0.47984629320482386, hide_item:False, l1_ratio:0.9907136437218732, positive_only:True
Valid result:
recall@10 : 0.8461    mrr@10 : 0.5368    ndcg@10 : 0.7099    hit@10 : 1.0000    precision@10 : 0.6309    
Test result:
recall@10 : 0.8461    mrr@10 : 0.5374    ndcg@10 : 0.7102    hit@10 : 1.0000    precision@10 : 0.6309

...

alpha:0.9530393537754144, hide_item:True, l1_ratio:0.24064058250190196, positive_only:True
Valid result:
recall@10 : 0.6251    mrr@10 : 0.3611    ndcg@10 : 0.4954    hit@10 : 0.9650    precision@10 : 0.4709    
Test result:
recall@10 : 0.6535    mrr@10 : 0.4012    ndcg@10 : 0.5357    hit@10 : 0.9745    precision@10 : 0.5019    

Exact same results with different parameters?

I figure if there is a mistake in my implementation it would cause bad performance, not amazing performance.

Anyone know what could be causing this?

deklanw added the bug label on Dec 25, 2020
@ShanleiMu
Member

ShanleiMu commented Dec 26, 2020

I printed the prediction scores. It seems all the scores are 0, which is what causes the implausible metrics.

To speed up the full-sort prediction, we put the ground-truth items at the beginning of the list to be sorted, and the sort is stable. So if all the candidate items have the same score, the ground-truth items stay at the top and the model appears to achieve high performance.


I must admit that the current sorting evaluation method is not very good.
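To make the effect concrete, here is a minimal sketch (not RecBole's actual evaluator) of what a stable descending sort does when the ground-truth items have been re-organized to the front of the candidate list and every score is zero:

# Minimal sketch of the ground-truth-first + stable-sort effect; n_items and
# n_pos are made-up numbers, not taken from the dataset.
import torch

n_items = 1000
n_pos = 5                        # ground-truth items, placed at positions 0..4
scores = torch.zeros(n_items)    # a degenerate model: every item scores 0

# A stable descending sort keeps the original order among ties, so the
# ground-truth items stay at the very top of the ranking.
_, ranking = torch.sort(scores, descending=True, stable=True)

hits = (ranking[:10] < n_pos).sum().item()
print(f"hit@10 = {1.0 if hits > 0 else 0.0}, precision@10 = {hits / 10}")
# -> hit@10 = 1.0, precision@10 = 0.5, even though the model learned nothing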

@deklanw
Contributor Author

deklanw commented Dec 26, 2020

@ShanleiMu Ah! I see. That is tricky.

The output of all 0s is, I believe, caused by the hyperparameter controlling the L1 coefficient in the regression being too large. I was hoping to use HyperOpt to determine reasonable values for that coefficient (it's probably problem-specific), but with the evaluation working like this, I don't see how I could make that determination!

Is there a way around this?

@tsotfsk
Contributor

tsotfsk commented Dec 27, 2020


Thanks to Shan Lei for the quick reply; I'd like to add some supplementary explanation. In fact, none of the top-K metrics we implement can handle items that share the same score (GAUC is the exception, because it uses the average rank to resolve ties). Frankly, for deep-learning recommendation algorithms, items having exactly the same score is, in my understanding, a very low-probability event, and most open-source recommendation codebases, such as recommenders and NeuRec, do not handle this situation either. But for non-deep-learning algorithms it can be common for some items to share a score, especially in the early stages of training.

The sorting trick we use causes positive items to always appear at the top of any group of items that share a score, because we have a re-organizing stage to speed up the evaluation. For more information about this trick, see the evaluation documentation.

Maybe we should preserve randomness in the positions of positive items when scores are tied. We will discuss this problem and try to come up with a feasible solution to alleviate it!
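As a sketch of that idea (an illustration only, not RecBole's implementation): shuffle the items first and then apply a stable descending sort, so that tied items land in random order instead of always ground-truth-first.

# Random tie-breaking via a shuffle followed by a stable sort; not RecBole's code.
import torch

def rank_with_random_tiebreak(scores: torch.Tensor) -> torch.Tensor:
    perm = torch.randperm(scores.numel())                    # random item order
    _, order = torch.sort(scores[perm], descending=True, stable=True)
    return perm[order]                                       # ties broken randomly

print(rank_with_random_tiebreak(torch.zeros(1000))[:10])     # a random top-10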

@tsotfsk
Contributor

tsotfsk commented Dec 27, 2020

Hi @deklanw! I think randomly generating numbers in a small range and adding them to the scores may solve this problem, because it avoids items having exactly the same score. If the model hasn't learned anything, the results will be very poor; if the model has learned some clear signal, the small random numbers won't have much impact on the results. This may help you determine a reasonable range for the L1 parameter.

@deklanw
Contributor Author

deklanw commented Dec 27, 2020

@tsotfsk Thanks, that all makes sense.

The random-noise idea worked well. But it can't be just a temporary solution, because I believe the appropriate L1 hyperparameter depends on the problem. I've chosen hyperparameter defaults that work well on ml-100k, but that's as far as I can tell.

def add_noise(t, mag=1e-5):
    # Break score ties by adding a tiny amount of uniform random noise.
    return t + mag * torch.rand(t.shape)

...

    def predict(self, interaction):
        user = interaction[self.USER_ID].cpu().numpy()
        item = interaction[self.ITEM_ID].cpu().numpy()

        # Score each (user, item) pair from the learned item-item similarities.
        r = torch.from_numpy((self.interaction_matrix[user, :].multiply(
            self.item_similarity[:, item].T)).sum(axis=1).getA1())

        return add_noise(r)

    def full_sort_predict(self, interaction):
        user = interaction[self.USER_ID].cpu().numpy()

        # Score every item for the given users.
        r = self.interaction_matrix[user, :] @ self.item_similarity
        r = torch.from_numpy(r.todense().getA1())

        return add_noise(r)

I'm fine with leaving this in permanently.

Seem fine?

@tsotfsk
Contributor

tsotfsk commented Dec 28, 2020

Hi @deklanw! 😊 Your code looks fine, and I tested the HyperOpt module after adding noise; it also works well. The implausible metrics disappear, and the effect of the noise on the results is small enough to be ignored.

@deklanw
Contributor Author

deklanw commented Dec 28, 2020

Thanks for the help
