Concerns regarding data sampling #3

Open · Miopha opened this issue Nov 29, 2024 · 6 comments

Comments

@Miopha commented Nov 29, 2024

Hi, I really appreciate your work, and I have the following concerns.
I have observed that when I turned off federated learning (i.e., disabled downloading the item embeddings from the server), specifically by modifying engine.py as follows:

# user_param_dict['embedding_item.weight'] = copy.deepcopy(self.server_model_param['embedding_item.weight'].data).cuda()
user_param_dict['embedding_item.weight'] = user_param_dict['embedding_item.weight'].cuda()

I then got remarkably good performance (HR@10 = 1.0000, NDCG@10 = 0.9775), which is abnormally better than the proposed method.
So I dug deeper and found that the negative items used in the training process are never sampled from a user's test (or validation) items, because the sampling is performed over all of the user's ratings:

self.negatives = self._sample_negative(self.ratings)

def _sample_negative(self, ratings):
    interact_status = ratings.groupby('userId')['itemId'].apply(set).reset_index().rename(columns={'itemId': 'interacted_items'})
    interact_status['negative_items'] = interact_status['interacted_items'].apply(lambda x: self.item_pool - x)
    interact_status['negative_samples'] = interact_status['negative_items'].apply(lambda x: random.sample(x, 198))
    return interact_status[['userId', 'negative_items', 'negative_samples']]

Here self.ratings contains all of the client's ratings (train, validation, and test). Since a client should not have access to its test (or validation) data during the training process, doesn't this potentially lead to data leakage? The correct code, I believe, should be

self.negatives = self._sample_negative(self.train_ratings)

I'm not sure whether I have missed something, so I would appreciate any advice.
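For concreteness, here is a minimal sketch of the sampling I have in mind. It reuses the names from the snippet above (userId, itemId, self.item_pool, the 198-sample count); train_ratings and num_negatives are my own assumptions rather than the repository's actual code:

import random

def _sample_negative(self, train_ratings, num_negatives=198):
    # train_ratings is assumed to be a pandas DataFrame holding only the
    # training interactions, so validation/test items remain in the negative
    # pool and are never implicitly revealed to the client during training.
    interact_status = (
        train_ratings.groupby('userId')['itemId']
        .apply(set)
        .reset_index()
        .rename(columns={'itemId': 'interacted_items'})
    )
    interact_status['negative_items'] = interact_status['interacted_items'].apply(
        lambda items: self.item_pool - items
    )
    # list(...) because random.sample no longer accepts sets in recent Python versions
    interact_status['negative_samples'] = interact_status['negative_items'].apply(
        lambda items: random.sample(list(items), num_negatives)
    )
    return interact_status[['userId', 'negative_items', 'negative_samples']]

# called with training ratings only:
# self.negatives = self._sample_negative(self.train_ratings)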

@Miopha (Author) commented Nov 29, 2024

[1] and [2] seem to have the same issue, as they share very similar data-sampling code, which may also involve data leakage.
[1] GPFedRec: Graph-Guided Personalization for Federated Recommendation. KDD 2024
[2] Federated recommendation with additive personalization. ICLR 2024

@Zhangcx19 (Owner)

Thank you for your interest in our work! I am glad to clarify this.

In your experimental setting, users train the recommendation model independently using their personal training data. As the number of training iterations increases, the model observes an increasing number of negative data samples. This may lead to overfitting, where observed negative samples are inferred with low scores and unseen test items end up being ranked relatively high, which yields abnormally good performance. This issue is more likely to occur when the total number of items is small (e.g., ml-100k), and less likely to arise when the dataset is larger and more representative of real-world recommendation scenarios (e.g., lastfm-2k).

I would like to clarify that this overfitting phenomenon does not constitute data leakage. It can be alleviated by incorporating larger, more realistic recommendation datasets.

Additionally, in your proposed setup, the test item will be treated as a negative sample during training, which can lead to unstable training because the same item has conflicting labels (positive in the test set, negative during training).
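To make that conflict concrete, here is a small toy example (made-up numbers, not code from the repository): when negatives are drawn only from items outside the training interactions, a held-out test item can itself be sampled as a training negative.

import random

item_pool = set(range(10))   # toy catalogue of 10 items
train_items = {1, 3, 5}      # a user's training positives
test_item = 7                # the user's held-out positive

negative_pool = item_pool - train_items          # sampling ignores the test item
print(test_item in negative_pool)                # True: item 7 may be labeled 0 in training
print(random.sample(sorted(negative_pool), 4))   # can contain 7, which is labeled 1 at test time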

I hope this explanation helps address your concerns!

@1180301026

I observed the same issue and believe there is significant data leakage, stemming from both the experimental setup and the model design in the paper. In this study, an item is represented solely by its ID embedding, which results in test-set items being present in the training set. This leakage leads to inflated model performance, with the model performing even better as more negative samples are leaked; this is evident in the 100% HR achieved on the ML-100K dataset when federated sharing is removed. From the model-design perspective, the paper essentially reduces the recommendation problem to a binary classification task, where using only ID embeddings for items inevitably causes data leakage.

@Miopha (Author) commented Dec 4, 2024

I've evaluated the performance on both the test and validation sets, and based on this I don't believe the model is overfitting. What I argue is that the test set should not be accessible during the training process; if it is, you are effectively recommending it to the client. Consequently, in my opinion, conflicting labels can legitimately occur in recommender systems.

@gsy901 commented Dec 16, 2024

I've observed the same problem. Is there any solution?

@boring-orange

I have encountered the same issue. How does the negative sampling strategy guarantee that the negative samples used in the test phase have not been included in the training phase?🧐
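One possible way to make that guarantee explicit (a rough sketch under assumed names, not the repository's code) is to build the evaluation negatives first and then remove them from the training negative pool, so the two phases never share a sample:

import random

def build_negative_pools(item_pool, train_items, heldout_items, num_eval_negatives=99, seed=0):
    rng = random.Random(seed)
    # Evaluation negatives: exclude everything the user interacted with (standard protocol).
    eval_candidates = sorted(item_pool - train_items - heldout_items)
    eval_negatives = rng.sample(eval_candidates, num_eval_negatives)
    # Training negatives: exclude only training positives (see the leakage discussion above),
    # then drop the evaluation negatives so train-phase and test-phase samples stay disjoint.
    train_negative_pool = (item_pool - train_items) - set(eval_negatives)
    return eval_negatives, train_negative_pool

eval_negs, train_pool = build_negative_pools(set(range(1000)), {1, 2, 3}, {4})
assert not set(eval_negs) & train_pool  # disjoint by construction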
