danb27/sar_single_node_improvements: allow for getting most frequent users, similar users, and fix self.item_frequencies to actually use frequencies as stated in documentation #1666

Conversation
The issue (I'm no longer sure that it is a bug) with the frequencies was introduced in #1588. The problem happens when there are duplicate user-item pairs in the dataset. A note was added to the docstring telling the user not to provide duplicates, but no warning or error is thrown if duplicates are present. (If an error were thrown, a lot of our current tests would also fail.) Without any warning and without dropping duplicates, you can get weird results when the user DOES have duplicates (see the sketch below). I added a warning if duplicates are found in the dataset, but I think we can still improve by changing how we calculate frequency to be unrelated to the cooccurrence matrix (just count occurrences in the user-provided dataframe). Any thoughts?
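For illustration, here is a minimal numeric sketch of that failure mode (made-up data, not from the PR): with one duplicated (user, item) pair, the diagonal of the cooccurrence matrix no longer equals the raw item frequency.

import numpy as np
import pandas as pd

df = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 10, 20]})
# pivot without deduping: the duplicate pair contributes a count of 2
user_item = pd.crosstab(df["userID"], df["itemID"]).to_numpy()
cooccurrence = user_item.T @ user_item
print(np.diag(cooccurrence))        # [4 1] -- item 10 looks 4x as frequent
print(df["itemID"].value_counts())  # item 10 actually appears only twice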
Codecov Report

@@            Coverage Diff             @@
##           staging    #1666       +/-  ##
============================================
+ Coverage     0.00%   23.33%   +23.33%
============================================
  Files           88       87        -1
  Lines         9121     9132       +11
============================================
+ Hits             0     2131     +2131
- Misses           0     7001     +7001

Flags with carried forward coverage won't be shown. Continue to review the full report at Codecov.
@danb27 this is fantastic work; I added some questions and suggestions. I'll add more people to review because the work you are doing is very good.
One thing I would ask, Dan: please be patient with the review process. SAR is a key algorithm for us with a lot of implications, so it will take us some time to review.
""" | ||
if sum(df[[self.col_user, self.col_item]].duplicated()) > 0: |
@danb27 @anargyri @simonzhaoms @loomlike @gramhagen can we have a discussion on whether we want to emit a warning or throw an error when there are duplicates? What are your perspectives?
I think either option is better than just having a docstring. If there is a use case for running SAR with duplicates, then I think a warning is best. But if we believe that the algorithm should NEVER see duplicates, then I would choose an error.
If we go with an error, we will need to fix multiple tests. If we go with a warning, we don't need to fix the tests (though we probably still should, IMO). I was hesitant to change a bunch of tests without first having this discussion.
Let me know what you all think.
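For concreteness, a minimal sketch of the two options under discussion; check_duplicates is a hypothetical helper and the column names are illustrative, not the final implementation.

import warnings
import pandas as pd

def check_duplicates(df, col_user="userID", col_item="itemID", strict=False):
    # strict=True raises an error, strict=False only warns -- the two options above
    if df.duplicated(subset=[col_user, col_item]).any():
        msg = "Duplicate (user, item) pairs found in the input dataframe."
        if strict:
            raise ValueError(msg)
        warnings.warn(msg)

df = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 10, 20]})
check_duplicates(df)                   # warns
# check_duplicates(df, strict=True)    # would raise ValueError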
If we have duplicates, will we be artificially modifying the cooccurrence matrix? If that's the case, then we can throw an error.
FYI @anargyri
So currently, yes, duplicates will give us weird results in the cooccurrence matrix. We can fix this by simply removing duplicates from the dataframe before calculating the cooccurrence matrix, like this:

df = df.drop_duplicates(subset=[self.col_user, self.col_item])

That way the user can provide duplicates: it won't harm our cooccurrence matrix, our user and item frequencies will still be accurate (including duplicates, if there are any) because of my new way of retrieving the frequencies, and all the unit tests will still pass.
If we do not do this, then all unit tests that use this dataset will fail: https://github.com/microsoft/recommenders/blob/main/tests/conftest.py#L96, which seems very suspicious to me.
@danb27 for this I would say that it is better to check whether there are duplicates, and if so, throw an error.
In my experience df.duplicated().any() is faster than any(df.duplicated()), but I'm not sure whether your implementation is faster. In sum(df[[self.col_user, self.col_item]].duplicated()) I would also consider the ratings column. I can see a case where the ratings relate to clicks and purchases, where both signals are valid, so we want to keep both.
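A quick illustration of the two checks being compared (made-up data): both agree on the result, but .any() evaluates the boolean mask in vectorized C code, while sum() on a Series iterates element by element in Python.

import pandas as pd

df = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 10, 20]})
has_dups_any = df.duplicated(subset=["userID", "itemID"]).any()   # vectorized
has_dups_sum = sum(df.duplicated(subset=["userID", "itemID"])) > 0  # Python loop
assert has_dups_any == has_dups_sum == True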
@miguelgfierro The csv file with duplicates is located at https://recodatasets.z20.web.core.windows.net/sarunittest/demoUsage.csv
I tested locally after deduping with this code:
df = df.sort_values("Timestamp", ascending=False)
df = df.drop_duplicates(["UserId", "MovieId", "Timestamp"], keep="first")
This is enough to pass all of the unit tests with the code I just pushed. The tests will fail for now, but once you get the deduped version up in storage I can swap the link and everything should be ready to go.
Alternatively, I could add the above code when we load the data here: https://github.com/microsoft/recommenders/blob/main/tests/conftest.py#L139, or add it to the tests that use the data, but I think those options would be worse because in my mind the test dataset should be ready to go. Do you have a preference?
Here you go: https://recodatasets.z20.web.core.windows.net/sarunittest/demoUsageNoDups.csv
I agree with you; it's probably better to use the data directly.
I'm getting this error with the above link: urllib.error.HTTPError: HTTP Error 404: The requested content does not exist.
Sorry, I didn't set the right permissions. It should work now.
Tests passed locally! The PR is hopefully good to go now after the latest commit.
def get_user_based_topk(self, users, top_k=10, sort_top_k=True):
    """Get top K similar users to provided seed users based on similarity metric defined.

    This method will take a set of users and use them to recommend the most similar users to that set
    based on the similarity matrix fit during training.

    Args:
        users (pandas.DataFrame): DataFrame with user, item (optional), and rating (optional) columns
        top_k (int): number of top users to recommend
        sort_top_k (bool): flag to sort top k results

    Returns:
        pandas.DataFrame: similar users to the users provided in users
    """
One question @danb27: is the computation above of the user cooccurrence only used when we want to use the get_user_based_topk method?
Correct. In fit(), if we set compute_user_similarity=True, user_cooccurrence will be calculated and used to create self.user_similarity, which is required to run get_user_based_topk.
I went back and forth on whether to always compute user similarity, but decided to create the compute_user_similarity parameter so that the user can keep performance the same as before the PR if they do not need the added functionality.
Happy to have my mind changed if you all would prefer to have this computation occur every time, so that you can always call any of the methods regardless of how you fit.
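A rough, hypothetical sketch of the opt-in shape being described; the names follow the PR discussion, but the co-occurrence product here is illustrative and not necessarily SAR's exact similarity computation.

import numpy as np
import scipy.sparse as sp

def fit_user_similarity(user_item, compute_user_similarity=False):
    # user_item: sparse user-by-item interaction matrix built during fit()
    if not compute_user_similarity:
        return None  # skip the extra cost when the feature is not needed
    # user-user cooccurrence: how many items each pair of users shares
    return (user_item @ user_item.T).tocsr()

ui = sp.csr_matrix(np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]]))
user_sim = fit_user_similarity(ui, compute_user_similarity=True)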
What would be the change required, and the performance difference, if we did all the computation of user_cooccurrence inside the get_user_based_topk function?
In theory, we would calculate both user_cooccurrence and self.user_similarity (they will be different unless 'cooccurrence' is the similarity type) the first time get_user_based_topk is used, by checking if self.user_similarity is None (see the sketch below). But there are a couple of problems with that:
- The first time you call the method, it would take a lot longer than subsequent calls (with a non-trivial dataset it can take some time to calculate user similarity).
- The first call to the method would be significantly slower than calling the get_item_based_topk method, which might be perceived as a bug.
- I am envisioning the use case (which is my current use case) where someone wants to store similar users for every user in their dataset, so it is necessary to call this method n_users times.
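For reference, a small hypothetical illustration of the lazy pattern being weighed above; the PR instead computes user similarity eagerly in fit() behind the compute_user_similarity flag.

class LazyUserSimilarity:
    def __init__(self):
        self.user_similarity = None

    def _compute_user_similarity(self):
        # stand-in for the expensive user cooccurrence/similarity step
        return {"user_similarity": "computed"}

    def get_user_based_topk(self, users, top_k=10):
        if self.user_similarity is None:  # first call pays the full cost
            self.user_similarity = self._compute_user_similarity()
        return users[:top_k]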
> I am envisioning the use case (which is my current use case) where someone wants to store similar users for every user in their dataset, so it is necessary to call this method n_users times.

Ideally, it would be great that, given a list of user_ids (or a DF), the system returns a DF (or list of lists, whatever) with the top k most similar users for each input user_id in the list. It looks like that is not happening, right?
Some questions:
- How can we create this result?
- Right now we can input a list of user_ids; what is the meaning of the output?
Currently, I call get_item_based_topk() like this:

for item_id in item_ids:
    temp_df = pd.DataFrame({'item_id': [item_id]}, dtype=DTYPE)
    similar_items_df = model.get_item_based_topk(temp_df, top_k=K)
    ...
With my PR, I would be able to do something very similar for get_user_based_topk().
We could implement this as a different method, because what currently occurs if you give this method a list of user ids is the following:
- From the docstring of get_item_based_topk(): "This method will take a set of items and use them to recommend the most similar items to that set based on the similarity matrix fit during training. This allows recommendations for cold-users (unseen during training); note the model is not updated." The same is true for the new get_user_based_topk() method: it returns the most similar users given a set of other users.
I think it would be valuable to have both the item and user versions. Given a list of items, I would like to get a list of the most similar ones for each item; and given a list of users, a list of the most similar users for each user.
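A hypothetical sketch of that batch behavior, assuming a dense user-user similarity matrix indexed consistently with user_ids; topk_similar_users is an illustrative helper, not the recommenders API.

import numpy as np

def topk_similar_users(sim, user_ids, k=2):
    # sim: dense (n_users, n_users) similarity matrix; row i scores user i
    results = {}
    for i, uid in enumerate(user_ids):
        scores = sim[i].astype(float).copy()
        scores[i] = -np.inf                       # exclude the user itself
        top = np.argsort(scores)[::-1][:k]
        results[uid] = [user_ids[j] for j in top]
    return results

sim = np.array([[3, 2, 1], [2, 3, 0], [1, 0, 3]])
print(topk_similar_users(sim, ["a", "b", "c"]))  # {'a': ['b', 'c'], ...}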
@miguelgfierro Thanks for the initial review! I agree with all of your comments and will push a commit once we resolve some of the open questions with you and the other engineers. Re: needing time to review: no problem at all! I am a big fan of the package and want to make sure we get this right!
""" | ||
if sum(df[[self.col_user, self.col_item]].duplicated()) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have duplicates, will we be modifying artificially the coocurrence matrix? if that's the case, then we can throw an error.
FYI @anargyri
@miguelgfierro Sorry for the back and forth, but your comments made me realize a much simpler and better way to achieve what I was trying to do. Here is a summary of what the PR does now (see the updated Description section at the end of this page).
I think these changes are a lot better and hopefully make the PR more palatable. What do you think? Is there any reason we should still tell users not to include duplicates? I am assuming that if our unit test dataset has duplicates, then we should also be accepting duplicates, but I don't know where this test data comes from.
Hi @danb27, and thanks for your contributions! I think there is some more background info worth adding about the issues you and @miguelgfierro are discussing.

Firstly, your code uses value_counts over the input data frame. Bear in mind that SAR can be used either without timestamps (plain user-item ratings) or with timestamps (when time decay is applied).

Secondly, we chose to require that the data frame has no duplicates. In the former case, this means that (user, item) pairs are unique; in the latter, that (user, item, timestamp) triples are unique. There are a few reasons for this. One is that duplicates in the DF are almost certainly a sign of bad practice during data preparation. Another is that it is more efficient to incur the cost of aggregation/deduplication just once (during data prep) rather than every time the SAR code is run. Yet another is flexibility in defining the ratings (also called weights): in the case of duplicates, the way these should be aggregated for each user-item pair should be left to the discretion of the data scientist, since there is no single correct way of doing it; it depends on the recommendation scenario.

So there is a trade-off between people missing the requirement in the docstring vs. computational overhead for aggregation/checking for duplicates vs. flexibility. I think we can live with the former, but probably we should emphasize the requirement more, especially in the notebooks.
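On the aggregation point, a small illustrative data-prep step (made-up data); the sum here is just one scenario-dependent choice, and max or mean could be equally valid.

import pandas as pd

df = pd.DataFrame({
    "userID": [1, 1, 2],
    "itemID": [10, 10, 20],
    "rating": [1.0, 1.0, 5.0],  # e.g. two click events for (1, 10)
})
# collapse duplicate (user, item) pairs before calling SAR; the aggregation
# rule is left to the data scientist and the recommendation scenario
deduped = df.groupby(["userID", "itemID"], as_index=False)["rating"].sum()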
@anargyri Thank you for your clarification on point 1; this is very helpful. I will think about whether I can adapt my code or whether I should just change those parts back to how they were before.
So given Andreas' comment on the frequencies, I went back to the old way of calculating item_frequencies, without using value_counts, and also rewrote how we calculate user_frequencies so that it does not use the dataframe directly (this variable is only needed if the answer to question 2 below is yes). I think there are two remaining questions:
1. Do we want to warn or to throw an error when there are duplicate user-item pairs?
2. Do we want the new user-based functionality (most frequent users and similar users) at all?

If the answer to both is no, then we can probably just close this PR. @miguelgfierro @anargyri Please let me know how to proceed when you get a chance.
@danb27 sorry for not answering earlier. Answering your questions:
1. I think we should throw an error if there are duplicates; see the details on faster implementations in this comment.
2. I think this is ok. Any other perspective, @anargyri?
I think the PR is good to go, with a minor addition for Dan.
@anargyri, is there anything else?
See the comment above about avoiding the dense matrix.
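On avoiding the dense matrix, an illustrative pattern (made-up data) for taking a row's top-k from a scipy sparse matrix without calling .toarray():

import numpy as np
import scipy.sparse as sp

sim = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
row = sim.getrow(0)  # stays sparse; no dense 1000x1000 copy
# top-10 column indices among the row's nonzero entries
top_cols = row.indices[np.argsort(row.data)[::-1][:10]]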
Description

- self.item_frequencies should contain the frequencies of every item, but it currently holds a different value, generated from the diagonal of the response from self.compute_cooccurrence_matrix, which is not actually equal to the item frequencies when there are duplicates in the dataframe. The test we had written before (test_get_popularity_based_topk) was only passing because the test data has no duplicates. If you add any duplicates to the test, it will fail, and nothing in the fit() method will stop you or warn you that you are using the object incorrectly. Many of the tests we have written do include duplicates.
- Users can now get similar users using self.get_user_based_topk, similarly to how they are currently able to get similar items using self.get_item_based_topk.
- Users can now call the self.get_popularity_based_topk method with items=False to get the most popular users rather than the most popular items. The parameter defaults to True (return items, not users) for backwards compatibility.
Checklist:
- The PR is made to the staging branch and not to the main branch.

Looking forward to your feedback!