Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

danb27/sar_single_node_improvements: allow for getting most frequent … #1666

Merged
merged 16 commits into from
Apr 11, 2022
Merged

danb27/sar_single_node_improvements: allow for getting most frequent … #1666

merged 16 commits into from
Apr 11, 2022

Conversation

danb27
Copy link
Contributor

@danb27 danb27 commented Mar 9, 2022

Description

  1. Bug fix: The documentation says that self.item_frequencies should contain the frequencies of every item, but in fact, it currently shows a different value, generated by using the diagonal of the response from self.compute_cooccurrence_matrix, which is not actually equal to item frequencies when there are duplicates in the dataframe. The test we had written before (test_get_popularity_based_topk) was only passing because the test it has no duplicates. If you add any duplicates to the test, it will fail, and nothing in the fit() method will stop you or warn you that you are using the object incorrectly. Many of the tests we have written do include duplicates also.
  2. Improvement: Allow users to get similar users using the new method, self.get_user_based_topk, similarly to how users are currently able to get similar items using self.get_item_based_topk
  3. Improvement: Allow users to use the self.get_popularity_based_topk method with items=False to get the most popular users rather than the most popular items. The parameter defaults to True (return items, not users) for backwards compatibility
  4. Add tests for 1-3

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • This PR is being made to staging branch and not to main branch.

Looking forward to your feedback!

…users, similar users, and fix self.item_frequencies to actually use frequencies as stated in documentation
@danb27
Copy link
Contributor Author

danb27 commented Mar 9, 2022

the issue (I'm no longer sure that it is a bug) with the frequencies was introduced here: ##1588

The problem happens when there are duplicate user-item pairs in the dataset. A note was added to the docstring telling the user not to provide duplicates, but there is no warning or error thrown if duplicates are present. If there were an error thrown, then a lot of our current tests would fail also.

Without any warning or without dropping duplicates, you can get weird results like this when the user DOES have duplicates:

image

I added a warning if there are duplicates found in the dataset, but I think we can still improve by changing how we calculate frequency to be unrelated to the cooccurrence matrix (just count occurrences in the user provided dataframe).

Any thoughts?

@codecov-commenter
Copy link

codecov-commenter commented Mar 10, 2022

Codecov Report

Merging #1666 (7038a14) into staging (b7d79ab) will increase coverage by 23.33%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##           staging    #1666       +/-   ##
============================================
+ Coverage     0.00%   23.33%   +23.33%     
============================================
  Files           88       87        -1     
  Lines         9121     9132       +11     
============================================
+ Hits             0     2131     +2131     
- Misses           0     7001     +7001     
Flag Coverage Δ
nightly ?
pr-gate 23.33% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
recommenders/models/sar/sar_singlenode.py 97.22% <100.00%> (+97.22%) ⬆️
recommenders/datasets/mind.py 0.00% <0.00%> (ø)
recommenders/datasets/movielens.py 66.37% <0.00%> (+66.37%) ⬆️
recommenders/utils/python_utils.py 97.50% <0.00%> (+97.50%) ⬆️
recommenders/datasets/download_utils.py 90.00% <0.00%> (+90.00%) ⬆️
recommenders/models/newsrec/io/mind_iterator.py 0.00% <0.00%> (ø)
recommenders/models/newsrec/models/base_model.py 0.00% <0.00%> (ø)
recommenders/__init__.py

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b7d79ab...7038a14. Read the comment docs.

Copy link
Collaborator

@miguelgfierro miguelgfierro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danb27 this is fantastic work, I added some questions and suggestions. I'll add more people to review because the work you are doing is super good.

Something I would like to ask Dan, please be patient with the review process, SAR is a key algo for us that has a lot of implications, so it will take us some time to review

recommenders/models/sar/sar_singlenode.py Outdated Show resolved Hide resolved
"""
if sum(df[[self.col_user, self.col_item]].duplicated()) > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danb27 @anargyri @simonzhaoms @loomlike @gramhagen hey can we have a discussion on whether we want to have a warning or throw an error when there are duplicates? What are your perspectives?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either option is better than just having a docstring. If there is a use case for running SAR with duplicates, then I think warning is best. But, if we believe that the algorithm should NEVER see duplicates, then I would choose error.

If we do choose to go with error, we will need to fix multiple tests. If we go with warning, we don't need to fix the tests (but still probably should IMO). I was just hesitant to change a bunch of tests without first having this discussion.

Let me know what you all think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have duplicates, will we be modifying artificially the coocurrence matrix? if that's the case, then we can throw an error.
FYI @anargyri

Copy link
Contributor Author

@danb27 danb27 Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So currently, yes duplicates will give us weird results in the cooccurrence matrix. We can fix this by simply removing duplicates from a copy of the dataframe before calculating cooccurrence matrix like this:

df = df.copy().drop_duplicates(subset=[self.col_user_id, self.col_item_id])

that way, the user can provide duplicates, it wont harm our cooccurrence matrix, our user and item frequencies will still be accurate (including duplicates, if there are any) because of my new way of retrieving the frequencies, and all the unit tests will still pass.

If we do not do this, then the problem is that all unit tests that use this dataset: https://github.com/microsoft/recommenders/blob/main/tests/conftest.py#L96 will fail, which seems very suspicious to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danb27 for this I would say that it is better to check whether there are duplicates, and if so, throw an error.
In my experience df.duplicated().any() is faster than any(df.duplicated()) but not sure if your implementation is faster. In sum(df[[self.col_user, self.col_item]].duplicated()) I would consider also ratings column. I can see a case where the ratings are related to clicks and purchase, where both signals are valid, so we want to keep both.

Copy link
Contributor Author

@danb27 danb27 Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miguelgfierro The csv file with duplicates is located at https://recodatasets.z20.web.core.windows.net/sarunittest/demoUsage.csv

I tested locally after deduping with this code:

df = df.sort_values(
    "Timestamp", ascending=False
)
df = df.drop_duplicates(
    ["UserId", "MovieId", "Timestamp"],
    keep="first"
)

which is enough to pass all of the unit tests with the code I just pushed. The tests will fail for now, but once you get the deduped version up in storage I can swap the link and everything should be ready to go.

alternatively, I could just add the above code when we load the data here: https://github.com/microsoft/recommenders/blob/main/tests/conftest.py#L139 or to the tests that use the data, but I think those options would be worse because in my mind the test dataset should be ready to go. Do you have a preference?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you go: https://recodatasets.z20.web.core.windows.net/sarunittest/demoUsageNoDups.csv

Agree with you, probably better to use directly the data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting this error: urllib.error.HTTPError: HTTP Error 404: The requested content does not exist. with the above link

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't set the right permissions. Now it should work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests passed locally! PR is hopefully good to go now after latest commit.

recommenders/models/sar/sar_singlenode.py Outdated Show resolved Hide resolved
Comment on lines 493 to 505
def get_user_based_topk(self, users, top_k=10, sort_top_k=True):
"""Get top K similar users to provided seed users based on similarity metric defined.
This method will take a set of users and use them to recommend the most similar users to that set
based on the similarity matrix fit during training.

Args:
users (pandas.DataFrame): DataFrame with user, item (optional), and rating (optional) columns
top_k (int): number of top users to recommend
sort_top_k (bool): flag to sort top k results

Returns:
pandas.DataFrame: similar users to the users provided in users
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question @danb27, the computation above of the user coocurrence is only used when we want to use the method get_user_based_topk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct. in fit(), if we set compute_user_similarity=True, user_cooccurrence will be calculated and used to create self.user_similarity, which is required to run get_user_based_topk.

I went back and forth on whether to compute user similarity always, but decided to create the compute_user_similarity parameter so that the user could keep performance the same as before the PR if they do not require the added functionality.

Happy to have my mind changed if you all would prefer to have this computation occur every time so that you can always call any of the methods, regardless of how you fit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would be the change required and the performance difference if we do all the computation of user_coocurrence inside the function get_user_based_topk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, we would calculate both user_coocurrence and self.user_similarity (they will be different unless 'coocurrence' is the similarity type) the first time get_user_based_topk is used by checking if self.user_similarity is None, but there are a couple problems with that:

  • the first time you call the method, it would take a lot longer than the subsequent times* (with a non-trivial dataset it can take some time to calculate user similarity)
  • the behavior of the first call to the method would be significantly slower than calling the get_item_based_topk method, this might be perceived as a bug.
  • I am envisioning the use case (which is my current use case) where someone wants to store similar users for every user in their dataset, so it is necessary to call this method n_users times.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am envisioning the use case (which is my current use case) where someone wants to store similar users for every user in their dataset, so it is necessary to call this method n_users times.

Ideally, it would be great that given a list of user_ids (or DF), the system returns a DF (or list of list, whatever) but the top k most similar users for each input user_id in the list. It looks that is not happening, right?

Some questions:

  1. How can we create this result?
  2. Right now we can input a list of user_ids, what is the meaning of the output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I call get_item_based_topk() like this:

for item_id in item_ids:
    temp_df = pd.DataFrame({'item_id': [item]}, dtype=DTYPE)
    similar_items_df = model.get_item_based_topk(temp_df, k=K)
    ...

With my PR, I would be able to do something very similar for get_user_based_topk()

We could implement this as a different method, because what currently occurs if you give this method a list of user ids is the following:

  1. From the docstring of get_item_based_topk(): " This method will take a set of items and use them to recommend the most similar items to that set based on the similarity matrix fit during training. This allows recommendations for cold-users (unseen during training), note - the model is not updated." The same is true for the new get_user_based_topk() method, it returns the most similar users given a set of other users.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be valuable to have both item and user similar. Given a list of items, I would like to get a list of the most similar ones for each item, and given a list of users, I would like to get a list of the most similar users for each user.

@danb27
Copy link
Contributor Author

danb27 commented Mar 15, 2022

@miguelgfierro Thanks for the initial review! I agree with all of your comments and will push a commit once we resolve some of the open questions with yourself and the other engineers.

re: needing time to review - No problem at all! I am a big fan of the package and want to make sure we get this right!

"""
if sum(df[[self.col_user, self.col_item]].duplicated()) > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have duplicates, will we be modifying artificially the coocurrence matrix? if that's the case, then we can throw an error.
FYI @anargyri

tests/unit/recommenders/models/test_sar_singlenode.py Outdated Show resolved Hide resolved
Comment on lines 493 to 505
def get_user_based_topk(self, users, top_k=10, sort_top_k=True):
"""Get top K similar users to provided seed users based on similarity metric defined.
This method will take a set of users and use them to recommend the most similar users to that set
based on the similarity matrix fit during training.

Args:
users (pandas.DataFrame): DataFrame with user, item (optional), and rating (optional) columns
top_k (int): number of top users to recommend
sort_top_k (bool): flag to sort top k results

Returns:
pandas.DataFrame: similar users to the users provided in users
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am envisioning the use case (which is my current use case) where someone wants to store similar users for every user in their dataset, so it is necessary to call this method n_users times.

Ideally, it would be great that given a list of user_ids (or DF), the system returns a DF (or list of list, whatever) but the top k most similar users for each input user_id in the list. It looks that is not happening, right?

Some questions:

  1. How can we create this result?
  2. Right now we can input a list of user_ids, what is the meaning of the output?

tests/unit/recommenders/models/test_sar_singlenode.py Outdated Show resolved Hide resolved
@danb27
Copy link
Contributor Author

danb27 commented Mar 21, 2022

@miguelgfierro Sorry for the back and forth, but your comments made me realize a much simpler + better way to achieve what I was trying to do. Here is a summary of what the PR does now:

  • change how we calculate user + item frequencies to come directly from the user's data
  • remove need for user to drop duplicates ahead of calling .fit(). The issue was when we calculated item cooccurrences and there were duplicates, but we can instead just temporarily drop any duplicates in the user dataframe before calculating cooccurrences. this ensures that our unit tests are still working AND our cooccurrences are accurate
  • allow someone to retrieve the most frequent users, the same way we can already retrieve the most popular items
  • allow someone to retrieve the k most similar users based on our user affinity matrix. This allows ratings to be factored into the results of grabbing the most similar users and also avoids the weird situation I had created where we needed to calculate user cooccurrences if we wanted to call a specific method.

I think these changes are a lot better and hopefully make the PR more palatable.

What do you think?

Is there any reason we should still tell users to not include duplicates? I am assuming that if our unit test dataset has duplicates, then we should also be accepting duplicates, but I don't know where this test data comes from.

@anargyri
Copy link
Collaborator

Hi @danb27 and thanks for your contributions! I think there is some more background info about the issues you and @miguelgfierro are discussing.

  1. About the item frequencies, there are two cases we provide for: data without timestamps (just users, items and ratings) and data with timestamps. In the latter case, i.e. when the data scientist enables the time decay flag, we take the time decay into account for computing item frequencies, see https://github.com/microsoft/recommenders/blob/c4435a9af5836f3d472cfa44b312841a8121923c/recommenders/models/sar/sar_singlenode.py#L231 and afterwards.

Firstly, your code uses value_counts() for computing the item frequencies, which makes sense in the former case but not in the latter.

Secondly, we chose to require that the data frame has no duplicates. In the former case, this means that (users, items) pairs are unique, in the latter that (users, items, timestamps) are unique.

There are a few reasons for this. One is that having duplicates in the DF almost certainly is a sign of bad practice during data preparation. The other is that it is more efficient to incur the cost of aggregation / deduplication just once (during data prep) and not every time the SAR code is run. Yet another is flexibility for defining the ratings (also called weights); in the case of duplicates, the way these should be aggregated for each user - item should be left to the discretion of the data scientist, since there is no single correct way of doing this -- it depends on the recommendation scenario.

So there is a trade-off between people missing the requirement in the doc string vs. computational overhead for aggregation / checking for duplicates vs. flexibility. I think we can live with the former, but probably we should emphasize the requirement more, especially in the notebooks.

  1. I am curious, is there a recommendations use case for this feature? Similar items is a very common scenario that can be found in many e-commerce websites, but I have never seen a scenario with similar users. Have you encountered this in practice? Also bear in mind that in almost all recommendation scenarios the users are anonymized and there should not be any user information leaking to other users.

  2. Similarly as above, why would users want to see "popular users" and is this allowed anyway?

@danb27
Copy link
Contributor Author

danb27 commented Mar 22, 2022

@anargyri Thank you for your clarification on point 1. This is very helpful. I will think about if I can adapt my code or if I should just change those parts back to how they were before.

  1. Of course this could be misused if you are leaking information to other users. But I believe there are legitimate use cases also. For example, you want to make sure that your model has some overlap in the recommendations for similar users, but that similar users are not getting the exact same recommendations. Please note that this sort of functionality is featured within recommenders for LightFM, see the LightFM notebook and the code

  2. You might want to look at how your model performs for users with a lot of data points vs. very few to understand your model biases and at what number of datapoints you know enough about a user to make good recommendations.

@danb27
Copy link
Contributor Author

danb27 commented Mar 24, 2022

So given Andreas' comment on the frequencies, I went back to the old way of calculating item_frequencies, without using value_counts, and also rewrote how we calculate user_frequencies to not use the dataframe directly (this variable is only needed if the answer to question 2 below is yes).

I think there are two remaining questions:

  1. do we want to have some sort of warning / error if a data scientist does provide duplicates?
  2. are we okay with providing these methods for grabbing the most frequent users and grabbing similar users, similar to the functionality provided in the LightFM code? Of course these could be misused, but I think there are legitimate uses for them as well.

If the answer to both is no, then we can probably just close this PR.

@miguelgfierro @anargyri Please let me know how to proceed when you get a chance.

@miguelgfierro
Copy link
Collaborator

@danb27 sorry for not answering earlier. Answering your questions:

do we want to have some sort of warning / error if a data scientist does provide duplicates?

I think we should throw an error if there are duplicates, see details of faster implementations in this comment

are we okay with providing these methods for grabbing the most frequent users and grabbing similar users, similar to the functionality provided in the LightFM code? Of course these could be misused, but I think there are legitimate uses for them as well.

I think this is ok.

Any other perspective @anargyri?

Copy link
Collaborator

@miguelgfierro miguelgfierro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the PR is good to go, minor addition for Dan.

@anargyri is there anything else?

@miguelgfierro miguelgfierro requested a review from yueguoguo as a code owner April 8, 2022 08:27
Copy link
Collaborator

@anargyri anargyri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment above about avoiding the dense matrix.

@miguelgfierro miguelgfierro merged commit 4106333 into recommenders-team:staging Apr 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants