Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] Improve performance of duplicate ID validator #947

Conversation

steve-marmalade
Copy link
Contributor

@steve-marmalade steve-marmalade commented Aug 8, 2023

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Improves the performance of the duplicate ID validator. I noticed this while trying to add > 100k records to a collection. I had duplicates in my ID array (which I didn't know at the time), and I eventually terminated the operation because it was taking so long. I happened to notice in the traceback that it was stuck in the validate_ids function, which looks like it might be O(N^2).
    • This PR currently changes just the implementation. That said, it would be easy to display only the e.g. top 10 most common dupes (so as to avoid printing over 1k IDs to console output, as happened in my case 🙃 ). Let me know if you would prefer that.

Test plan

How are these changes tested?

  1. I tested the implementation within a Jupyter notebook on the same list of IDs that I had passed to collection.add, and it isolated the dupes in about half a second.
    image

  2. I pip installed this branch, tried to add the same items to the collection, and confirmed that I received the error message without a long delay.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository? No docstrings or documentation changes are needed.

@github-actions
Copy link

github-actions bot commented Aug 8, 2023

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readbility, Modularity, Intuitiveness)

@@ -115,7 +116,8 @@ def validate_ids(ids: IDs) -> IDs:
if not isinstance(id, str):
raise ValueError(f"Expected ID to be a str, got {id}")
if len(ids) != len(set(ids)):
dups = set([x for x in ids if ids.count(x) > 1])
c = Counter(ids)
Copy link
Collaborator

@HammadB HammadB Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this! Let's change the error message to just enumerate the first 10 duplicates as you suggested and then add ellipsis to truncate so that we aren't dumping potentially so many ids as to overload the console. Also we can do this in one pass if we store the call to set(), as opposed to the currently 3 passes - set(), Counter() and dups = [...]

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Can you make the suggested changes?

@HammadB
Copy link
Collaborator

HammadB commented Aug 10, 2023

Hi @steve-marmalade would you prefer if I just went ahead and made those changes?

@steve-marmalade
Copy link
Contributor Author

Hi @steve-marmalade would you prefer if I just went ahead and made those changes?

@HammadB apologies for the delay, will update shortly.

@steve-marmalade
Copy link
Contributor Author

@HammadB this is ready for another look 🙏

@steve-marmalade steve-marmalade force-pushed the performance/improve-duplicate-id-validator branch from 8ad1929 to 35f380e Compare August 14, 2023 16:58
Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing thank you!

@HammadB HammadB enabled auto-merge (squash) August 14, 2023 18:34
@HammadB HammadB changed the title Improve performance of duplicate ID validator [PERF] Improve performance of duplicate ID validator Aug 14, 2023
@steve-marmalade
Copy link
Contributor Author

@HammadB please let me know if the test failure is due to my changes, or if there's anything you need from me to address it. On a cursory look, it doesn't appear related to what I've done in this PR.

@HammadB
Copy link
Collaborator

HammadB commented Aug 14, 2023

@steve-marmalade no worries, thats a flaky test I patched in a subsequent PR that has yet to land. Issue #927

@HammadB HammadB merged commit a3d0892 into chroma-core:main Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants