[PERF] Improve performance of duplicate ID validator #947
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
chromadb/api/types.py
Outdated
@@ -115,7 +116,8 @@ def validate_ids(ids: IDs) -> IDs:
         if not isinstance(id, str):
             raise ValueError(f"Expected ID to be a str, got {id}")
     if len(ids) != len(set(ids)):
-        dups = set([x for x in ids if ids.count(x) > 1])
+        c = Counter(ids)
Thank you for doing this! Let's change the error message to enumerate only the first 10 duplicates, as you suggested, and then add an ellipsis to truncate, so that we aren't dumping so many IDs that we overload the console. Also, we can do this in one pass if we store the result of the call to set(), as opposed to the current three passes: set(), Counter(), and dups = [...].
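One possible reading of this suggestion, as a minimal sketch rather than the code that was eventually merged: build the Counter once, compare its size to the input length to detect duplicates, and show at most ten of them. The function name `validate_ids_sketch`, the constant `MAX_SHOWN`, and the exact error wording are illustrative assumptions.

```python
from collections import Counter
from typing import List

MAX_SHOWN = 10  # hypothetical cap on how many duplicate IDs appear in the message


def validate_ids_sketch(ids: List[str]) -> List[str]:
    """Illustrative stand-in for validate_ids; not the project's actual code."""
    for id_ in ids:
        if not isinstance(id_, str):
            raise ValueError(f"Expected ID to be a str, got {id_}")

    # One O(N) pass over the input: Counter tallies each ID.
    counts = Counter(ids)

    # Fewer distinct IDs than inputs means at least one ID repeats.
    if len(counts) != len(ids):
        # This loop runs over distinct IDs only, not over the full input again.
        dups = [id_ for id_, n in counts.items() if n > 1]
        shown = ", ".join(dups[:MAX_SHOWN])
        suffix = ", ..." if len(dups) > MAX_SHOWN else ""
        # Message wording is an assumption made for this sketch.
        raise ValueError(f"Expected IDs to be unique, found duplicates of: {shown}{suffix}")
    return ids
```

With this shape the set() call is no longer needed, since len(counts) already gives the number of distinct IDs.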
Thanks for this! Can you make the suggested changes?
Hi @steve-marmalade would you prefer if I just went ahead and made those changes?
@HammadB apologies for the delay, will update shortly.
@HammadB this is ready for another look 🙏
8ad1929 to 35f380e
Amazing thank you!
@HammadB please let me know if the test failure is due to my changes, or if there's anything you need from me to address it. On a cursory look, it doesn't appear related to what I've done in this PR.
@steve-marmalade no worries, that's a flaky test I patched in a subsequent PR that has yet to land. Issue #927
Description of changes
Summarize the changes made by this PR.
Improve the performance of the validate_ids function, which looks like it might be O(N^2).
Test plan
How are these changes tested?
I tested the implementation within a Jupyter notebook on the same list of IDs that I had passed to collection.add, and it isolated the dupes in about half a second.
I pip installed this branch, tried to add the same items to the collection, and confirmed that I received the error message without a long delay.
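A check along these lines could be reproduced roughly as follows. This is a sketch on synthetic data, not the author's notebook: the ID values, list size, and helper names are assumptions. It contrasts the original ids.count approach, which rescans the whole list for every element, with the Counter approach, and confirms that both report the same duplicates.

```python
import time
from collections import Counter


def dups_quadratic(ids):
    # Original approach: ids.count(x) scans the full list for each element, so O(N^2).
    return set(x for x in ids if ids.count(x) > 1)


def dups_linear(ids):
    # Counter tallies every ID in a single pass, so O(N).
    return {x for x, n in Counter(ids).items() if n > 1}


# Synthetic input (an assumption): 3000 unique IDs plus three repeats.
ids = [f"id-{i}" for i in range(3000)] + ["id-1", "id-2", "id-3"]

start = time.perf_counter()
slow = dups_quadratic(ids)
t_slow = time.perf_counter() - start

start = time.perf_counter()
fast = dups_linear(ids)
t_fast = time.perf_counter() - start

# Timings vary by machine, so they are printed rather than asserted.
print(f"list.count: {t_slow:.4f}s, Counter: {t_fast:.4f}s")
```

The gap between the two timings widens quickly as the list grows, which is consistent with the long delay described for large calls to collection.add.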
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?
No docstrings or documentation changes are needed.