-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create avoid_openai_rate_limits.py #2019
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super cool! We've had a lot of users do this so it'll be nice to have a utility to point them to. I left some comments to help keep this code as readable and usable as possible -- excited to get this checked in!
# If the collection is too big, we will run against the 10,000 requests per minute limit, and get a rate limit error. | ||
# Call collection.add with a slice of 1000 of the data at each time. Then sleep for 45 seconds. | ||
|
||
def add_embeddings(tech_data_list_description, tech_data_list_uuid, tech_data_meta): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just call these documents
, ids
, and metadatas
? That helps keep this script applicable to as many users as possible.
|
||
while len(tech_data_list_description) > 1000: | ||
time_start = time() | ||
print(f"Start Loop: {len(tech_data_list_description)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this will just print Start Loop: 1000
, no? What if we printed indices instead, something like Starting work on items 4000 - 5000
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On that note, it might be nice to print the total number of documents to embed at the beginning of this function.
while len(tech_data_list_description) > 1000: | ||
time_start = time() | ||
print(f"Start Loop: {len(tech_data_list_description)}") | ||
collection.add( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where we initialize the collection. Don't we need to accept a collection to this function as well?
tech_data_list_description = tech_data_list_description[1000:] | ||
tech_data_list_uuid = tech_data_list_uuid[1000:] | ||
tech_data_meta = tech_data_meta[1000:] | ||
print(f"Technologies To Go: {len(tech_data_list_description)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will always print len(tech_data_list_description) - 1000
since we don't actually shorten the list anywhere. If we update the above log line to print the indices we're currently working on then I think we can remove this log line, what do you think?
tech_data_meta - The metadata. In our case will be the source of the data. | ||
''' | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a small basic sanity check to make sure the length of the three parameters is the same/
Our underlying impl has changed and so this PR is not landable as is. That being said - we'd still like to add this functionality and that is now tracked in this issue. |
Description of changes
Summarize the changes made by this PR.
New functionality
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?