Fix model download for ONNX embedder #976
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
For more context on the problem this solves: In my software, I am using the ONNX embedder, but we don't want to make it download every time someone launches the container, so we changed the |
To check that the model is correctly downloaded, can we check for the presence of all the files in the tar and validate their integrity?
config.json
model.onnx
special_tokens_map.json
tokenizer_config.json
tokenizer.json
vocab.txt
There can be situations where the extraction fails partway, leaving the model files only partially there and leading to errors.
Additionally, can we store the MD5 hash of each file and verify that it is correct?
So the logic becomes:
If all the files exist
- Check their hashes
- If OK: nothing to do
- If not: delete the partial files and proceed to the steps below
If some or all of the files are missing
- Clean up partial files, if any
- Check for the archive; if there is no archive, download it
- Extract all files and validate their hashes
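For discussion, the logic above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the embedder's actual code: `ensure_model`, `files_ok`, `remove_partial`, and the `download` callback are hypothetical names, the `expected` mapping of file name to MD5 hex digest would have to be recorded somewhere first, and the archive members are assumed to sit at the top level of the tarball.

```python
import hashlib
import os
import tarfile
from typing import Callable, Dict


def file_md5(path: str) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_ok(model_dir: str, expected: Dict[str, str]) -> bool:
    """True only if every expected file exists and its hash matches."""
    for name, md5 in expected.items():
        path = os.path.join(model_dir, name)
        if not os.path.isfile(path) or file_md5(path) != md5:
            return False
    return True


def remove_partial(model_dir: str, expected: Dict[str, str]) -> None:
    """Delete whichever expected files are present from an earlier attempt."""
    for name in expected:
        path = os.path.join(model_dir, name)
        if os.path.isfile(path):
            os.remove(path)


def ensure_model(
    model_dir: str,
    archive_path: str,
    expected: Dict[str, str],      # hypothetical: file name -> MD5 hex digest
    download: Callable[[str], None],  # hypothetical: writes the archive to a path
) -> None:
    # Case 1: all files exist and their hashes check out, so nothing to do.
    if files_ok(model_dir, expected):
        return
    # Case 2: files are missing or corrupt. Clean up, then rebuild.
    remove_partial(model_dir, expected)
    if not os.path.exists(archive_path):
        download(archive_path)
    os.makedirs(model_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(model_dir)
    if not files_ok(model_dir, expected):
        raise RuntimeError("model files failed verification after extraction")
```

Note that a corrupt file and a missing file take the same path here, which keeps the two cases from the list above in a single code path.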
Those sound like great ideas for another PR. This fix satisfies the need to download the model properly; additional features would make more sense in another PR, right?
I don't think another PR 'makes more sense'. If you do not want to make those changes, I am happy to make them instead. This only halfway fixes the problem, since a partial extract will still fail. I'd rather we simply address the problem properly. I'll submit a PR with the fix later this week.
I think you have to checksum the individual files in addition to the tar, because the extraction can be partial.
As a more "brute-force" approach, one can simply checksum the tarball and extract every time (possibly removing any model dir that may exist).
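A rough sketch of that brute-force variant, for comparison. The names `archive_md5` and `extract_fresh` are hypothetical, and the expected tarball digest is assumed to be known ahead of time; none of this is taken from the existing code.

```python
import hashlib
import os
import shutil
import tarfile


def archive_md5(path: str) -> str:
    """Return the hex MD5 digest of the tarball, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_fresh(archive_path: str, model_dir: str, expected_md5: str) -> None:
    """Verify the tarball, then wipe the model dir and extract from scratch."""
    if archive_md5(archive_path) != expected_md5:
        raise ValueError("archive checksum mismatch; download it again")
    # Remove any previous (possibly partial) extraction before re-extracting.
    shutil.rmtree(model_dir, ignore_errors=True)
    os.makedirs(model_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(model_dir)
```

The trade-off is that the tarball has to stay on disk and every launch pays the extraction cost, which is the situation the container use case in this PR is trying to avoid.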
I just added an iteration for checking if each file exists in |
Thanks @Josh-XT, this change makes sense to me incrementally. We can add checksumming as an extra step in a subsequent PR. Much appreciated!
Description of changes
The current function looks for the tar.gz file instead of checking whether the extracted folder already exists, so if the tar.gz is deleted after extraction, it downloads it again. This PR fixes that by checking for the model in the extracted folder before attempting to download or extract again.
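As an illustration of the check described above (not the actual diff), the decision could look something like this. `missing_files` and `next_step` are hypothetical helper names, and the file list is the one quoted in the review comments, so treat it as an assumption about the archive contents.

```python
import os
from typing import List, Sequence

# File names as listed in the review discussion; illustrative only.
MODEL_FILES = (
    "config.json",
    "model.onnx",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.txt",
)


def missing_files(model_dir: str, names: Sequence[str] = MODEL_FILES) -> List[str]:
    """Return the expected files that are not present in the extracted folder."""
    return [n for n in names if not os.path.isfile(os.path.join(model_dir, n))]


def next_step(model_dir: str, archive_path: str) -> str:
    """Decide what to do based on the extracted folder first, then the archive."""
    if not missing_files(model_dir):
        return "ready"      # extracted model is present; the tar.gz is not needed
    if os.path.isfile(archive_path):
        return "extract"    # archive is on disk; extract again
    return "download"       # neither is usable; download the archive
```

With this ordering, deleting the tar.gz after a successful extraction no longer triggers a new download. It checks existence only, so a truncated file would still pass.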
Test plan
By using it
Documentation Changes
I didn't find any documentation about how this does the download.