Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix model download for ONNX embedder #976

Merged
merged 6 commits into from
Aug 15, 2023
Merged

Conversation

Josh-XT
Copy link
Contributor

@Josh-XT Josh-XT commented Aug 13, 2023

Description of changes

The current function is looking for the tar.gz file instead of checking if the folder already exists, so if the tar.gz gets deleted after extraction, it downloads it again.. This PR resolves this and checks for the model in the extracted folder before attempting to download or extract again.

Test plan

By using it

Documentation Changes

I didn't find any documentation about how this does the download.

@github-actions
Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readbility, Modularity, Intuitiveness)

@Josh-XT
Copy link
Contributor Author

Josh-XT commented Aug 13, 2023

For more context on the problem this solves:

In my software, I am using the ONNX embedder, but we don't want to make it download every time someone launches the container, so we changed the DOWNLOAD_PATH to our software folder and downloaded the tar.gz and extracted it in the folder in order to make it work. This PR will make it to where if you have either the tar.gz or the onnx ( or EXTRACTED_FOLDER_NAME ) folder in the DOWNLOAD_PATH location, it will continue on to initializing the model. Currently you need to have both for it to work properly.

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check if the mode is correctly downloaded, can we check for the presence of all the files in the tar and validate their integrity?

config.json
model.onnx
special_tokens_map.json
tokenizer_config.json
tokenizer.json
vocab.txt

?

There can be situations where the extraction fails part way and then the model files will partially there leading to errors.

Additionally, can we store the md5 hash of each file and verify it is correct

So logic becomes

If all the files exist
- Check their hashes
- If OK: nothing
- If not: delete the partial files, proceed to next line
If all the files don't exist / partial
- Clean up partial files if any
- Check for archive, if no archive, download it.
- Extract all files, validate hash.

@Josh-XT
Copy link
Contributor Author

Josh-XT commented Aug 14, 2023

Those sound like great ideas for another PR. This fix satisfies the need to download the model properly, additional features would make more sense in another PR, right?

@HammadB
Copy link
Collaborator

HammadB commented Aug 14, 2023

I don't think another PR 'makes more sense' - if you do not want to make those changes I am happy to make them instead.

This only half way fixes the problem, since a partial extract will still fail. I'd rather we simply address the problem properly. I'll submit a PR with the fix later this week.

@tazarov
Copy link
Contributor

tazarov commented Aug 14, 2023

@HammadB , @Josh-XT , do you think it is reasonable to do the following steps in the respective order of "difficulty"

  • Attempt to load the model from the expected path
  • If the above fails, checksum the file and extract
  • If the above fails, download the file, checksum, and extract it

@HammadB
Copy link
Collaborator

HammadB commented Aug 14, 2023

I think you have to checksum the individual files in addition to the tar because the extraction can be partial.

@tazarov
Copy link
Contributor

tazarov commented Aug 14, 2023

I think you have to checksum the individual files in addition to the tar because the extraction can be partial.

As a more "brute-force" approach, one can simply checksum tarball and extract every time. (possibly remove any model dir that may exist).

@Josh-XT
Copy link
Contributor Author

Josh-XT commented Aug 14, 2023

I just added an iteration for checking if each file exists in _download_model_if_not_exists. I didn't add anything for checksum checking. I wasn't saying that it should be another PR out of laziness, just that it isn't functionality that exists currently or would make this PR any more useful (in my opinion at least) to do the checksum part. As someone using this embedder, I'd personally prefer to not have to worry about that hash changing and me needing to scramble to update my software to accomodate changes made elsewhere.

@HammadB
Copy link
Collaborator

HammadB commented Aug 14, 2023

Thanks @Josh-XT this change makes sense to me incrementally. We can add checksumming as an extra step in a subsequent PR. Much appreciated!

@HammadB HammadB enabled auto-merge (squash) August 14, 2023 23:54
@HammadB HammadB disabled auto-merge August 15, 2023 00:53
@HammadB HammadB enabled auto-merge (squash) August 15, 2023 01:17
@HammadB HammadB merged commit f1af776 into chroma-core:main Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants