-
Notifications
You must be signed in to change notification settings - Fork 429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[hub_utils] Support for huggingface model hub #377
Conversation
asteroid/utils/hub_utils.py
Outdated
# Note to maintainers: | ||
# You can remove the `return hf_get_from_cache(…)` line above | ||
# if you want to keep the exact same file downloading/caching behavior | ||
# as the current one. In which case, you can remove all functions below | ||
# except for `hf_bucket_url`. | ||
# However the implementation adds some nice features | ||
# (notably versioning-aware caching) so I'd suggest keeping it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll keep hf_get_from_cache
because we'll drop Zenodo support as soon as our models are fully migrated to HuggingFace's hub.
And we'll add a note about this behaviour in the docstring above.
Thanks for the note!
Thank you so much for this @julien-c, this is really exciting!
This would make complete sense at some point, but the amount of copy/paste is fine, we'll keep everything. We need to somehow be able to maintain that code so I'll have to read it carefully, I don't grasp everything yet 🙃 |
# e.g. julien-c/DPRNNTasNet-ks16_WHAM_sepclean is a valid model id | ||
# and julien-c/DPRNNTasNet-ks16_WHAM_sepclean@main supports specifying a commit/branch/tag. | ||
if "@" in filename_or_url: | ||
model_id = filename_or_url.split("@")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you know model_id, revision = filename_or_url.split("@")
? :) (assuming exactly 1 @
)
url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout | ||
) | ||
r.raise_for_status() | ||
etag = r.headers.get("X-Linked-Etag") or r.headers.get("ETag") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you tell why the custom header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of an LFS file (i.e. where the HTTP request does not return the file's content directly, but a redirect to a Cloudfront URL), our server uses "X-Linked-Etag"
to include the sha256
of the actual linked file (the large file itself).
It made more sense to us to use this hash (which we already have and don't have to compute again, as it can be very costly for super large files), but then it's not really the Etag of the redirect response itself.
We could probably document this better at some point.
matching_files = [ | ||
file | ||
for file in fnmatch.filter(os.listdir(cache_dir), filename.split(".")[0] + ".*") | ||
if not file.endswith(".json") and not file.endswith(".lock") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a change request but just a FYI for the curious reader :-) – this would be much nicer using pathlib
:
matching_files = [f.name for f in cache_dir.glob(cache_path.with_suffix(".*").name)
if f.suffix not in {".json", ".lock"}]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True!
Looks very good to me |
What about uploads? 🤔 |
re. uploads: yes. In But the design goal is to be able to upload models using just The one thing that's not supported out of the box on the upload side is upload of files larger than 5GB: you need a custom lfs transfer agent that's currently bundled to |
Add new dependencies from #377
Hi Asteroid team! I hereby propose an integration with the HuggingFace model hub 🤗🤗
In this Pull request:
cached_download()
, and inhf_bucket_url()
to support falling back to the huggingface.co model hub if we don't know what to do with the method param.If it's too much code, feel free to remove/update some of it. We could also spin this code off into a utility library (
pip install huggingface_hub
?) at some point.The unit tests mostly check that we can resolve model ids to local paths, and adds an example test that loads a model from a hf model id (I had to update the model to include a sample rate inside the .bin – see this commit – this is actually a nice example of the power of model versioning 🔥)
For reference, model is at https://huggingface.co/julien-c/DPRNNTasNet-ks16_WHAM_sepclean
Also cc'ing @thomwolf for info!