[hub_utils] Support for huggingface model hub #377

Merged (9 commits) on Dec 7, 2020

@julien-c (Contributor) commented Dec 4, 2020

Hi Asteroid team! I hereby propose an integration with the HuggingFace model hub 🤗🤗

In this Pull request:

  • The few important code changes (not too many lines) are in cached_download(), and in hf_bucket_url() to support falling back to the huggingface.co model hub if we don't know what to do with the method param.
  • The rest of the additions are copy/pasted from transformers and are not strictly speaking required, but should work well and cover a few additional edge cases compared to your current file downloading/caching code.
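
To make the fallback in the first bullet concrete, here is a minimal sketch of how an unrecognized input could be turned into a huggingface.co URL. It is not the code from this PR: `hub_url_sketch` and `resolve_source_sketch` are hypothetical names, and the URL layout, the default file name, and the `main` default revision are assumptions made for illustration.

```python
import os

# Assumed URL layout for files hosted on huggingface.co (illustrative only).
HF_URL_TEMPLATE = "https://huggingface.co/{model_id}/resolve/{revision}/{filename}"
# Assumed default weights file name; the PR text only mentions "the .bin".
DEFAULT_FILENAME = "pytorch_model.bin"


def hub_url_sketch(model_id, filename=DEFAULT_FILENAME, revision=None):
    """Build a download URL for one file of a hub model repo (hypothetical helper)."""
    return HF_URL_TEMPLATE.format(
        model_id=model_id, revision=revision or "main", filename=filename
    )


def resolve_source_sketch(filename_or_url):
    """Classify the input and return (kind, location).

    kind is "local", "url" or "hub". Anything that is neither an existing
    file nor an http(s) URL falls back to the model hub.
    """
    if os.path.isfile(filename_or_url):
        return "local", filename_or_url
    if filename_or_url.startswith(("http://", "https://")):
        return "url", filename_or_url
    return "hub", hub_url_sketch(filename_or_url)
```

With these assumptions, `resolve_source_sketch("julien-c/DPRNNTasNet-ks16_WHAM_sepclean")` is classified as `"hub"` and mapped to a URL under huggingface.co, while an explicit `https://` link is passed through untouched.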

If it's too much code, feel free to remove/update some of it. We could also spin this code off into a utility library (pip install huggingface_hub?) at some point.

The unit tests mostly check that we can resolve model ids to local paths, and I added an example test that loads a model from a hf model id (I had to update the model to include a sample rate inside the .bin – see this commit – this is actually a nice example of the power of model versioning 🔥)

For reference, model is at https://huggingface.co/julien-c/DPRNNTasNet-ks16_WHAM_sepclean

Also cc'ing @thomwolf for info!

@julien-c marked this pull request as ready for review December 4, 2020 23:24
Comment on lines 75 to 81
# Note to maintainers:
# You can remove the `return hf_get_from_cache(…)` line above
# if you want to keep the exact same file downloading/caching behavior
# as the current one. In which case, you can remove all functions below
# except for `hf_bucket_url`.
# However the implementation adds some nice features
# (notably versioning-aware caching) so I'd suggest keeping it.
Collaborator

We'll keep hf_get_from_cache because we'll drop Zenodo support as soon as our models are fully migrated to HuggingFace's hub.
And we'll add a note about this behaviour in the docstring above.
Thanks for the note!
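
For readers wondering what "versioning-aware caching" could look like in practice, here is a minimal sketch. It assumes the cache file name is derived from a hash of the URL plus a hash of the ETag, so a new revision of the same file gets a new cache entry. `cache_filename_sketch` is a hypothetical name, and the exact scheme used in this PR may differ.

```python
from hashlib import sha256


def cache_filename_sketch(url, etag=None):
    """Derive a cache file name from the URL and, if known, the ETag.

    The part before the dot identifies the URL; the part after it
    identifies the version of the content behind that URL.
    """
    name = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        name += "." + sha256(etag.encode("utf-8")).hexdigest()
    return name
```

Under this assumption, two downloads of the same URL with different ETags share the same prefix but end up in different files, which is what lets a cache keep several versions side by side.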

@mpariente
Collaborator

Thank you so much for this @julien-c, this is really exciting!

> If it's too much code, feel free to remove/update some of it. We could also spin this code off into a utility library (pip install huggingface_hub?) at some point.

This would make complete sense at some point, but the amount of copy/paste is fine; we'll keep everything.

We need to be able to maintain that code somehow, so I'll have to read it carefully. I don't grasp everything yet 🙃

# e.g. julien-c/DPRNNTasNet-ks16_WHAM_sepclean is a valid model id
# and julien-c/DPRNNTasNet-ks16_WHAM_sepclean@main supports specifying a commit/branch/tag.
if "@" in filename_or_url:
    model_id = filename_or_url.split("@")[0]
Collaborator

Did you know model_id, revision = filename_or_url.split("@")? :) (assuming exactly 1 @)
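
As a small illustration of the point above: tuple unpacking of `split("@")` raises a `ValueError` unless there is exactly one `@`. A sketch that also covers the no-revision case could use `str.partition` instead. `split_model_id` is a hypothetical helper name, not something from this PR.

```python
def split_model_id(filename_or_url):
    """Split "model_id@revision" into (model_id, revision).

    Returns revision=None when no "@" is present. With several "@",
    everything after the first one is treated as the revision.
    """
    model_id, sep, revision = filename_or_url.partition("@")
    return model_id, (revision if sep else None)
```

For example, `split_model_id("julien-c/DPRNNTasNet-ks16_WHAM_sepclean@main")` gives the model id and `"main"`, and the same call without `@main` gives the model id and `None`.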

    url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout
)
r.raise_for_status()
etag = r.headers.get("X-Linked-Etag") or r.headers.get("ETag")
Collaborator

Can you explain why the custom header is needed?

Contributor Author

In case of an LFS file (i.e. where the HTTP request does not return the file's content directly, but a redirect to a CloudFront URL), our server uses "X-Linked-Etag" to include the sha256 of the actual linked file (the large file itself).

It made more sense to us to use this hash (which we already have and don't have to compute again, as that can be very costly for very large files), but then it's not really the ETag of the redirect response itself.

We could probably document this better at some point.
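
Until that is documented, here is a minimal sketch of the header preference described above, written against a plain dict so it runs without a network call. `pick_etag` is a hypothetical name. The lower-casing is only needed because a plain dict, unlike the headers object of an HTTP client, is case-sensitive.

```python
def pick_etag(headers):
    """Prefer "X-Linked-Etag" (hash of the linked LFS file) over "ETag".

    Returns None when neither header is present.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get("x-linked-etag") or lowered.get("etag")
```

For a regular (non-LFS) file only `ETag` is expected to be present, so the function falls through to it; for a redirecting LFS response the linked file's hash wins.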

matching_files = [
    file
    for file in fnmatch.filter(os.listdir(cache_dir), filename.split(".")[0] + ".*")
    if not file.endswith(".json") and not file.endswith(".lock")
Collaborator

Not a change request but just an FYI for the curious reader :-) – this would be much nicer using pathlib:

matching_files =  [f.name for f in cache_dir.glob(cache_path.with_suffix(".*").name)
                   if f.suffix not in {".json", ".lock"}]

Contributor Author

True!
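
For completeness, a self-contained version of the pathlib idea could look like the sketch below. `matching_cached_files` is a hypothetical wrapper name; it assumes `cache_dir` is a directory path and `filename` is the cached file's name, and it sorts the result only to make the output deterministic.

```python
from pathlib import Path


def matching_cached_files(cache_dir, filename):
    """List cached files that share the stem of `filename`.

    Metadata (".json") and lock (".lock") files are skipped.
    """
    cache_dir = Path(cache_dir)
    # "abc.def" -> "abc.*": replace the last suffix with a wildcard.
    pattern = Path(filename).with_suffix(".*").name
    return sorted(
        f.name for f in cache_dir.glob(pattern) if f.suffix not in {".json", ".lock"}
    )
```

One difference from the `fnmatch` version is worth keeping in mind: `with_suffix` replaces only the last suffix, whereas `filename.split(".")[0]` cuts at the first dot, so the two agree only for names with at most one dot.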

@jonashaag
Collaborator

Looks very good to me

@jonashaag
Collaborator

What about uploads? 🤔

@mpariente
Collaborator

You create a repo on the hub's page, clone it, and push whatever you want to it; no fancy CLI required.

@julien-c
Contributor Author

julien-c commented Dec 7, 2020

re. uploads: yes. In transformers we have programmatic ability to create (and even delete) model repos, which we could also spin off into the same small utility-focused library at some point.

But the design goal is to be able to upload models using just git + git-lfs so that it's quite independent from library implementations.

The one thing that's not supported out of the box on the upload side is uploading files larger than 5GB: you need a custom lfs transfer agent that's currently bundled with transformers (and in the future will live in a separate library). That's quite a big model size though, so it should be rare.

@mpariente merged commit b13741e into asteroid-team:master Dec 7, 2020
mpariente added a commit that referenced this pull request Dec 7, 2020
Add new dependencies from #377
@julien-c deleted the hf_model_hub branch December 21, 2020 16:45