[hub_utils] Support for huggingface model hub #377

julien-c · 2020-12-04T22:05:13Z

Hi Asteroid team! I hereby propose an integration with the HuggingFace model hub 🤗🤗

In this Pull request:

The few important code changes (not too many lines) are in cached_download(), and in hf_bucket_url() to support falling back to the huggingface.co model hub if we don't know what to do with the method param.
The rest of the additions are copy/pasted from transformers and are not strictly speaking required, but should work well and cover a few additional edge cases compared to your current file downloading/caching code.
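To make the fallback described above concrete, here is a minimal sketch; it is not the code from this PR. The names `build_hub_url` and `resolve_source`, the URL template, and the default filename `pytorch_model.bin` are all assumptions chosen for illustration.

```python
from typing import Optional

# Assumed URL layout for files hosted on the hub (illustrative only).
HUB_URL_TEMPLATE = "https://huggingface.co/{model_id}/resolve/{revision}/{filename}"


def build_hub_url(
    model_id: str,
    filename: str = "pytorch_model.bin",  # assumed default weight file name
    revision: Optional[str] = None,
) -> str:
    """Build a download URL for `filename` in the hub repo `model_id`."""
    return HUB_URL_TEMPLATE.format(
        model_id=model_id, revision=revision or "main", filename=filename
    )


def resolve_source(filename_or_url: str) -> str:
    """Pass explicit URLs through unchanged; otherwise fall back to the hub."""
    if filename_or_url.startswith(("http://", "https://")):
        return filename_or_url
    # Not a URL we recognise: treat the input as a hub model id.
    return build_hub_url(filename_or_url)
```

With this layout, anything that is not recognised as an explicit URL is treated as a model id and resolved against the `main` branch by default.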

If it's too much code, feel free to remove/update some of it. We could also spin this code off into a utility library (pip install huggingface_hub?) at some point.

The unit tests mostly check that we can resolve model ids to local paths. I also added an example test that loads a model from an HF model id (I had to update the model to include a sample rate inside the .bin – see this commit – this is actually a nice example of the power of model versioning 🔥)

For reference, the model is at https://huggingface.co/julien-c/DPRNNTasNet-ks16_WHAM_sepclean

Also cc'ing @thomwolf for info!


mpariente · 2020-12-05T00:16:21Z

asteroid/utils/hub_utils.py

+        # Note to maintainers:
+        # You can remove the `return hf_get_from_cache(…)` line above
+        # if you want to keep the exact same file downloading/caching behavior
+        # as the current one. In which case, you can remove all functions below
+        # except for `hf_bucket_url`.
+        # However the implementation adds some nice features
+        # (notably versioning-aware caching) so I'd suggest keeping it.


We'll keep hf_get_from_cache because we'll drop Zenodo support as soon as our models are fully migrated to HuggingFace's hub.
And we'll add a note about this behaviour in the docstring above.
Thanks for the note!

mpariente · 2020-12-05T00:23:18Z

Thank you so much for this @julien-c, this is really exciting!

> If it's too much code, feel free to remove/update some of it. We could also spin this code off into a utility library (pip install huggingface_hub?) at some point.

This would make complete sense at some point, but the amount of copy/paste is fine, we'll keep everything.

We need to be able to maintain that code, so I'll have to read it carefully; I don't grasp everything yet 🙃

jonashaag · 2020-12-05T10:47:50Z

asteroid/utils/hub_utils.py

+        # e.g. julien-c/DPRNNTasNet-ks16_WHAM_sepclean is a valid model id
+        # and  julien-c/DPRNNTasNet-ks16_WHAM_sepclean@main supports specifying a commit/branch/tag.
+        if "@" in filename_or_url:
+            model_id = filename_or_url.split("@")[0]


Did you know model_id, revision = filename_or_url.split("@")? :) (assuming exactly 1 @)
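As a hedged illustration of the suggestion above: tuple unpacking on `split("@")` raises `ValueError` when the string contains no `@` or more than one, so a sketch that also covers the plain model id case could use `str.partition` instead. The helper name `split_model_id` is hypothetical and not part of the PR.

```python
from typing import Optional, Tuple


def split_model_id(filename_or_url: str) -> Tuple[str, Optional[str]]:
    """Split 'model_id@revision' into its parts; revision is None when absent."""
    # partition() splits at the first "@" and never raises on a missing separator.
    model_id, _, revision = filename_or_url.partition("@")
    return model_id, (revision or None)
```

For example, `split_model_id("julien-c/DPRNNTasNet-ks16_WHAM_sepclean@main")` yields the model id and `"main"`, while the same id without `@main` yields `None` as the revision.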

jonashaag · 2020-12-05T10:56:06Z

asteroid/utils/hub_utils.py

+                url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout
+            )
+            r.raise_for_status()
+            etag = r.headers.get("X-Linked-Etag") or r.headers.get("ETag")


Can you tell why the custom header?

julien-c · 2020-12-07

In the case of an LFS file (i.e. where the HTTP request does not return the file's content directly, but a redirect to a CloudFront URL), our server uses "X-Linked-Etag" to include the sha256 of the actual linked file (the large file itself).

It made more sense to us to use this hash (which we already have and don't have to compute again, as it can be very costly for super large files), but then it's not really the Etag of the redirect response itself.

We could probably document this better at some point.
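The header preference explained above can be isolated into a small helper. This is a sketch only: `pick_etag` is a hypothetical name, and it assumes the headers are available as a mapping (for instance the `headers` attribute of a `requests` response, which is case-insensitive; a plain `dict` is not).

```python
from typing import Mapping, Optional


def pick_etag(headers: Mapping[str, str]) -> Optional[str]:
    """Prefer the hash of the linked (LFS) file over the redirect's own ETag."""
    linked = headers.get("X-Linked-Etag")
    if linked:
        # LFS file: the server exposes the sha256 of the large file here.
        return linked
    # Regular file: fall back to the standard ETag header, if any.
    return headers.get("ETag")
```

Keying the cache on the value returned here means a redirecting response and the file it points to share one stable identifier.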

jonashaag · 2020-12-05T11:02:13Z

asteroid/utils/hub_utils.py

+            matching_files = [
+                file
+                for file in fnmatch.filter(os.listdir(cache_dir), filename.split(".")[0] + ".*")
+                if not file.endswith(".json") and not file.endswith(".lock")


Not a change request, just an FYI for the curious reader :-) – this would be much nicer using pathlib:

matching_files = [f.name for f in cache_dir.glob(cache_path.with_suffix(".*").name) if f.suffix not in {".json", ".lock"}]
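For readers who want to try the pathlib variant, here is a self-contained sketch. It assumes `cache_dir` is a `pathlib.Path` and keeps the original `filename.split(".")[0]` prefix logic rather than `with_suffix`, since the two differ when a name contains several dots. The function name `find_cached_files` is hypothetical.

```python
import glob
from pathlib import Path
from typing import List


def find_cached_files(cache_dir: Path, filename: str) -> List[str]:
    """List cached files sharing `filename`'s prefix, skipping metadata and lock files."""
    stem = filename.split(".")[0]
    # Escape the prefix so characters such as "[" are matched literally.
    pattern = glob.escape(stem) + ".*"
    return sorted(
        f.name
        for f in cache_dir.glob(pattern)
        if f.suffix not in {".json", ".lock"}
    )
```

The result is sorted only to make the output deterministic; the original list comprehension returns entries in directory order.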

jonashaag · 2020-12-05T11:05:58Z

Looks very good to me

jonashaag · 2020-12-05T11:08:17Z

What about uploads? 🤔

mpariente · 2020-12-05T11:46:56Z

You create a repo on the hub's page, clone it and push to it what you want, no fancy CLI required.

julien-c · 2020-12-07T10:12:18Z

re. uploads: yes. In transformers we have programmatic ability to create (and even delete) model repos, which we could also spin off into the same small utility-focused library at some point.

But the design goal is to be able to upload models using just git + git-lfs so that it's quite independent from library implementations.

The one thing that's not supported out of the box on the upload side is uploading files larger than 5GB: you need a custom LFS transfer agent, which is currently bundled with transformers (and will live in a separate library in the future). That's quite a big model size though, so it should be rare.


julien-c added 8 commits December 4, 2020 22:59

[hub_utils] Support for huggingface model hub

ce41cb8

Update hub_utils_test.py

1c81542

Test for model_id@commit

82057ee

Add note to maintainers

95a9c85

Add failing test (because of sample rate)

9a6d0f4

Re-trigger CI

3c5368b

After https://huggingface.co/julien-c/DPRNNTasNet-ks16_WHAM_sepclean/commit/d01f179c5687d1942a99394408eb426c18dfd03d

Sacrificial gift to the codecov gods

269e3a3

I hope nobody sees this 😇

caa0e34

julien-c marked this pull request as ready for review December 4, 2020 23:24

mpariente reviewed Dec 4, 2020

mpariente reviewed Dec 5, 2020

jonashaag approved these changes Dec 5, 2020

mpariente reviewed Dec 7, 2020

Update asteroid/utils/hub_utils.py

73af98a

mpariente merged commit b13741e into asteroid-team:master Dec 7, 2020

mpariente added a commit that referenced this pull request Dec 7, 2020

[hub] Fix torch.hub tests

3a458ca

Add new dependencies from #377

julien-c deleted the hf_model_hub branch December 21, 2020 16:45
