Redistribute datasets and models on Hugging Face #1073

Open
adamjstewart opened this issue Jan 31, 2023 · 12 comments
Labels
datasets (Geospatial or benchmark datasets), models (Models and pretrained weights)

Comments

@adamjstewart
Collaborator

Summary

We should consider redistributing as many datasets and pre-trained models as we can on Hugging Face.

Rationale

Hugging Face provides a more reliable centralized repository for storing large binary files. It's a large company, so we don't have to worry about expired SSL certificates or servers going offline. We have full control over the files we upload, so we can make modifications (license permitting) to fix inconsistencies between model architectures.

It also provides significantly faster download speeds compared to similar sites. For example, for our ResNet-50 pre-trained weights (~100 MB):

  • Zenodo: 2 min, 45 sec
  • Hugging Face: 8 sec

For the EuroSAT dataset (~2 GB):

  • DFKI: 4 min, 11 sec
  • Hugging Face: 3 min, 7 sec
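
As a rough sanity check, the timings above can be converted into approximate throughput. This is a back-of-the-envelope sketch: the sizes are the approximate figures quoted above (treated here as 100 MB and 2000 MB, which is an assumption), and the helper name `throughput_mb_per_s` is made up for illustration.

```python
def throughput_mb_per_s(size_mb: float, minutes: int, seconds: int) -> float:
    """Approximate throughput in MB/s for a download of the given size and duration."""
    return size_mb / (minutes * 60 + seconds)


# Sizes are approximations taken from the text (~100 MB and ~2 GB), not measured values.
timings = {
    "ResNet-50 weights": {"Zenodo": (100, 2, 45), "Hugging Face": (100, 0, 8)},
    "EuroSAT": {"DFKI": (2000, 4, 11), "Hugging Face": (2000, 3, 7)},
}

for artifact, hosts in timings.items():
    for host, (size_mb, minutes, seconds) in hosts.items():
        rate = throughput_mb_per_s(size_mb, minutes, seconds)
        print(f"{artifact} via {host}: ~{rate:.1f} MB/s")
```

Under these assumptions, the ResNet-50 download is roughly 20 times faster from Hugging Face (165 s vs. 8 s), while the EuroSAT download is about 1.3 times faster (251 s vs. 187 s).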

Implementation

First, we need to ensure that the dataset or model we are redistributing has a license that permits redistribution. If a license is missing or does not permit redistribution, we should reach out to the authors to see if a permissive license can be granted.

Once licensing is settled, we just need to upload the dataset or model to Hugging Face. The license chosen should match the original license. Any modifications from the original should be clearly documented, and a link should be added to the original source. Many licenses require this, and it is good practice in any case.

Finally, the URL (and possibly MD5) in TorchGeo should be updated to point to the new download location.
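
The last step could look roughly like the sketch below: build the new download URL and recompute the MD5 of the uploaded file. This is an illustration under assumptions, not TorchGeo's actual code. The helper names `hf_resolve_url` and `md5_of_file` are hypothetical, and the URL layout assumed is the Hub's `resolve` pattern (with a `datasets/` prefix for dataset repositories).

```python
import hashlib


def hf_resolve_url(repo_id: str, filename: str, revision: str = "main",
                   repo_type: str = "model") -> str:
    """Build a direct-download URL for a file hosted on the Hugging Face Hub.

    Assumes the Hub's `resolve` URL layout; dataset repos get a `datasets/` prefix.
    """
    prefix = "datasets/" if repo_type == "dataset" else ""
    return f"https://huggingface.co/{prefix}{repo_id}/resolve/{revision}/{filename}"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `hf_resolve_url("example-org/example-dataset", "example.zip", repo_type="dataset")` (a made-up repository and file) yields a URL under `https://huggingface.co/datasets/`. Setting `revision` to a commit hash instead of `main` would keep the URL and checksum in sync if the file is modified later.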

Alternatives

We previously used Zenodo for this but download speeds were abysmal. A quick survey of UIUC AI PhD students found that everyone uses Hugging Face 🤗

Additional information

We already have quite a lot of datasets, and dataset authors are often unresponsive to these kinds of inquiries. It's likely unrealistic to expect that we'll be able to redistribute every dataset and model, so I won't start a checklist just yet. High-priority candidates are datasets and models with any of the following problems:

  • Unable to automatically download
  • Unreliable SSL certificates
  • Very large datasets that flake out during download
  • Slow download speeds

Again, we have to check the license first. Many datasets cannot be automatically downloaded for legal reasons.

adamjstewart added the datasets and models labels Jan 31, 2023
@adamjstewart
Collaborator Author

adamjstewart commented Jan 31, 2023

Starting a work-in-progress list so that multiple people don't contact the same person.

Datasets

In-progress

Source  | License      | Reason
USAVars | Not sure yet | Slow and failing download

Completed

Source    | License       | Reason
EuroSAT   | EU Law        | Expired SSL certificate
UC Merced | public domain | HTTP-only

Models

In-progress

Source | License | Reason

Completed

Source     | License    | Reason
Zhu Lab    | CC-BY-4.0  | Required modifications
ServiceNow | Apache-2.0 | Required modifications

@calebrob6
Member

I think DynamicEarthNet is re-distributable (based on a conversation with @lukaskondmann)

@lukaskondmann

This is correct. DynamicEarthNet is available under this license, so redistribution is possible as long as attribution is given.

@calebrob6
Member

So2Sat is okay to be mirrored based on #388.

@adamjstewart
Collaborator Author

From email conversations, OSCD and HRSCD both have Creative Commons Attribution (CCA) licenses, which freely allow redistribution.

ReforesTree may require permission from the authors. They have a shared data agreement with WWF. They were able to redistribute on Zenodo, but we should check back with them to see if we can redistribute on Hugging Face.

@nilsleh
Collaborator

nilsleh commented Feb 21, 2023

@calebrob6 I would like to redistribute the USAVars dataset if possible, because the download is super slow and has failed several times. However, I am not sure what the actual source of this dataset is, since it is only a reproduction. I saw that you have a repo for the paper, so I'm wondering if you know something about the source and license of the torchgeo USAVars dataset?

@calebrob6
Member

Hey @nilsleh, yes, I helped create that dataset. We should definitely move it to HuggingFace. @estherrolf is soon going to make changes to the dataset so perhaps we can do that all together.

@yeelauren

I would also like to +1 this. I've been having a ton of issues accessing model weights and files from Radiant Earth and I suspect they are no longer actively maintaining their endpoints.

@adamjstewart
Collaborator Author

Hugging Face has a maximum individual file size of 50 GB 😢
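
One practical consequence is that files above the limit need to be identified (and then split or repackaged) before upload. A minimal sketch of that check follows. The function name `files_over_limit` is hypothetical, and the default treats 50 GB as 50 × 10^9 bytes, which is an assumption since the exact definition is not stated here.

```python
from pathlib import Path


def files_over_limit(root: str, limit_bytes: int = 50 * 10**9) -> list[tuple[Path, int]]:
    """Return (path, size_in_bytes) pairs for files under `root` larger than `limit_bytes`."""
    oversized = []
    # Walk the directory tree in sorted order so the result is deterministic.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                oversized.append((path, size))
    return oversized
```

Running `files_over_limit("path/to/dataset")` on a local copy (the path is a placeholder) would list any archives that cannot be uploaded as a single file.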

@kbgg

kbgg commented Mar 24, 2023

> I would also like to +1 this. I've been having a ton of issues accessing model weights and files from Radiant Earth and I suspect they are no longer actively maintaining their endpoints.

We're aware of these issues. They stem from a combination of causes, ranging from architectural limitations to problems with Azure blob storage that haven't been resolved yet. We're working on an updated version of MLHub that resolves them, and it will be available in the near future.

@nilsleh
Collaborator

nilsleh commented Jun 21, 2023

With #1240 merged, can we move the USAVars dataset to HF? At the moment, the download through torchgeo keeps failing. I still have the dataset locally, so I could upload it to HF and open a PR to change the download links :) @calebrob6, @estherrolf

@adamjstewart
Collaborator Author

USAVars is CC-BY-4.0, so yes, we can redistribute on HF if you want.
