Redistribute datasets and models on Hugging Face #1073
Comments
Starting a work-in-progress list so that multiple people don't contact the same person.
Datasets
In-progress
Completed
Models
In-progress
Completed
|
I think DynamicEarthNet is re-distributable (based on a conversation with @lukaskondmann) |
This is correct. DynamicEarthNet is available under this license so redistribution is possible as long as attribution is given |
So2Sat is okay to be mirrored based on #388. |
From email conversations, OSCD and HRSCD both have CCA licenses which freely allow redistribution. ReforesTree may require permission from the authors. They have a shared data agreement with WWF. They were able to redistribute on Zenodo, but we should check back with them to see if we can redistribute on Hugging Face. |
@calebrob6 I would like to redistribute the USAVars dataset if possible, because the download is super slow and has failed several times. However, I am not sure what the actual source of this dataset is, since it is only a reproduction. I saw that you had a repo about the paper, so I am wondering if you know something about the source and license of the torchgeo USAVars dataset? |
Hey @nilsleh, yes, I helped create that dataset. We should definitely move it to HuggingFace. @estherrolf is soon going to make changes to the dataset so perhaps we can do that all together. |
I would also like to +1 this. I've been having a ton of issues accessing model weights and files from Radiant Earth and I suspect they are no longer actively maintaining their endpoints. |
Hugging Face has a maximum individual file size of 50 GB 😢 |
We're aware of these issues. They stem from a combination of factors, ranging from architectural limitations to problems with Azure blob storage that haven't been resolved yet. We're working on an updated version of MLHub that resolves them, and it will be available in the near future. |
With #1240 merged, can we move the USAVars dataset to HF? Because at the moment the download keeps failing through torchgeo. I still have the dataset locally, so I could upload it to HF and open a PR to change the download links :) @calebrob6, @estherrolf |
USAVars is CC-BY-4.0, so yes, we can redistribute on HF if you want. |
Summary
We should consider redistributing as many datasets and pre-trained models as we can on Hugging Face.
Rationale
Hugging Face provides a more reliable centralized repository for storing large binary files. It's a large company, so we don't have to worry about expired SSL certificates or servers going offline. We have full control over the files we upload, so we can make modifications (license permitting) to fix inconsistencies between model architectures.
It also provides significantly faster download speeds compared to similar sites. For example, for our ResNet-50 pre-trained weights (~100 MB):
For the EuroSAT dataset (~2 GB):
Implementation
First, we need to ensure that the dataset or model we are redistributing has a license that permits redistribution. If a license is missing or does not permit redistribution, we should reach out to the authors to see if a permissive license can be granted.
Once licensing is settled, we just need to upload the dataset or model to Hugging Face. The license chosen should match the original license. Any modifications from the original should be clearly documented, and a link should be added to the original source. This is required by many licenses, and is just a good idea to document in general.
Finally, the URL (and possibly MD5) in TorchGeo should be updated to point to the new download location.
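For that last step, here is a minimal sketch of how the new URL and MD5 could be worked out before opening a PR. The helper names `md5_of_file` and `hf_resolve_url` are hypothetical (they are not part of TorchGeo), and the repo ID in the comment is a placeholder. The URL layout follows the Hub's `resolve` download links.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hf_resolve_url(repo_id, filename, revision="main", repo_type="dataset"):
    """Build a direct-download URL for a file hosted on the Hugging Face Hub.

    Dataset repos live under a "datasets/" prefix; model repos have no prefix.
    """
    prefix = "datasets/" if repo_type == "dataset" else ""
    return f"https://huggingface.co/{prefix}{repo_id}/resolve/{revision}/{filename}"


# Hypothetical usage (placeholder repo ID and file name):
#   url = hf_resolve_url("some-org/some-dataset", "data.zip")
#   md5 = md5_of_file("data.zip")
```

Passing a commit hash as `revision` instead of `"main"` is probably the safer choice, since the checksum recorded in TorchGeo then stays valid even if the file is later replaced on the Hub.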
Alternatives
We previously used Zenodo for this but download speeds were abysmal. A quick survey of UIUC AI PhD students found that everyone uses Hugging Face 🤗
Additional information
We already have quite a lot of datasets, and dataset authors are often unresponsive to these kinds of inquiries. It's likely unrealistic to expect that we'll be able to redistribute every dataset and model, so I won't start a checklist just yet. High priority datasets and models include:
Again, we have to check the license first. Many datasets cannot be downloaded automatically for legal reasons.