Million-AID dataset #455

Merged
merged 11 commits into microsoft:main
Jul 9, 2022

Conversation

nilsleh
Collaborator

@nilsleh nilsleh commented Mar 9, 2022

This PR adds the Million-AID dataset, which contains one million aerial scenes from Google Earth.

Comments/Questions:

  • It offers both a multi-class (51 classes) and a multi-label (73 labels) task, which can be selected with an option I added to the constructor.
  • Images can have two or three labels in the multi-label case, so __getitem__ currently returns a variable-length label tensor.
  • I am not sure how best to handle the download because the total size is more than 200 GB. Is there a way to split it somehow?
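A variable-length label tensor is awkward to batch, because samples with two and three labels cannot be stacked directly. One common workaround, shown here only as a minimal sketch and not as what this PR implements, is to encode the labels as a fixed-length multi-hot vector. The function name `to_multi_hot` and the plain-list representation are assumptions made for illustration; the value 73 is the multi-label count quoted above.

```python
from typing import List, Sequence

NUM_LABELS = 73  # multi-label task size quoted in the PR description


def to_multi_hot(label_indices: Sequence[int], num_labels: int = NUM_LABELS) -> List[int]:
    """Turn a variable-length list of label indices into a fixed-length 0/1 vector.

    Hypothetical helper for illustration; not part of the PR.
    """
    vector = [0] * num_labels
    for index in label_indices:
        if not 0 <= index < num_labels:
            raise ValueError(f"label index {index} outside [0, {num_labels})")
        vector[index] = 1
    return vector


# Samples with two and three labels end up with the same length,
# so they could be stacked into a single batch.
two_labels = to_multi_hot([4, 17])
three_labels = to_multi_hot([4, 17, 60])
```

The trade-off is that the original label ordering is lost, which may or may not matter for the hierarchy of labels in this dataset.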

Plot Examples:

@github-actions bot added the datasets, documentation, and testing labels Mar 9, 2022
@adamjstewart added this to the 0.3.0 milestone Mar 10, 2022
@adamjstewart
Collaborator

Bleh, I'll fix the mypy issues in another PR. It looks like PyTorch finally added type hints for many of its functions in the latest release, which came out yesterday.

@calebrob6
Member

@nilsleh can you rebase on main here?

@adamjstewart
Collaborator

Rebasing isn't necessary; we just need to remove the few remaining type ignores that are causing the tests to fail.

adamjstewart previously approved these changes Apr 4, 2022
Collaborator

@isaaccorley isaaccorley left a comment

The test set is ~260 GB and split into multiple parts, whereas the train set is a single zip file. I wonder if we even want users to download this in a single sequential process. @adamjstewart

}
url = {
"train": "https://eastus1-mediap.svc.ms/transform/zip?cs=fFNQTw",
"test": "https://eastus1-mediap.svc.ms/transform/zip?cs=fFNQTw",
Collaborator

Does this actually download the test set? These look like the same URL to me. It also looks like the test set is made of multiple parts (e.g. test.zip.001, test.zip.002, etc.). Does extract_archive support this?

Collaborator

That is a good question, I'm not sure whether that will work or not. We've had to hack things to support deflate64-compressed zip files before, so multi-part zip files should be possible somehow.
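For reference, one way multi-part archives such as test.zip.001, test.zip.002 are often handled is to concatenate the parts back into a single file and then extract it normally. The sketch below uses only the Python standard library and assumes the parts are plain byte-wise splits of one zip file; that has not been confirmed for this dataset, and true spanned zip archives would need a different tool. The helper names `join_parts` and `extract_joined` are hypothetical and are not torchgeo functions.

```python
import shutil
import zipfile
from pathlib import Path


def join_parts(directory: Path, stem: str = "test.zip") -> Path:
    """Concatenate stem.001, stem.002, ... into a single file named stem.

    Assumes the parts are plain byte-wise splits of one archive.
    """
    # Zero-padded three-digit suffixes sort correctly as strings.
    parts = sorted(directory.glob(f"{stem}.[0-9][0-9][0-9]"))
    if not parts:
        raise FileNotFoundError(f"no parts matching {stem}.NNN in {directory}")
    joined = directory / stem
    with open(joined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so nothing is loaded fully into memory.
                shutil.copyfileobj(src, out)
    return joined


def extract_joined(directory: Path, stem: str = "test.zip") -> None:
    """Join the parts, then extract the resulting archive next to them."""
    joined = join_parts(directory, stem)
    with zipfile.ZipFile(joined) as archive:
        archive.extractall(directory)
```

Note that joining temporarily doubles the disk space needed, which matters for an archive of this size.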

Collaborator Author

> Does this actually download the test set? These look like the same url to me.

If I look in the OneDrive and download the train or test folder, the download links are the same for me for some reason.

Collaborator

@isaaccorley you mentioned that you downloaded the test set and computed its MD5. Where did you download the test set from? Was it multiple parts when you downloaded and checksummed it?

Collaborator

I actually just noticed that test.zip was corrupted and I couldn't unzip it completely so I think the hash may be incorrect. Going to try and download each of the test files individually.
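A corrupted or truncated download like this can usually be detected before computing a hash. As a minimal sketch, the standard library's `zipfile.ZipFile.testzip()` reads every member and reports the first one that fails its CRC check, while opening a structurally broken archive raises `zipfile.BadZipFile`. The wrapper name `is_zip_intact` is hypothetical.

```python
import zipfile
from pathlib import Path


def is_zip_intact(path: Path) -> bool:
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        # The archive structure itself is unreadable, e.g. a truncated download.
        return False
```

This reads the whole archive once, so for a file of several hundred gigabytes it is slow but still cheaper than discovering the problem midway through extraction.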

Member

Was this resolved?

Collaborator

I don't think this was ever resolved, and it is the only thing holding up this PR. If there's a single link for all data, that's fine: we can download the data with one link, checksum it once, then extract it and extract any zip files that it contains. I don't have the bandwidth/storage to download this myself, but can someone investigate this and see if it's even possible to download this dataset? If not, we could just remove the download logic until we figure it out.
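The flow described here (checksum the single download once, extract it, then extract any zip files inside it) could look roughly like the following. This is a sketch using only the standard library, not the download logic from the PR, which was ultimately removed. The names `md5_of` and `verify_and_extract` are hypothetical, and only one level of nesting is handled.

```python
import hashlib
import zipfile
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_extract(archive: Path, expected_md5: str, dest: Path) -> None:
    """Checksum the outer archive once, extract it, then extract nested zips."""
    actual = md5_of(archive)
    if actual != expected_md5:
        raise RuntimeError(f"MD5 mismatch for {archive}: got {actual}")
    with zipfile.ZipFile(archive) as outer:
        outer.extractall(dest)
    # Collect the nested archives first so newly extracted files are not rescanned.
    for inner in sorted(dest.rglob("*.zip")):
        if inner.resolve() == archive.resolve():
            continue  # skip the outer archive if it happens to live under dest
        with zipfile.ZipFile(inner) as nested:
            nested.extractall(inner.parent)
```

Reading in chunks keeps memory use constant regardless of archive size, which is the main design concern for a download this large.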

Collaborator Author

I unfortunately also do not have the bandwidth/storage for this download :/

Collaborator

Did anyone ever reach out to the dataset authors to see if we can rehost a single zip file on something like Zenodo?

@adamjstewart
Collaborator

Since the download logic doesn't currently work, I'm going to remove it so we can get this into 0.3.0. We can add download support in a future release.

@adamjstewart adamjstewart self-assigned this Jul 9, 2022
@@ -16,6 +16,7 @@ Dataset,Task,Source,# Samples,# Classes,Size (px),Resolution (m),Bands
`LandCover.ai`_,S,Aerial,"10,674",5,512x512,0.25--0.5,RGB
`LEVIR-CD+`_,CD,Google Earth,985,2,"1,024x1,024",0.5,RGB
`LoveDA`_,S,Google Earth,"5,987",7,"1,024x1,024",0.3,RGB
`Million-AID`_,C,Google Earth,1M,51--73,,0.5--153,RGB
Collaborator

Does anyone know the range of image sizes?

@adamjstewart adamjstewart dismissed stale reviews from isaaccorley and calebrob6 July 9, 2022 22:02

Download logic removed

@adamjstewart adamjstewart enabled auto-merge (squash) July 9, 2022 22:03
@adamjstewart adamjstewart merged commit 2d14883 into microsoft:main Jul 9, 2022
@adamjstewart adamjstewart mentioned this pull request Jul 11, 2022
yichiac pushed a commit to yichiac/torchgeo that referenced this pull request Apr 29, 2023
* millionaid

* test

* separator

* remove type ignore

* type in test

* requested changes

* typos and glob pattern

* task argument description

* add test md5 hash

* Remove download logic

* Type ignore no longer needed

Co-authored-by: Adam J. Stewart <[email protected]>