weighted geo sampler #757

Open
Geethen opened this issue Sep 5, 2022 · 8 comments
Labels
samplers Samplers for indexing datasets

Comments

@Geethen

Geethen commented Sep 5, 2022

Summary

On imbalanced datasets, the current samplers may not help correct for the imbalance in the samples they draw.

I am currently trying to get rid of samples whose mask contains only 0-metre heights (water regions).

Rationale

No response

Implementation

No response

Alternatives

No response

Additional information

No response

@adamjstewart
Collaborator

I like the idea, but how would you implement it? Unlike NonGeoDatasets, GeoDatasets will recursively search for files on disk, so you can't just pass in a list of weights. You could compute those weights, but how would you make a single class that is generic enough to allow users to do this?

adamjstewart added the samplers label Sep 5, 2022
@calebrob6
Member

calebrob6 commented Sep 5, 2022 via email

@adamjstewart
Collaborator

> You could get a list of filenames from RasterDataset's index, compute weights, then pass those to the sampler.

This feels a bit fragile. For example, if your dataset is an IntersectionDataset or UnionDataset, you now need to be more careful because each "hit" could be both image and label, or from a different dataset entirely. But yes, this could work.

> I'll note this is a good reason why RasterDatasets should be able to be instantiated from a list of filenames.

Should be easier to support a list of filenames for instantiation when we move to TorchData.
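
The idea discussed above (list the files in the index, compute a weight per file, pass the weights to the sampler) can be sketched roughly as follows. This is only an illustration in plain Python, not TorchGeo's API: `index_entries`, `zero_fraction`, `compute_weights`, and `sample_entries` are hypothetical names, and the per-file zero fractions are made-up values that would have to be computed from the real masks.

```python
import random

# Hypothetical stand-in for the entries of a dataset index: each entry pairs a
# file path with its bounding box (minx, maxx, miny, maxy). These names and
# values are illustrative and are not TorchGeo's API.
index_entries = [
    ("tiles/water.tif", (0, 10, 0, 10)),
    ("tiles/coast.tif", (10, 20, 0, 10)),
    ("tiles/forest.tif", (20, 30, 0, 10)),
]

# Hypothetical per-file statistic: the fraction of mask pixels equal to zero.
zero_fraction = {
    "tiles/water.tif": 1.0,
    "tiles/coast.tif": 0.5,
    "tiles/forest.tif": 0.0,
}


def compute_weights(entries, zero_fraction):
    """Weight each file by its share of non-zero mask pixels."""
    return [1.0 - zero_fraction[path] for path, _ in entries]


def sample_entries(entries, weights, n, seed=None):
    """Draw n entries with replacement, with probability proportional to weight."""
    rng = random.Random(seed)
    return rng.choices(entries, weights=weights, k=n)


weights = compute_weights(index_entries, zero_fraction)
samples = sample_entries(index_entries, weights, n=5, seed=0)
```

Because the all-water file gets weight 0, it is never drawn, while files with some zero labels are still sampled at a reduced rate. The fragility mentioned above still applies: for an IntersectionDataset or UnionDataset, an entry may not map to a single image/label file, so the weights would need to be defined per hit rather than per filename.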

@calebrob6
Member

calebrob6 commented Sep 5, 2022 via email

@adamjstewart
Collaborator

It's not hard to support without TorchData, but it becomes easier to support with TorchData because the user can construct their own data loading pipeline with a set of common operations. So they can choose whether they want to specify a list of files, or recursively search a directory, or use a STAC API, or whatever. I also still need to investigate TorchData. I'm hoping it doesn't put all of the work on the user.

@isaaccorley
Collaborator

This seems like 2 separate problems.

  1. Dealing with sampling from imbalanced datasets
  2. You are trying to remove areas where a value in a mask is zero. Could a possible solution be to create another mask RasterDataset where values aren't 0 in the original mask and then take the intersection of these?

@Geethen
Author

Geethen commented Sep 6, 2022

> This seems like 2 separate problems.
>
> 1. Dealing with sampling from imbalanced datasets

This is the broader problem. One way I would approach it is to generate a grid based on user-specified criteria (pixel width, pixel height, and nSamples), compute the percentage cover of each label value per grid cell (patch), and then filter out any patches that do not meet the weight criteria specified by the user. For example, in my case, any cell with at most 50% cover of zero is allowed. I could do this quickly and easily in Earth Engine but have no idea how to go about it in Python. For now, I will implement this in GEE to preprocess the data I use. Mine is a regression problem in which only the zero value is problematic, so it is somewhat simpler than a multi-class classification problem.

> 2. You are trying to remove areas where a value in a mask is zero. Could a possible solution be to create another mask RasterDataset where values aren't 0 in the original mask and then take the intersection of these?

It is beneficial to have some zero labels to learn from. Also, I do not think TorchGeo supports irregular polygons for intersection datasets, only bounding boxes.
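
The grid-based filtering described in this comment could look something like the sketch below. It is plain Python on a toy nested-list mask, assuming non-overlapping patches; `zero_cover`, `filter_patches`, and `max_zero_cover` are hypothetical names, the nSamples criterion is left out, and a real raster would more likely be read in windows and handled with array operations.

```python
def zero_cover(mask, row, col, height, width):
    """Return the fraction of pixels equal to zero in one patch of the mask."""
    patch = [mask[r][col:col + width] for r in range(row, row + height)]
    total = sum(len(line) for line in patch)
    zeros = sum(value == 0 for line in patch for value in line)
    return zeros / total


def filter_patches(mask, height, width, max_zero_cover=0.5):
    """Tile the mask into non-overlapping patches and keep those whose
    zero cover is at most max_zero_cover. Returns (row, col) offsets."""
    n_rows, n_cols = len(mask), len(mask[0])
    kept = []
    for row in range(0, n_rows - height + 1, height):
        for col in range(0, n_cols - width + 1, width):
            if zero_cover(mask, row, col, height, width) <= max_zero_cover:
                kept.append((row, col))
    return kept


# Toy 4x4 height mask in metres; 0 marks water.
mask = [
    [0, 0, 3, 4],
    [0, 0, 5, 6],
    [0, 2, 0, 0],
    [1, 3, 0, 0],
]

kept = filter_patches(mask, height=2, width=2)
```

With the 50% threshold from the example above, the two all-water patches are dropped, while the patch with 25% zero cover is kept, so some zero labels remain available to learn from.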

@adamjstewart
Collaborator

FYI, we are planning on working on this for our time series efforts. All samplers will allow users to pass in weights, not just a single WeightedGeoSampler.

4 participants