RandomGeoSampler bias #408

adamjstewart · 2022-02-16T20:06:41Z

I've discovered two instances of bias in our current RandomGeoSampler and RandomBatchGeoSampler implementations.

Area bias

In our current implementations, we first select a tile uniformly at random, then choose a random patch from that tile. This means that small tiles are just as likely to be sampled as large tiles. For most datasets, this is not an issue, as all tiles are approximately the same size. However, for IntersectionDatasets, the area of each intersection can vary widely. Tiles that are barely large enough to yield a single sample will be sampled just as often as massive tiles. This is particularly problematic for RandomBatchGeoSampler, which would sample an entire batch of images from that very small tile.

This is an issue inherent in our current implementation, but is relatively easy to fix. The solution would be to use a weighted random sampler where weights are derived from the area of each tile. Large tiles would then be more likely to be sampled than small tiles.
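A minimal sketch of what area-weighted sampling could look like, using only the standard library. The `Tile` tuple, `tile_area`, and `sample_patch` are hypothetical names for illustration and are not part of TorchGeo; tiles are assumed to be axis-aligned boxes given as (minx, maxx, miny, maxy).

```python
import random
from typing import List, Tuple

# Hypothetical tile representation: (minx, maxx, miny, maxy) in CRS units.
Tile = Tuple[float, float, float, float]


def tile_area(tile: Tile) -> float:
    """Area of an axis-aligned tile."""
    minx, maxx, miny, maxy = tile
    return (maxx - minx) * (maxy - miny)


def sample_patch(tiles: List[Tile], size: float, rng: random.Random) -> Tile:
    """Pick a tile with probability proportional to its area, then a patch in it."""
    # Only tiles that can hold a full patch are eligible.
    eligible = [t for t in tiles if t[1] - t[0] >= size and t[3] - t[2] >= size]
    if not eligible:
        raise ValueError("no tile is large enough for the requested patch size")
    # Weighted choice: larger tiles are chosen more often than smaller ones.
    weights = [tile_area(t) for t in eligible]
    minx, maxx, miny, maxy = rng.choices(eligible, weights=weights, k=1)[0]
    # Uniform position of the patch's lower-left corner inside the chosen tile.
    x = rng.uniform(minx, maxx - size)
    y = rng.uniform(miny, maxy - size)
    return (x, x + size, y, y + size)
```

With one 10 x 10 tile and one 100 x 100 tile, the large tile would be chosen roughly 99% of the time instead of 50%. A batch sampler could apply the same weighted choice once per batch before drawing all of its patches from the selected tile.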

Latitude bias

In certain 2D projections like Mercator, polar regions are stretched so that they take up as much projected area as equatorial regions, even though they cover far less area on the actual 3D Earth. If we sample uniformly at random in projected coordinates, we end up oversampling the poles relative to the equator. This has serious consequences for model training, including models that are biased towards polar regions or that overestimate or underestimate certain climate patterns.

This is an issue inherent to 2D projections of the Earth in general, not necessarily an issue in TorchGeo. We could still try to do something about this, however. One solution would be to force the R-tree index to be in an equal-area projection like Albers. Another solution would be to force the index to be in Mercator and to use a weighted random sampler where the weight comes from (the square root of?) the latitude.
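As a rough sketch of the weighted option: on a spherical Mercator projection the linear scale factor is 1/cos(latitude) in every direction, so the projected area is inflated by 1/cos²(latitude). One candidate weight is therefore cos²(latitude) rather than the latitude itself. The function names and the sphere radius below are assumptions for illustration, and the formulas ignore ellipsoidal corrections.

```python
import math

# Assumed sphere radius in metres (the WGS 84 semi-major axis).
EARTH_RADIUS_M = 6378137.0


def mercator_y_to_lat(y: float, radius: float = EARTH_RADIUS_M) -> float:
    """Latitude in degrees for a spherical Mercator northing y in metres."""
    return math.degrees(2.0 * math.atan(math.exp(y / radius)) - math.pi / 2.0)


def mercator_area_weight(lat_deg: float) -> float:
    """Ground area per unit of projected area, relative to the equator.

    Mercator stretches lengths by 1 / cos(lat), so areas by 1 / cos(lat)**2;
    the weight that undoes this inflation is cos(lat)**2.
    """
    return math.cos(math.radians(lat_deg)) ** 2
```

Under these assumptions a tile centred at 60 degrees latitude would get a quarter of the weight of an equatorial tile with the same projected area. For an index stored in plain longitude/latitude degrees, the corresponding weight would be cos(latitude) instead, since only the east-west direction is stretched.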


adamjstewart · 2022-03-19T15:36:54Z

I think latitude bias is going to be difficult to correct. It depends on the CRS being used (equal-angle CRSs are affected, but equal-area CRSs are not), so we can't simply use a weighted random sampler without first checking against a list of known equal-angle CRSs. It's also unclear what to do for CRSs that are neither equal-angle nor equal-area.
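To illustrate the kind of check this would need, here is a small sketch based on a hand-maintained lookup table. The table contents and the `needs_latitude_weighting` helper are hypothetical, the lists are far from complete, and the `None` return value stands for the unresolved case where the CRS is neither known to be equal-area nor known to be equal-angle.

```python
from typing import Optional

# Hypothetical, deliberately tiny lookup tables keyed by EPSG code.
EQUAL_AREA_EPSG = {
    5070,  # NAD83 / Conus Albers
    6933,  # WGS 84 / NSIDC EASE-Grid 2.0 Global
}
EQUAL_ANGLE_EPSG = {
    3395,  # WGS 84 / World Mercator
}


def needs_latitude_weighting(epsg: int) -> Optional[bool]:
    """Return False for equal-area CRSs, True for equal-angle ones, else None."""
    if epsg in EQUAL_AREA_EPSG:
        return False  # projected area already matches ground area
    if epsg in EQUAL_ANGLE_EPSG:
        return True  # area is distorted, so a correction weight is needed
    return None  # unknown or compromise projection: no clear answer
```

A sampler could fall back to unweighted sampling when the result is `None`, although whether that is the right default is exactly the open question. Note also that the correction weight itself depends on the projection, so a single latitude formula would not cover every equal-angle CRS.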

adamjstewart · 2022-03-19T15:37:58Z

Correcting area bias will require the new bbox.area attributes introduced in 0.3.0.

adamjstewart added the samplers label (Samplers for indexing datasets) Feb 16, 2022

adamjstewart added this to the 0.2.2 milestone Mar 19, 2022

adamjstewart modified the milestones: 0.2.2, 0.3.0 Mar 19, 2022

adamjstewart mentioned this issue Mar 22, 2022

RandomGeoSampler: several bug fixes #477

Merged

4 tasks

calebrob6 closed this as completed in #477 Apr 5, 2022

adamjstewart mentioned this issue Jul 11, 2022

0.3.0 release #664

Merged
