I've discovered two instances of bias in our current RandomGeoSampler and RandomBatchGeoSampler implementations.
Area bias
In our current implementations, we first select a tile uniformly at random, then choose a random patch from that tile. This means that small tiles are just as likely to be sampled as large tiles. For most datasets, this is not an issue, as all tiles are approximately the same size. However, for IntersectionDatasets, the area of each intersection can vary widely. Tiles that are barely large enough to yield a single sample are sampled just as often as massive tiles. This is particularly problematic for RandomBatchGeoSampler, which would sample an entire batch of images from that very small tile.
This is an issue inherent in our current implementation, but it is relatively easy to fix. The solution would be to use a weighted random sampler where weights are derived from the area of each tile. That way, large tiles will be more likely to be sampled from than small tiles.
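A minimal sketch of the area-weighted idea, using only the standard library. The `Tile` type and `sample_patch` function are hypothetical names made up for illustration and are not part of the TorchGeo API; tiles are assumed to be axis-aligned bounding boxes in a single CRS.

```python
import random
from typing import NamedTuple, Sequence


class Tile(NamedTuple):
    """Hypothetical axis-aligned bounding box (not a TorchGeo type)."""

    minx: float
    maxx: float
    miny: float
    maxy: float

    @property
    def area(self) -> float:
        return (self.maxx - self.minx) * (self.maxy - self.miny)


def sample_patch(tiles: Sequence[Tile], size: float, rng: random.Random) -> Tile:
    """Pick a tile with probability proportional to its area, then a patch in it."""
    # Only tiles that can hold at least one full patch are candidates.
    candidates = [
        t for t in tiles if t.maxx - t.minx >= size and t.maxy - t.miny >= size
    ]
    if not candidates:
        raise ValueError("no tile is large enough for the requested patch size")

    # Weighted choice: large tiles are selected more often than small ones.
    weights = [t.area for t in candidates]
    tile = rng.choices(candidates, weights=weights, k=1)[0]

    # Uniform patch location inside the chosen tile.
    minx = rng.uniform(tile.minx, tile.maxx - size)
    miny = rng.uniform(tile.miny, tile.maxy - size)
    return Tile(minx, minx + size, miny, miny + size)
```

For a batch sampler, the same weighted choice would be made once per batch, so a tiny tile would only rarely supply a whole batch.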
Latitude bias
In certain 2D projections like Mercator, polar regions are drawn at least as large as equatorial regions that cover far more ground on the actual 3D Earth. If we sample uniformly at random in projected coordinates, we end up oversampling the poles relative to the equator. This has serious consequences for model training, including models that are biased towards polar regions or that overestimate or underestimate certain climate patterns.
This is an issue inherent to 2D projections of the Earth in general, not necessarily an issue in TorchGeo. However, we could still try to do something about it. One solution would be to force the R-tree index to be in an equal-area projection like Albers. Another solution would be to force the index to be in Mercator and to use a weighted random sampler where the weight comes from (the square root of?) the latitude.
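On the question of which function of latitude to use, here is a rough sketch under a spherical-Earth assumption. The function name `latitude_weight` and its `projection` argument are hypothetical and not part of TorchGeo. On a sphere, Mercator stretches lengths by 1/cos(lat) in both directions, so true area per unit of projected area scales with cos²(lat). For a plain latitude/longitude (equirectangular) grid only the east-west direction is stretched, so the factor is cos(lat).

```python
import math


def latitude_weight(lat_deg: float, projection: str = "equirectangular") -> float:
    """Relative true ground area per unit of projected area at a given latitude.

    Assumes a spherical Earth; `projection` is a hypothetical switch that
    covers only the two cases discussed here.
    """
    c = math.cos(math.radians(lat_deg))
    if projection == "equirectangular":
        # Only east-west distances are stretched, by 1 / cos(lat).
        return c
    if projection == "mercator":
        # Both directions are stretched by 1 / cos(lat), so area by 1 / cos^2(lat).
        return c * c
    raise ValueError(f"unsupported projection: {projection!r}")
```

Multiplying a tile's projected area by this weight, evaluated at the tile's center latitude, would give an approximate sampling weight that accounts for both biases at once. The approximation degrades for tiles that span a wide range of latitudes.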
I think latitude bias is going to be difficult to correct. It depends on the CRS being used (equal-angle CRSs are affected, but equal-area CRSs are not), so we can't simply use a weighted random sampler without first checking against a list of known equal-angle CRSs. It's also unclear what to do for CRSs that are neither equal-angle nor equal-area.