GeoDataset: non-deterministic behavior #1899
Comments
I believe that at least for the case of
However, I am pretty sure that this whole thing is due to the

    # Using set to remove any duplicates if directories are overlapping
    files: set[str] = set()
    for path in paths:
        if os.path.isdir(path):
            pathname = os.path.join(path, "**", self.filename_glob)
            files |= set(glob.iglob(pathname, recursive=True))
        elif os.path.isfile(path) or path_is_vsi(path):
            files.add(path)
        else:
            warnings.warn(
                f"Could not find any relevant files for provided path '{path}'. "
                f"Path was ignored.",
                UserWarning,
            )
    return files

I see that
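For readers who want to see why returning a set can change order between program executions, here is a minimal sketch. It assumes CPython's per-process string hash randomization (controlled by the `PYTHONHASHSEED` environment variable) is the source of the differing order; the file names and the helper `orders_for_seed` are hypothetical and not part of TorchGeo.

```python
import os
import subprocess
import sys

# Code run in each child interpreter: build a set of hypothetical file
# names, then print its iteration order and its sorted order.
CHILD_CODE = (
    "files = {'scene_%d.tif' % i for i in range(8)}\n"
    "print(','.join(files))\n"
    "print(','.join(sorted(files)))\n"
)


def orders_for_seed(seed: int) -> tuple[str, str]:
    """Return (set order, sorted order) as seen by a fresh interpreter."""
    env = {**os.environ, "PYTHONHASHSEED": str(seed)}
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    set_order, sorted_order = result.stdout.splitlines()
    return set_order, sorted_order


if __name__ == "__main__":
    for seed in (1, 2, 3):
        set_order, sorted_order = orders_for_seed(seed)
        print(f"seed={seed} set:    {set_order}")
        print(f"seed={seed} sorted: {sorted_order}")
```

The set order may vary from one seed to the next, while the sorted order is the same in every run. Within a single process the set order does not change, which matches the observation that repeated splits in the same runtime agree.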
Thanks for catching this, this is a HUGE bug! From what I can tell, the implications of this bug are as follows:
This bug was introduced in #1442 and #1597, and is included in releases 0.5.0 and 0.5.1. Sorry I didn't catch this during review @adriantre. We'll fix this in the next release. I just finished presenting TorchGeo at a reproducibility workshop, so this is a bit embarrassing for me personally... 😅

I was able to reproduce implication 1 as follows. First, apply the following patch:

    diff --git a/torchgeo/datasets/l7irish.py b/torchgeo/datasets/l7irish.py
    index fe9e6b4a..fd9eeb89 100644
    --- a/torchgeo/datasets/l7irish.py
    +++ b/torchgeo/datasets/l7irish.py
    @@ -183,6 +183,7 @@ class L7Irish(RasterDataset):
             """
             hits = self.index.intersection(tuple(query), objects=True)
             filepaths = cast(list[str], [hit.object for hit in hits])
    +        print(filepaths)
     
             if not filepaths:
                 raise IndexError(

Then, run the following code:

    from torch.utils.data import DataLoader
    from lightning.pytorch import seed_everything
    from torchgeo.datasets import L7Irish, stack_samples
    from torchgeo.samplers import RandomGeoSampler

    seed_everything(0)
    dataset = L7Irish(paths="data/l7irish", download=True)
    sampler = RandomGeoSampler(dataset, size=64, length=10)
    dataloader = DataLoader(dataset, sampler=sampler, collate_fn=stack_samples)
    for batch in dataloader:
        pass

Every time you run this program, you'll notice that the order changes. Implications 2 and 3 follow by definition and can be easily reproduced as you described.

I believe your fix of sorting the output of

    diff --git a/torchgeo/datasets/geo.py b/torchgeo/datasets/geo.py
    index 1e2382db..c044afb7 100644
    --- a/torchgeo/datasets/geo.py
    +++ b/torchgeo/datasets/geo.py
    @@ -287,7 +287,7 @@ class GeoDataset(Dataset[dict[str, Any]], abc.ABC):
             self._res = new_res
     
         @property
    -    def files(self) -> set[str]:
    +    def files(self) -> list[str]:
             """A list of all files in the dataset.
     
             Returns:
    @@ -316,7 +316,7 @@ class GeoDataset(Dataset[dict[str, Any]], abc.ABC):
                         UserWarning,
                     )
     
    -        return files
    +        return sorted(files)

I would love to figure out a good way to test these things. We could test that
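One possible shape for such a check, as a rough sketch rather than the test that ended up in TorchGeo: a standalone version of the file-collection logic with the sorted return value, plus a small temporary directory to exercise it. The function `list_files`, the helper `make_example_tree`, and the file names are hypothetical; the `path_is_vsi` branch and the warning from the original property are left out to keep the example self-contained.

```python
import glob
import os
import tempfile


def list_files(paths: list[str], filename_glob: str = "*.tif") -> list[str]:
    """Collect matching files under each path and return them sorted."""
    # A set still removes duplicates when directories overlap.
    files: set[str] = set()
    for path in paths:
        if os.path.isdir(path):
            pathname = os.path.join(path, "**", filename_glob)
            files |= set(glob.iglob(pathname, recursive=True))
        elif os.path.isfile(path):
            files.add(path)
    # Sorting gives an order that does not depend on string hashing.
    return sorted(files)


def make_example_tree(root: str) -> list[str]:
    """Create a few empty .tif files under root and return their paths."""
    relative = ["b.tif", "a.tif", os.path.join("sub", "c.tif")]
    created = []
    for rel in relative:
        full = os.path.join(root, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()
        created.append(full)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        created = make_example_tree(root)
        # Passing the same directory twice mimics overlapping paths.
        print(list_files([root, root]) == sorted(created))
```

A unit test along these lines only checks that the return value is a sorted, duplicate-free list. It does not by itself prove that downstream splitters and samplers are deterministic across interpreter restarts.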
Hi, thanks for taking the time to investigate the issue; I’m glad to help out! I can take on the PR, but a word of warning: I’m really new to production-level CI/CD, so this may take a bit until I understand the contribution guidelines. Regarding testing, I think that proving that each part of TorchGeo is individually deterministic should be enough to assume the same for the whole thing. So, maybe just confirming that
No worries, we have documentation for this: https://torchgeo.readthedocs.io/en/stable/user/contributing.html. Unit tests will go in
Kind of. This ensures that |
Thanks for the offer! Will attempt to push something today because this actually affects experimentation for my thesis work and I'd like to get it out of the way ASAP. I see why you're scared of sets now haha!
Description
The random_bbox_assignment dataset splitter (as well as all other spatial index-based splitters, I suppose) is not deterministic between different program executions. That is, splitting a dataset multiple times during the same runtime (e.g., by calling GeoDataModule.setup() multiple times in a row) always results in the same splits, which is expected. However, these splits are different when restarting the current kernel/program/script.

Steps to reproduce
1. Go to the random_bbox_assignment function definition and add the following line under hits = list(dataset.index.intersection(dataset.index.bounds, objects=True)): x = [hit.object for hit in hits] (that is, before hits is permuted).
2. Instantiate L7IrishDataModule.
3. Call .setup("fit") on the data module while debugging.
4. Save x somewhere.
5. Restart the program, repeat the steps, and observe that x has changed.

Version
0.5.1
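To illustrate why a seeded splitter can still produce different splits across executions, here is a small sketch using only the standard library. It assumes the splitter permutes indices with a seeded generator and then applies that permutation to whatever order the spatial index returned; `random.Random` stands in for the generator used by the real splitter, and `split_in_half` and the file names are hypothetical.

```python
import random


def split_in_half(items: list[str], seed: int) -> tuple[list[str], list[str]]:
    """Permute item indices with a seeded generator, then split in two."""
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)  # the permutation depends only on the seed
    shuffled = [items[i] for i in order]
    mid = len(items) // 2
    return shuffled[:mid], shuffled[mid:]


# The same four files, as two program executions might list them.
run_a = ["a.tif", "b.tif", "c.tif", "d.tif"]
run_b = ["d.tif", "c.tif", "b.tif", "a.tif"]

if __name__ == "__main__":
    # Same seed, same input order: identical splits (same runtime).
    print(split_in_half(run_a, seed=0) == split_in_half(run_a, seed=0))
    # Same seed, different input order: the splits differ.
    print(split_in_half(run_a, seed=0) == split_in_half(run_b, seed=0))
    # Sorting the input first makes the splits agree again.
    print(split_in_half(sorted(run_a), seed=0) == split_in_half(sorted(run_b), seed=0))
```

The seed fixes the permutation of indices, not the items those indices point to, so a stable input order is needed in addition to a fixed seed.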