Best way to split train/val/test indices/datasets ? #1210

TolgaAktas · 2023-03-30T06:51:33Z

I am trying to work out the best strategy for creating train/val/test splits from a single data directory that contains subdirectories of shadow images and clean (shadow-free) images. I intersect the two to obtain pairs of pixel-aligned shadow and shadow-free images. Is there existing code in torchgeo that I can use to do the splits? Should I consider splitting the sampler outputs instead? Splitting the image sets would leave very few data points, so I was hoping to split the sampled batches of patches.

Here's my code to obtain dataloaders on the intersection dataset:

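from torch.utils.data import DataLoader
from torchgeo.datasets import Landsat8, stack_samples
from torchgeo.samplers import RandomBatchGeoSampler

# clean_path, shadow_path, and bands are defined elsewhere by the user.
clean_set = Landsat8(root=clean_path, bands=bands)
shadow_set = Landsat8(root=shadow_path, bands=bands)

# The & operator intersects the two datasets.
allset = shadow_set & clean_set
print(allset)

# Random 512x512 patches, 10 per batch.
sampler = RandomBatchGeoSampler(allset, size=(512, 512), batch_size=10, length=100)
dataloader = DataLoader(allset, batch_sampler=sampler, collate_fn=stack_samples)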

adamjstewart (Maintainer) · 2023-03-30T14:38:53Z

You're in luck! @pmandiola literally just added several splitting utilities to TorchGeo in #866. These will be included in the 0.5.0 release, but are already in the main branch if you want to experiment with them. See the documentation, and let us know if there are any splitting techniques you think are missing.

2 replies

TolgaAktas Aug 16, 2023
Author

I tried random_bbox_assignment and random_bbox_splitting, but I am not sure either is what I was looking for:

1. I think random_bbox_assignment just takes a dataset with, say, 14 images and assigns them to separate datasets, 12 and 2 in my case.
2. I tried random_bbox_splitting, but I didn't fully understand how it works. When I ask for a 0.8/0.2 ratio, it gives me an "Input proportion must be between 0 and 1." error, even though those values satisfy the condition. Printing the proportion values inside the BoundingBox.split function shows that, for this 0.8/0.2 ratio, some are infinitesimally small negative numbers and some are slightly larger than 1.

   With a 0.7/0.3 ratio, two datasets are built successfully, but I am not sure why the split function gets called 14 times (I am guessing once for each image in the dataset). So I think it just splits the bounding box of each image according to the given proportions, and each new dataset only sees that part of the image?
3. Ideally, what I was hoping to do was to split the sampler's output, that is, the list of bounding boxes to be sampled, into train/val/test splits. I think that would be a more extensive and economical way of using the dataset.
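
One plausible source of the out-of-range proportions reported for the 0.8/0.2 ratio is floating-point rounding. The sketch below assumes (nothing in this thread confirms it) that the splitter peels off one piece at a time and divides each requested fraction by the share still unassigned. `relative_proportions` is a hypothetical helper written for illustration, not a TorchGeo function.

```python
import math


def relative_proportions(fractions):
    """Return each fraction divided by the share still unassigned.

    Hypothetical helper: it mimics a splitter that cuts off one piece
    at a time, which is an assumption about the implementation.
    """
    remaining = 1.0
    result = []
    for fraction in fractions:
        result.append(fraction / remaining)
        remaining -= fraction
    return result


for fractions in ([0.8, 0.2], [0.7, 0.3]):
    last = relative_proportions(fractions)[-1]
    in_range = 0.0 <= last <= 1.0
    print(fractions, repr(last), "ok" if in_range else "out of range")

# A tolerant comparison treats the final share as "everything that is left".
print(math.isclose(relative_proportions([0.8, 0.2])[-1], 1.0))
```

With IEEE 754 doubles, `1.0 - 0.8` is slightly less than `0.2`, so the last proportion for 0.8/0.2 lands just above 1, while for 0.7/0.3 it lands just below 1. That would match the behaviour described, but only a minimal reproducible example against the real code can confirm it.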

adamjstewart Aug 16, 2023
Maintainer

It may help if I summarize each splitting utility:

random_bbox_assignment: If your dataset has 100 files, a 0.8/0.2 split will put 80 files in one dataset and 20 files in the other dataset. No file will appear in both datasets.
random_bbox_splitting: A 0.8/0.2 split will put 80% of each file in one dataset and the other 20% in the other dataset. Every file will occur in both datasets.
random_grid_cell_assignment: Imagine a grid on top of each file in your dataset. This splitter randomly assigns those cells to your new split datasets.
roi_split: manually split the dataset based on user-chosen bounding boxes for each split.
time_series_split: split along the time dimension.
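
The difference between the first two utilities can be sketched with plain tuples. This is only an illustration of the idea, not TorchGeo's implementation: the boxes are made-up `(minx, maxx, miny, maxy)` extents, and `assign_whole_boxes` and `split_each_box` are hypothetical helpers.

```python
import random

# Hypothetical extents (minx, maxx, miny, maxy), one per file.
boxes = [
    (0, 100, 0, 100),
    (100, 200, 0, 100),
    (0, 100, 100, 200),
    (100, 200, 100, 200),
    (200, 300, 0, 100),
]


def assign_whole_boxes(boxes, fraction, rng):
    """Assignment style: every box goes, intact, to exactly one split."""
    shuffled = boxes[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def split_each_box(boxes, fraction, rng):
    """Splitting style: every box is cut in two along a random axis."""
    first, second = [], []
    for minx, maxx, miny, maxy in boxes:
        if rng.random() < 0.5:  # cut across the x axis
            mid = minx + (maxx - minx) * fraction
            first.append((minx, mid, miny, maxy))
            second.append((mid, maxx, miny, maxy))
        else:  # cut across the y axis
            mid = miny + (maxy - miny) * fraction
            first.append((minx, maxx, miny, mid))
            second.append((minx, maxx, mid, maxy))
    return first, second


rng = random.Random(0)
train_a, val_a = assign_whole_boxes(boxes, 0.8, rng)
train_s, val_s = split_each_box(boxes, 0.8, rng)
print(len(train_a), len(val_a))  # whole files: 4 and 1
print(len(train_s), len(val_s))  # every file appears in both: 5 and 5
```

In the assignment style the two splits never share a file. In the splitting style each split sees a disjoint part of every file.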

With that in mind:

1. On random_bbox_assignment: yes, this is correct.
2. On random_bbox_splitting: can you give me a minimal reproducible example so I can debug the error you're seeing? This shouldn't happen.
3. On splitting the sampler's output: you could do this manually if you want. The tools in torchgeo.datasets.splits are only for splitting datasets, not sampler output. You might also be able to use random_grid_cell_assignment.
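
A manual split of sampler output could look roughly like the sketch below: draw the patch bounding boxes once, freeze them in a list, then partition the list. Plain tuples stand in for the boxes a sampler would yield, and `draw_random_boxes` and `partition` are hypothetical helpers, not part of TorchGeo.

```python
import random


def draw_random_boxes(extent, size, count, rng):
    """Draw `count` square patches of side `size` inside `extent`.

    `extent` is (minx, maxx, miny, maxy); this stands in for sampler output.
    """
    minx, maxx, miny, maxy = extent
    patches = []
    for _ in range(count):
        x = rng.uniform(minx, maxx - size)
        y = rng.uniform(miny, maxy - size)
        patches.append((x, x + size, y, y + size))
    return patches


def partition(items, fractions, rng):
    """Shuffle `items` and cut them into consecutive groups by fraction."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    groups, start = [], 0
    for i, fraction in enumerate(fractions):
        if i == len(fractions) - 1:
            # The last group takes whatever is left, so nothing is dropped.
            end = len(shuffled)
        else:
            end = start + round(len(shuffled) * fraction)
        groups.append(shuffled[start:end])
        start = end
    return groups


rng = random.Random(0)
patches = draw_random_boxes((0, 10_000, 0, 10_000), size=512, count=100, rng=rng)
train, val, test = partition(patches, [0.6, 0.2, 0.2], rng)
print(len(train), len(val), len(test))  # 60 20 20
```

One caveat with this approach: randomly drawn patches can overlap, so patches that land in different splits may share pixels, which leaks information between train and evaluation sets. Splitting the dataset spatially first avoids that. In principle each resulting list could be passed as the `sampler` argument of a DataLoader, since PyTorch accepts any iterable of indices there, but that is untested here.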

lcoandrade · 2023-09-05T15:14:11Z

My approach was to subclass GeoDataModule to create a 60%/20%/20% train/validation/test split, like this:

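import torch
from torchgeo.datamodules import GeoDataModule
from torchgeo.datasets import random_bbox_assignment
from torchgeo.samplers import GridGeoSampler, RandomBatchGeoSampler


class CustomGeoDataModule(GeoDataModule):
    def setup(self, stage: str) -> None:
        """Set up datasets.

        Args:
            stage: Either 'fit', 'validate', 'test', or 'predict'.
        """
        self.dataset = self.dataset_class(**self.kwargs)

        # Fixed seed so the split is reproducible across runs.
        generator = torch.Generator().manual_seed(0)
        (
            self.train_dataset,
            self.val_dataset,
            self.test_dataset,
        ) = random_bbox_assignment(self.dataset, [0.6, 0.2, 0.2], generator)  # random 60/20/20 bbox assignment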
        
        if stage in ["fit"]:
            self.train_batch_sampler = RandomBatchGeoSampler(
                self.train_dataset, self.patch_size, self.batch_size, self.length
            )
        if stage in ["fit", "validate"]:
            self.val_sampler = GridGeoSampler(
                self.val_dataset, self.patch_size, self.patch_size
            )
        if stage in ["test"]:
            self.test_sampler = GridGeoSampler(
                self.test_dataset, self.patch_size, self.patch_size
            )

This custom GeoDataModule is built on an intersection dataset (images and labels):

import os

from torchgeo.datasets import RasterDataset


class NAIPImages(RasterDataset):
    filename_glob = "m_*.tif"
    is_image = True
    separate_files = False
    
class ChesapeakeLabels(RasterDataset):
    filename_glob = "m_*.tif"
    is_image = False
    separate_files = False

naip_root = os.path.join(INPUT_DIR, 'naip_images')
naip_images = NAIPImages(
    root=naip_root,
)

chesapeake_root = os.path.join(INPUT_DIR, "chesapeake_labels")
chesapeake_labels = ChesapeakeLabels(
    root=chesapeake_root,
)

dataset = naip_images & chesapeake_labels

Then I create the data module like this:

from torchgeo.datasets import stack_samples

datamodule = CustomGeoDataModule(
    dataset_class = type(dataset), # GeoDataModule kwargs
    batch_size = BATCH_SIZE, # GeoDataModule kwargs
    patch_size = IMG_SIZE, # GeoDataModule kwargs
    length = SAMPLE_SIZE, # GeoDataModule kwargs
    num_workers = WORKERS, # GeoDataModule kwargs
    dataset1 = naip_images, # IntersectionDataset kwargs
    dataset2 = chesapeake_labels, # IntersectionDataset kwargs
    collate_fn = stack_samples, # IntersectionDataset kwargs
)
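
How a 60/20/20 request turns into whole-file counts can be sketched as follows. The sketch assumes the floor-then-round-robin rule that `torch.utils.data.random_split` documents for fractional lengths. Whether TorchGeo follows exactly the same rule is an assumption, and `fractions_to_lengths` is a hypothetical helper.

```python
import math


def fractions_to_lengths(fractions, total):
    """Convert fractions summing to 1 into integer counts summing to `total`."""
    lengths = [math.floor(fraction * total) for fraction in fractions]
    remainder = total - sum(lengths)
    # Hand out leftover items one at a time, starting from the first split.
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths


print(fractions_to_lengths([0.6, 0.2, 0.2], 14))  # [9, 3, 2]
print(fractions_to_lengths([0.8, 0.2], 14))       # [12, 2]
print(fractions_to_lengths([0.8, 0.2], 100))      # [80, 20]
```

Under this rule, 14 files with a 0.8/0.2 request come out as 12 and 2, which is the split reported earlier in the thread.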

0 replies
