Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Implement grouped dataset splits and cross-validation #363

Merged
merged 18 commits into from
Jan 22, 2021

Conversation

dccastro
Copy link
Member

This PR adds the ability to specify a group_column in addition to the primary subject_column when creating DatasetSplits. If given, this ensures that subjects within each group cannot be in separate training/test/validation sets or cross-validation folds.

@dccastro dccastro self-assigned this Jan 19, 2021
@dccastro dccastro requested review from ant0nsc and Shruthi42 January 19, 2021 17:29
@dccastro dccastro marked this pull request as ready for review January 19, 2021 17:30
Copy link
Contributor

@Shruthi42 Shruthi42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency across configs, we should consider having a documented default for the group column name in the csv. For example, we have a default subject column name for segmentation dataset csv files, and we have a parameter in the config which is used to specify the subject column name in scalar datasets.


def pairwise_intersection(*collections: Iterable) -> Set:
"""Returns any element that appears in more than one collection."""
from itertools import combinations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason this import is local?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really; well spotted! I'll move it to the top.

CHANGELOG.md Outdated
@@ -12,6 +12,7 @@ created.

### Added
- New extensions of SegmentationModelBases `HeadAndNeckBase` and `ProstateBase`. Use these classes to build your own Head&Neck or Prostate models, by just providing a list of foreground classes.
- Grouped dataset splits and k-fold cross-validation. This allows, for example, training on datasets with multiple images per subject without leaking data from the same subject across train/test/validation sets or cross-validation folds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add here the specific change users will need to make to use this feature in their configs?

@dccastro
Copy link
Member Author

I thought about adding a default group column name, but couldn't immediately come up with a good catch-all solution... I believe in most cases that would be None (i.e. no grouping). In other cases, we might want group_column to refer to subject ID, while subject_column points to image/series ID instead. The other obvious use-case I could see is grouping by data source/institution/hospital/etc. Do you have any suggestions?

Comment on lines +215 to +217
key_column: str,
subject_column: str,
group_column: Optional[str]) -> DatasetSplits:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both column names should have "" as the default value, and group_column=None

subject_column: str,
group_column: Optional[str]) -> DatasetSplits:
"""
Takes a slice of values from each data split train/test/val for the provided keys.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a slice of values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just adapted this docstring from from_subject_ids().

@dccastro dccastro merged commit b320649 into master Jan 22, 2021
@dccastro dccastro deleted the dacoelh/grouped-crossval branch January 22, 2021 16:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants