Implement grouped dataset splits and cross-validation #363

dccastro · 2021-01-18T18:10:41Z

This PR adds the ability to specify a group_column in addition to the primary subject_column when creating DatasetSplits. If given, this ensures that subjects within each group cannot be in separate training/test/validation sets or cross-validation folds.
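To illustrate the guarantee described above, here is a minimal standalone sketch, not the code in this PR. The function `grouped_split`, its parameters, and the example IDs are hypothetical. It assigns whole groups to train/test/val, so that no group ends up in more than one set.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def grouped_split(subjects: Sequence[Hashable],
                  groups: Sequence[Hashable],
                  proportions: Tuple[float, float, float] = (0.6, 0.2, 0.2),
                  seed: int = 0) -> Dict[str, List[Hashable]]:
    """Hypothetical helper: splits subjects into train/test/val by whole groups."""
    # Collect the subjects that belong to each group.
    by_group = defaultdict(list)
    for subject, group in zip(subjects, groups):
        by_group[group].append(subject)

    # Shuffle the group IDs (not the subjects) reproducibly.
    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)

    # Proportions are applied to the number of groups, not subjects.
    n_train = round(proportions[0] * len(group_ids))
    n_test = round(proportions[1] * len(group_ids))
    parts = {
        "train": group_ids[:n_train],
        "test": group_ids[n_train:n_train + n_test],
        "val": group_ids[n_train + n_test:],
    }
    return {name: [s for g in ids for s in by_group[g]] for name, ids in parts.items()}


# Example: 20 images, two per patient; the patient ID acts as the group.
subjects = [f"img{i}" for i in range(20)]
groups = [f"patient{i // 2}" for i in range(20)]
splits = grouped_split(subjects, groups)
```

Because the proportions are applied to groups, the resulting subject counts only match the requested proportions when groups are of similar size.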


Shruthi42

For consistency across configs, we should consider having a documented default for the group column name in the csv. For example, we have a default subject column name for segmentation dataset csv files, and we have a parameter in the config which is used to specify the subject column name in scalar datasets.

Shruthi42 · 2021-01-20T12:37:15Z

InnerEye/ML/utils/split_dataset.py

+
+        def pairwise_intersection(*collections: Iterable) -> Set:
+            """Returns any element that appears in more than one collection."""
+            from itertools import combinations


Is there a specific reason this import is local?

Not really; well spotted! I'll move it to the top.

Shruthi42 · 2021-01-20T13:38:12Z

CHANGELOG.md

@@ -12,6 +12,7 @@ created.

 ### Added
 - New extensions of SegmentationModelBases `HeadAndNeckBase` and `ProstateBase`. Use these classes to build your own Head&Neck or Prostate models, by just providing a list of foreground classes.
+- Grouped dataset splits and k-fold cross-validation. This allows, for example, training on datasets with multiple images per subject without leaking data from the same subject across train/test/validation sets or cross-validation folds.


Can we add here the specific change users will need to make to use this feature in their configs?

dccastro · 2021-01-21T15:30:39Z

I thought about adding a default group column name, but couldn't immediately come up with a good catch-all solution... I believe in most cases that would be None (i.e. no grouping). In other cases, we might want group_column to refer to subject ID, while subject_column points to image/series ID instead. The other obvious use-case I could see is grouping by data source/institution/hospital/etc. Do you have any suggestions?

ant0nsc · 2021-01-22T16:41:34Z

InnerEye/ML/utils/split_dataset.py

+                         key_column: str,
+                         subject_column: str,
+                         group_column: Optional[str]) -> DatasetSplits:


Both column names should default to "", and group_column should default to None.

ant0nsc · 2021-01-22T16:41:52Z

InnerEye/ML/utils/split_dataset.py

+                         subject_column: str,
+                         group_column: Optional[str]) -> DatasetSplits:
+        """
+        Takes a slice of values from each data split train/test/val for the provided keys.


What's a slice of values?

I just adapted this docstring from from_subject_ids().

dccastro added 13 commits January 15, 2021 12:57

Fix vacuous test for at-least-one in dataset split

4581725

The expression 'all([len(x[mode]) >= 1] for mode in x.keys())' will always evaluate to True, because 'bool([False]) == True'.
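The difference can be reproduced in isolation. The snippet below is a small illustration with a made-up `splits` dictionary standing in for `x`; it only contrasts the misplaced bracket with the intended expression.

```python
# Hypothetical stand-in for `x`: the test split is empty, so the check should fail.
splits = {"train": ["a", "b"], "test": [], "val": ["c"]}

# Buggy form: the generator yields one-element lists such as [False].
# Any non-empty list is truthy, so all(...) can never return False.
buggy = all([len(splits[mode]) >= 1] for mode in splits.keys())

# Intended form: the generator yields the booleans themselves.
fixed = all(len(splits[mode]) >= 1 for mode in splits.keys())

print(buggy, fixed)
```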

Fix post-init validation of pairwise split intersections

ef86970

Previously, it erroneously checked for empty three-way intersection of train, test, and val, whereas the correct check is for pairwise intersections: train-test, train-val, and test-val.
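A small example shows why the three-way check misses overlaps. The signature and docstring of `pairwise_intersection` follow the diff quoted in the review; the function body here is an assumption written for illustration, and the example sets are made up.

```python
from itertools import combinations
from typing import Iterable, Set


def pairwise_intersection(*collections: Iterable) -> Set:
    """Returns any element that appears in more than one collection."""
    shared: Set = set()
    # Visit every unordered pair: (train, test), (train, val), (test, val).
    for first, second in combinations(map(set, collections), 2):
        shared |= first & second
    return shared


# Subject 3 leaks between train and test but is absent from val.
train, test, val = {1, 2, 3}, {3, 4}, {5, 6}

three_way = train & test & val                      # empty: the leak goes unnoticed
pairwise = pairwise_intersection(train, test, val)  # contains the leaked subject
```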

Add group_column and validation logic

af4cbfd

Add method to split dataset by arbitrary key column

5644714

Delegate DatasetSplits.from_subject_ids to _from_split_keys

b25ffc0

Add DatasetSplits.from_groups convenience method

425de87

Add grouping logic to DatasetSplits.from_proportions

d966ba1

Implement grouped k-fold cross-validation

a98b822

This employs scikit-learn's GroupKFold class.
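For reference, a minimal sketch of how `GroupKFold` keeps groups intact across folds. The arrays below are made up and this is not the code added in the PR; it assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Eight samples, two per group; the feature values are irrelevant to the split.
features = np.zeros((8, 1))
groups = np.array(["g0", "g0", "g1", "g1", "g2", "g2", "g3", "g3"])

folds = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(features, groups=groups):
    # Record which groups land on each side of the fold.
    folds.append((set(groups[train_idx]), set(groups[test_idx])))

# Within each fold, no group appears on both sides.
leaks = [train_groups & test_groups for train_groups, test_groups in folds]
```

Note that `n_splits` cannot exceed the number of distinct groups.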

Add tests for grouped splits and grouped k-fold crossval

0b39867

Update changelog

fb38e4c

Fix mypy warnings

a0a3989

Document that restricted and by-institution splits don't support grouping

0ed76ba

Add validation of groups in test data

fd1ea81

dccastro self-assigned this Jan 19, 2021

Merge branch 'master' into dacoelh/grouped-crossval

7786464

dccastro requested review from ant0nsc and Shruthi42 January 19, 2021 17:29

dccastro marked this pull request as ready for review January 19, 2021 17:30

Shruthi42 reviewed Jan 21, 2021

dccastro added 4 commits January 21, 2021 15:34

Move itertools.combinations import to top of file

2d4b857

Add grouped splits usage instructions to changelog

655e4fa

Merge remote-tracking branch 'origin/master' into dacoelh/grouped-crossval

f3af1f5

Move itertools.combinations import to top of test file

fe81f02

Shruthi42 approved these changes Jan 21, 2021

ant0nsc approved these changes Jan 22, 2021

dccastro merged commit b320649 into master Jan 22, 2021

dccastro deleted the dacoelh/grouped-crossval branch January 22, 2021 16:56
