Fix rare edge case with extremely imbalanced data #1244

mfeurer · 2021-09-13T15:23:36Z

For dataset 360112, Auto-sklearn would fail because the data was
first subsampled, which left some classes with only a single sample.
In the internal splitting, StratifiedShuffleSplit could then not
split the dataset into train and validation sets, so the splitting
fell back to a plain ShuffleSplit. That could put the single sample
of a class into the test set, and at predict time one class would be
missing.
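
The failure mode described above can be reproduced outside Auto-sklearn with a small sketch. This is an illustration only, not the code path in `train_evaluator.py`: the helper name `first_split`, the toy data, and the split parameters are assumptions. It relies on scikit-learn's `StratifiedShuffleSplit` raising a `ValueError` when a class has fewer than two members.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Toy data (hypothetical): class 2 occurs exactly once, as after an
# aggressive subsampling step.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2])


def first_split(X, y, test_size=0.33, seed=0):
    """Try a stratified split; fall back to a plain shuffle split."""
    try:
        splitter = StratifiedShuffleSplit(
            n_splits=1, test_size=test_size, random_state=seed
        )
        train, test = next(splitter.split(X, y))
        stratified = True
    except ValueError:
        # Stratification is impossible with a singleton class, so the
        # fallback ignores the labels entirely.
        splitter = ShuffleSplit(
            n_splits=1, test_size=test_size, random_state=seed
        )
        train, test = next(splitter.split(X))
        stratified = False
    return train, test, stratified


train, test, stratified = first_split(X, y)
# Nothing prevents the single sample of class 2 from landing in `test`.
singleton_in_test = 9 in test
```

Whether `singleton_in_test` is true depends on the random seed, which is why the problem only shows up as a rare edge case.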

This commit adds two new splitters that move a sample from the
test split to the training split whenever a class is missing from
the train split.
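
A minimal sketch of the repair step, assuming index arrays like those returned by a scikit-learn splitter. The function name `move_missing_classes_to_train` is hypothetical and this is not the implementation in `autosklearn/evaluation/splitter.py`; it only illustrates the idea of moving one sample per missing class.

```python
import numpy as np


def move_missing_classes_to_train(y, train_idx, test_idx):
    """Move one test sample per class that is absent from the train split."""
    y = np.asarray(y)
    train_idx = np.asarray(train_idx, dtype=int)
    test_idx = np.asarray(test_idx, dtype=int)

    # Classes that appear in the test split but not in the train split.
    missing = np.setdiff1d(y[test_idx], y[train_idx])
    for cls in missing:
        # Position (within test_idx) of the first sample of this class.
        pos = np.flatnonzero(y[test_idx] == cls)[0]
        train_idx = np.append(train_idx, test_idx[pos])
        test_idx = np.delete(test_idx, pos)
    return train_idx, test_idx


# Class 2 has a single sample (index 9), which starts out in the test split.
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2])
train, test = move_missing_classes_to_train(
    y, train_idx=[0, 1, 2, 5, 6, 7], test_idx=[3, 4, 8, 9]
)
```

After the call, every class present in the data is also present in `train`, at the cost of a slightly smaller test split.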

mfeurer · 2021-09-13T15:42:59Z

CC @eddiebergman we could use these splitters to simplify subsample_if_too_large, right?

codecov · 2021-09-13T15:56:02Z

Codecov Report

Merging #1244 (a88d754) into development (63808ef) will decrease coverage by 0.17%.
The diff coverage is 64.94%.

@@               Coverage Diff               @@
##           development    #1244      +/-   ##
===============================================
- Coverage        88.25%   88.08%   -0.18%     
===============================================
  Files              139      140       +1     
  Lines            11034    11128      +94     
===============================================
+ Hits              9738     9802      +64     
- Misses            1296     1326      +30

| Impacted Files | Coverage Δ | |
|---|---|---|
| autosklearn/evaluation/splitter.py | `59.75% <59.75%> (ø)` | |
| autosklearn/evaluation/train_evaluator.py | `92.39% <93.33%> (-0.03%)` | ⬇️ |
| autosklearn/util/logging_.py | `88.96% <0.00%> (+0.68%)` | ⬆️ |
| ...eline/components/feature_preprocessing/fast_ica.py | `97.82% <0.00%> (+6.52%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

eddiebergman · 2021-09-14T08:36:49Z

> CC @eddiebergman we could use these splitters to simplify subsample_if_too_large, right?

Yes, this splitter would probably work in that case too, or at least I see no reason it shouldn't.

fix unit test

a88d754

mfeurer merged commit ff11e5a into development Sep 13, 2021

mfeurer deleted the edge_case_very_inbalanced_data branch September 13, 2021 18:14

github-actions bot pushed a commit that referenced this pull request Sep 13, 2021

Matthias Feurer: Fix rare edge case with extremely inbalanced data (#…

776f035

…1244)
