Fix rare edge case with extremely imbalanced data #1244

Merged: 2 commits, Sep 13, 2021

@mfeurer mfeurer commented Sep 13, 2021

For dataset 360112, Auto-sklearn failed because the data was first
subsampled, which left some classes with only a single sample.
During internal splitting, StratifiedShuffleSplit could not split
the dataset into train and validation sets, so the code fell back to
a plain ShuffleSplit. That could put the only sample of a class into
the test split, leaving the class absent from the training data and
therefore missing at predict time.
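
To illustrate the failure mode, here is a minimal, library-free sketch. It is not Auto-sklearn code: the `shuffle_split` helper and the toy labels are hypothetical, and the helper only mimics what an unstratified shuffle split does. It counts how often a class with a single sample ends up entirely outside the training indices.

```python
import random


def shuffle_split(n_samples, test_size, seed):
    """Unstratified split: shuffle all indices, take the first part as test."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]  # (train, test)


# Hypothetical toy labels: class 2 occurs exactly once (index 9).
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]

# Collect the seeds for which the singleton class is missing from train.
missing_seeds = []
for seed in range(50):
    train_idx, test_idx = shuffle_split(len(y), test_size=0.33, seed=seed)
    train_classes = {y[i] for i in train_idx}
    if 2 not in train_classes:
        missing_seeds.append(seed)

print(f"{len(missing_seeds)} of 50 splits have no class-2 sample in train")
```

Any split counted here would train a model that has never seen class 2, which matches the symptom described above.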

This commit creates two new splitters which move a sample from the
test split to the training split if a class does not exist in the
train split.
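
The idea can be sketched roughly as follows. This is an illustration only, not the code added in autosklearn/evaluation/splitter.py: the function name `move_missing_classes_to_train` and the index-list interface are assumptions made for the example.

```python
def move_missing_classes_to_train(train_idx, test_idx, y):
    """Return (train, test) such that every class in test also occurs in train.

    For each class that appears only in the test split, one of its test
    samples is moved to the train split.
    """
    train = list(train_idx)
    test = list(test_idx)
    train_classes = {y[i] for i in train}
    for i in list(test):  # iterate over a copy because `test` is mutated
        if y[i] not in train_classes:
            test.remove(i)
            train.append(i)
            train_classes.add(y[i])
    return train, test


# Hypothetical toy data: class 2 occurs exactly once (index 9).
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]
train_idx = [0, 1, 2, 5, 6, 7]  # class 2 is absent here
test_idx = [3, 4, 8, 9]         # the only class-2 sample is here

train, test = move_missing_classes_to_train(train_idx, test_idx, y)
print(train)  # [0, 1, 2, 5, 6, 7, 9]
print(test)   # [3, 4, 8]
```

Only one sample per missing class is moved, so the test split shrinks as little as possible. A class with a single sample then has no test representative, which is the trade-off this approach accepts.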

mfeurer commented Sep 13, 2021

CC @eddiebergman we could use these splitters to simplify subsample_if_too_large, right?


codecov bot commented Sep 13, 2021

Codecov Report

Merging #1244 (a88d754) into development (63808ef) will decrease coverage by 0.17%.
The diff coverage is 64.94%.

@@               Coverage Diff               @@
##           development    #1244      +/-   ##
===============================================
- Coverage        88.25%   88.08%   -0.18%     
===============================================
  Files              139      140       +1     
  Lines            11034    11128      +94     
===============================================
+ Hits              9738     9802      +64     
- Misses            1296     1326      +30     

Impacted Files                                           Coverage Δ
autosklearn/evaluation/splitter.py                       59.75% <59.75%> (ø)
autosklearn/evaluation/train_evaluator.py                92.39% <93.33%> (-0.03%) ⬇️
autosklearn/util/logging_.py                             88.96% <0.00%> (+0.68%) ⬆️
...eline/components/feature_preprocessing/fast_ica.py    97.82% <0.00%> (+6.52%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mfeurer mfeurer merged commit ff11e5a into development Sep 13, 2021
@mfeurer mfeurer deleted the edge_case_very_inbalanced_data branch September 13, 2021 18:14
@eddiebergman

> CC @eddiebergman we could use these splitters to simplify subsample_if_too_large, right?

Yes, this splitter would probably work in that case too; at least I see no reason it shouldn't.
