Allow argument to specify how auto-sklearn handles compressing dataset size #1341

eddiebergman · 2021-12-08T22:30:46Z

Builds on "Dataset size reduction fixed, updated TargetValidator to match signatures" #1250, which introduces parametrized dataset compression and uses the custom splitters.
Still requires documentation. This should probably wait until use_new_splitter is merged so that the documentation can be added to the newly updated docs.

Add an argument to AutoSklearnEstimator that lets the user specify how dataset reduction is performed and whether it is enabled at all. This also serves as some documentation of how we perform dataset compression (precision reduction and subsampling) and of what issues may arise.

AutoSklearnClassifier(
    ...,
    dataset_compression: Union[bool, Dict[str, Any]] = True
)

Compression can be enabled or disabled with True or False.
Passing a dict implies True and accepts the following keys:
- "memory_allocation" - a float for a fraction of the memory limit, or an int for an absolute amount of memory
- "methods" - a list of the dataset reduction methods to perform
  - "precision" - perform float precision reduction
  - "subsample" - subsample the data so that it fits into the memory allocation

# default
{
    "memory_allocation": 0.1,  # How much of memory to compress the data into
    "methods": ["precision", "subsample"],  # Which methods to use
}
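
As a rough illustration of how the two keys above might interact, the sketch below turns `memory_allocation` into an absolute budget and then walks through `methods` in order until the data would fit. This is not auto-sklearn's implementation: the function names `memory_budget_mb` and `plan_reduction` are hypothetical, and the MB unit, the halving effect of precision reduction, and the "stop once it fits" ordering are all assumptions made for the example.

```python
from typing import List, Union


def memory_budget_mb(memory_allocation: Union[int, float], memory_limit_mb: int) -> float:
    """Turn ``memory_allocation`` into an absolute budget (MB is an assumed unit)."""
    if isinstance(memory_allocation, bool):
        # bool is a subclass of int, so reject it explicitly.
        raise TypeError("memory_allocation must be a float or an int")
    if isinstance(memory_allocation, float):
        # A float is read as a fraction of the overall memory limit.
        return memory_allocation * memory_limit_mb
    # An int is read as an absolute amount of memory.
    return float(memory_allocation)


def plan_reduction(data_mb: float, budget_mb: float, methods: List[str]) -> List[str]:
    """List the methods that would be applied, in order, until the data fits."""
    applied: List[str] = []
    size = data_mb
    for method in methods:
        if size <= budget_mb:
            break  # already small enough, nothing more to do
        if method == "precision":
            size /= 2  # assumption: e.g. float64 -> float32 halves the footprint
        elif method == "subsample":
            size = budget_mb  # assumption: keep only as many rows as fit
        else:
            raise ValueError(f"unknown method: {method!r}")
        applied.append(method)
    return applied


budget = memory_budget_mb(0.1, memory_limit_mb=3072)
print(plan_reduction(1000.0, budget, ["precision", "subsample"]))
```

Under these assumptions, omitting "subsample" from `methods` means the data may still exceed the budget after precision reduction, which is the trade-off a user accepts when they need every row to survive.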

A user can specify only some of these keys or pass None and we fill the rest with these default values. Includes validation of the passed items.
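
The default filling and validation described here could look roughly like the sketch below. It is only an approximation under stated assumptions: `resolve_dataset_compression` is a hypothetical helper name, and the exact checks (allowed ranges, error types) are guesses rather than the rules this PR implements.

```python
from typing import Any, Dict, Optional, Union

# Defaults as listed in the PR description.
DEFAULTS: Dict[str, Any] = {
    "memory_allocation": 0.1,
    "methods": ["precision", "subsample"],
}
KNOWN_METHODS = {"precision", "subsample"}


def resolve_dataset_compression(
    arg: Union[bool, Dict[str, Any], None],
) -> Optional[Dict[str, Any]]:
    """Return a complete config dict, or None when compression is disabled."""
    if arg is False:
        return None
    if arg is True or arg is None:
        user: Dict[str, Any] = {}
    elif isinstance(arg, dict):
        user = arg
    else:
        raise TypeError("dataset_compression must be a bool, a dict or None")

    unknown = set(user) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")

    # User-supplied keys win, missing keys fall back to the defaults.
    config = {**DEFAULTS, **user}
    config["methods"] = list(config["methods"])  # copy, so DEFAULTS is never mutated

    alloc = config["memory_allocation"]
    if isinstance(alloc, bool) or not isinstance(alloc, (int, float)):
        raise TypeError("memory_allocation must be a float or an int")
    if isinstance(alloc, float) and not 0.0 < alloc < 1.0:
        raise ValueError("a float memory_allocation must lie in (0, 1)")
    if isinstance(alloc, int) and alloc < 1:
        raise ValueError("an int memory_allocation must be at least 1")

    bad = [m for m in config["methods"] if m not in KNOWN_METHODS]
    if bad:
        raise ValueError(f"unknown methods: {bad}")
    return config


print(resolve_dataset_compression({"methods": ["subsample"]}))
```

Merging with `{**DEFAULTS, **user}` keeps the partial-dict case simple: a user who only cares about `methods` still gets the default `memory_allocation`.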

Also updated the docstring for resampling_strategy to inform users that they may wish to disable this if using custom object splitters.

This PR addresses:

- Getting only TIMEOUT for PredefinedSplit #1274 - A user trying to use a predefined split would see failures if the dataset was reduced.
- memory_limit interferes with "resampling_strategy=GroupKFold" #1137 - The specified groups no longer align with the reduced dataset size.
- User has no way to control dataset size reduction for large datasets and small memory #1279 - Describes directly what this PR addresses.
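
The misalignment behind the splitter-related issues can be reproduced in a few lines of plain Python. This is a toy illustration, not auto-sklearn code: `subsample_rows` is a hypothetical helper, and the data and group labels are made up for the example.

```python
import random
from typing import List, Sequence, Tuple


def subsample_rows(rows: Sequence, n_keep: int, seed: int = 0) -> Tuple[List, List[int]]:
    """Keep ``n_keep`` randomly chosen rows and report which indices survived."""
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(len(rows)), n_keep))
    return [rows[i] for i in kept], kept


# Ten samples, with group labels the user prepared for a group-based splitter.
X = [[float(i)] for i in range(10)]
groups = [i // 2 for i in range(10)]

# Subsampling shrinks X, but the user's group labels still describe 10 rows.
X_small, kept = subsample_rows(X, n_keep=6)
mismatch = len(X_small) != len(groups)
print(len(X_small), len(groups), mismatch)

# The labels only line up again if they are reduced with the same indices.
groups_small = [groups[i] for i in kept]
print(len(groups_small) == len(X_small))
```

Since a user-constructed splitter cannot know which rows were dropped, disabling compression (for example with `dataset_compression=False`) is the straightforward way to keep such splitters valid.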

codecov · 2021-12-08T22:55:08Z

Codecov Report

Merging #1341 (f03f76e) into use_new_splitter (42e4397) will decrease coverage by 0.42%.
The diff coverage is 100.00%.

@@                 Coverage Diff                  @@
##           use_new_splitter    #1341      +/-   ##
====================================================
- Coverage             88.82%   88.40%   -0.43%     
====================================================
  Files                   140      140              
  Lines                 11912    11245     -667     
====================================================
- Hits                  10581     9941     -640     
+ Misses                 1331     1304      -27

Impacted Files	Coverage Δ
autosklearn/automl.py	`86.98% <100.00%> (+0.11%)`	⬆️
autosklearn/estimators.py	`93.45% <100.00%> (+0.03%)`	⬆️
autosklearn/experimental/askl2.py	`83.33% <100.00%> (ø)`
autosklearn/util/data.py	`82.08% <100.00%> (+4.73%)`	⬆️
...sklearn/pipeline/components/regression/__init__.py	`83.52% <0.00%> (-6.31%)`	⬇️
...arn/pipeline/components/classification/__init__.py	`84.94% <0.00%> (-5.90%)`	⬇️
autosklearn/pipeline/components/base.py	`78.78% <0.00%> (-3.33%)`	⬇️
autosklearn/evaluation/__init__.py	`85.47% <0.00%> (-2.91%)`	⬇️
...ata_preprocessing/categorical_encoding/__init__.py	`85.71% <0.00%> (-2.80%)`	⬇️
...eline/components/feature_preprocessing/__init__.py	`89.33% <0.00%> (-2.26%)`	⬇️
... and 8 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 42e4397...f03f76e.

autosklearn/estimators.py

autosklearn/automl.py

autosklearn/estimators.py

autosklearn/util/data.py

test/test_automl/test_automl.py

test/test_util/test_data.py

…ures (#1250) * Moved to new splitter, moved to util file * flake8'd * Fixed errors, added test specifically for CustomStratifiedShuffleSplit * flake8'd * Updated docstring * Updated types in docstring * reduce_dataset_size_if_too_large supports more types * flake8'd * flake8'd * Updated docstring * Seperated out the data subsampling into individual functions * Improved typing from Automl.fit to reduce_dataset_size_if_too_large * flak8'd * subsample tested * Finished testing and flake8'd * Cleaned up transform function that was touched * ^ * Removed double typing * Cleaned up typing of convert_if_sparse * Cleaned up splitters and added size test * Cleanup doc in data * rogue line added was removed * Test fix * flake8'd * Typo fix * Fixed ordering of things * Fixed typing and tests of target_validator fit, transform, inv_transform * Updated doc * Updated Type return * Removed elif gaurd * removed extraneuous overload * Updated return type of feature validator * Type fixes for target validator fit * flake8'd * Moved to new splitter, moved to util file * flake8'd * Fixed errors, added test specifically for CustomStratifiedShuffleSplit * flake8'd * Updated docstring * Updated types in docstring * reduce_dataset_size_if_too_large supports more types * flake8'd * flake8'd * Updated docstring * Seperated out the data subsampling into individual functions * Improved typing from Automl.fit to reduce_dataset_size_if_too_large * flak8'd * subsample tested * Finished testing and flake8'd * Cleaned up transform function that was touched * ^ * Removed double typing * Cleaned up typing of convert_if_sparse * Cleaned up splitters and added size test * Cleanup doc in data * rogue line added was removed * Test fix * flake8'd * Typo fix * Fixed ordering of things * Fixed typing and tests of target_validator fit, transform, inv_transform * Updated doc * Updated Type return * Removed elif gaurd * removed extraneuous overload * Updated return type of feature validator * Type fixes for 
target validator fit * flake8'd * Fixed err message str and automl sparse y tests * Flak8'd * Fix sort indices * list type to List * Remove uneeded comment * Updated comment to make it more clear * Comment update * Fixed warning message for reduce_dataset_if_too_large * Fix test * Added check for error message in tests * Test Updates * Fix error msg * reinclude csr y to test * Reintroduced explicit subsample values test * flaked * Missed an uncomment * Update the comment for test of splitters * Updated warning message in CustomSplitter * Update comment in test * Update tests * Removed overloads * Narrowed type of subsample * Removed overload import * Fix `todense` giving np.matrix, using `toarray` * Made subsampling a little less aggresive * Changed multiplier back to 10 * Allow argument to specfiy how auto-sklearn handles compressing dataset size (#1341) * Added dataset_compression parameter and validation * Fix docstring * Updated docstring for `resampling_strategy` * Updated param def and memory_allocation can now be absolute * insert newline * Fix params into one line * fix indentation in docs * fix import breaks * Allow absolute memory_allocation * Tests * Update test on for precision omitted from methods * Update test for akslearn2 with same args * Update to use TypedDict for better Mypy parsing * Added arg to asklearn2 * Updated tests to remove some warnings * flaked * Fix broken link? * Remove TypedDict as it's not supported in Python3.7 * Missing import * Review changes * Fix magic mock for python < 3.9 * Fixed bad merge

… and #1250 (#1386) * Add: Doc for `dataset_compression` * Fix: Shorten line * Doc: Make more clear that the argument None still provides defaults


eddiebergman added 3 commits December 8, 2021 23:08

Added dataset_compression parameter and validation

5f1700e

Fix docstring

cddc164

Updated docstring for resampling_strategy

7598f69

eddiebergman added PR: In progress labels Dec 8, 2021

eddiebergman linked an issue Dec 8, 2021 that may be closed by this pull request

Getting only TIMEOUT for PredefinedSplit #1274

Closed

eddiebergman removed a link to an issue Dec 8, 2021

Getting only TIMEOUT for PredefinedSplit #1274

Closed

eddiebergman linked an issue Dec 8, 2021 that may be closed by this pull request

No support for precision reduction when reducing dataset size for pandas dataframe or series. #1278

Open

eddiebergman removed a link to an issue Dec 8, 2021

No support for precision reduction when reducing dataset size for pandas dataframe or series. #1278

Open

This was linked to issues Dec 8, 2021

Getting only TIMEOUT for PredefinedSplit #1274

Closed

User has no way to control dataset size reduction for large datasets and small memory #1279

Closed

memory_limit interferes with "resampling_strategy=GroupKFold" #1137

Closed

Updated param def and memory_allocation can now be absolute

a22ccba

eddiebergman changed the base branch from development to use_new_splitter December 9, 2021 11:23

eddiebergman added 15 commits December 9, 2021 12:24

insert newline

6baeff3

Fix params into one line

9fc03f4

fix indentation in docs

06e33a6

fix import breaks

4af4001

Allow absolute memory_allocation

9dd0f9c

Tests

e8e8eb6

Update test on for precision omitted from methods

d949ede

Update test for akslearn2 with same args

128f8a9

Update to use TypedDict for better Mypy parsing

a993b90

Added arg to asklearn2

3391b22

Updated tests to remove some warnings

35a2f59

flaked

80be8d0

Fix broken link?

81535de

Remove TypedDict as it's not supported in Python3.7

c57ca64

Missing import

7b43711

mfeurer reviewed Dec 14, 2021

eddiebergman added 2 commits December 14, 2021 10:41

Review changes

053b6c2

Fix magic mock for python < 3.9

f03f76e

mfeurer approved these changes Dec 14, 2021

eddiebergman merged commit 9bcb210 into use_new_splitter Dec 17, 2021

eddiebergman mentioned this pull request Feb 2, 2022

memory_limit interferes with "resampling_strategy=GroupKFold" #1137

Closed

github-actions bot pushed a commit that referenced this pull request Feb 3, 2022

Eddie Bergman: Doc: Adds documentation for the dataset compression argument from #1341 and #1250 (#1386)

94cd247

eddiebergman deleted the allow_resampling_arg branch February 9, 2022 14:27
