Allow argument to specify how auto-sklearn handles compressing dataset size #1341

Merged
merged 21 commits into from
Dec 17, 2021

@eddiebergman eddiebergman commented Dec 8, 2021

Add an argument to AutoSklearnEstimator that lets the user specify how dataset reduction is performed and whether it is enabled at all. This also documents how we compress datasets (precision reduction and subsampling) and what issues may arise.

AutoSklearnClassifier(
    ...,
    dataset_compression: Union[bool, Dict[str, Any]] = True
)
  • Can be enabled or disabled with `True` or `False`
  • Can pass in a dict (implying `True`)
    • "memory_allocation" - a float for a fraction of the memory limit (e.g. 0.1 for 10%), or an int for an absolute amount of memory
    • "methods" - list of dataset reduction methods to perform
      • "precision" - perform float precision reduction
      • "subsample" - subsample the data so that it fits into the memory allocation.
# default
{
    "memory_allocation": 0.1,  # How much of memory to compress the data into
    "methods": ["precision", "subsample"],  # Which methods to use
}

A user can specify only some of these keys, or pass None, and we fill in the rest with these default values. The passed items are validated.
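To make the fill-in and validation behaviour concrete, here is a minimal sketch. It is not the code from this PR: the helper name `resolve_dataset_compression`, the `memory_limit_mb` parameter, the megabyte unit for an int allocation, and the exact bounds checked are all assumptions for illustration.

```python
from typing import Any, Dict, Optional, Union

# Defaults as listed in the PR description.
DEFAULTS: Dict[str, Any] = {
    "memory_allocation": 0.1,
    "methods": ["precision", "subsample"],
}
KNOWN_METHODS = {"precision", "subsample"}


def resolve_dataset_compression(
    arg: Union[bool, Dict[str, Any], None],
    memory_limit_mb: int,
) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return a complete config, or None when disabled."""
    if arg is False:
        return None

    # True, None and partial dicts all start from the defaults.
    user = dict(arg) if isinstance(arg, dict) else {}
    unknown_keys = set(user) - set(DEFAULTS)
    if unknown_keys:
        raise ValueError(f"Unknown keys: {sorted(unknown_keys)}")
    config = {**DEFAULTS, **user}

    alloc = config["memory_allocation"]
    if isinstance(alloc, bool) or not isinstance(alloc, (int, float)):
        raise ValueError("memory_allocation must be a float or an int")
    if isinstance(alloc, float):
        # Float: a fraction of the memory limit (assumed bounds: 0 < x < 1).
        if not 0.0 < alloc < 1.0:
            raise ValueError("float memory_allocation must be in (0, 1)")
        alloc_mb = alloc * memory_limit_mb
    else:
        # Int: an absolute amount (assumed here to be megabytes).
        if not 0 < alloc < memory_limit_mb:
            raise ValueError("int memory_allocation must be below the limit")
        alloc_mb = float(alloc)

    unknown_methods = set(config["methods"]) - KNOWN_METHODS
    if unknown_methods:
        raise ValueError(f"Unknown methods: {sorted(unknown_methods)}")

    return {"memory_allocation": alloc_mb, "methods": list(config["methods"])}


# A partial dict: "memory_allocation" is filled in from the defaults.
print(resolve_dataset_compression({"methods": ["precision"]}, memory_limit_mb=3072))
```

Resolving the allocation to a single absolute number up front means the later reduction steps only have to compare sizes, regardless of whether the user passed a float or an int.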

Also updated the docstring for resampling_strategy to inform users that they may wish to disable dataset compression when using custom splitter objects.
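For readers unfamiliar with the two methods, the following NumPy sketch shows the general idea of "precision" and "subsample". It is an illustration under assumptions, not the implementation in autosklearn/util/data.py: the function names and the `allocation_bytes` parameter are made up here, and it uses plain random sampling, whereas the commits in this PR refer to a stratified splitter.

```python
import numpy as np


def reduce_precision(X: np.ndarray) -> np.ndarray:
    """Downcast float64 to float32, halving the memory used by X."""
    if X.dtype == np.float64:
        return X.astype(np.float32)
    return X


def subsample(X: np.ndarray, y: np.ndarray, allocation_bytes: int, seed: int = 0):
    """Keep a random subset of rows so that X fits into allocation_bytes."""
    if X.nbytes <= allocation_bytes:
        return X, y
    bytes_per_row = X.nbytes // X.shape[0]
    n_keep = max(1, allocation_bytes // bytes_per_row)
    rng = np.random.default_rng(seed)
    # Sort the chosen indices so the original row order is preserved.
    idx = np.sort(rng.choice(X.shape[0], size=n_keep, replace=False))
    return X[idx], y[idx]


X = np.random.default_rng(0).random((1000, 10))  # float64 features
y = np.arange(1000) % 2                          # toy binary labels

X32 = reduce_precision(X)
X_small, y_small = subsample(X32, y, allocation_bytes=10_000)
print(X.nbytes, X32.nbytes, X_small.shape)
```

Note that subsampling changes the number of rows that reach the splitter. A custom splitter built around the original row count or row indices may then no longer line up with the data, which is a plausible reason to turn compression off in that case.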

This PR addresses:

codecov bot commented Dec 8, 2021

Codecov Report

Merging #1341 (f03f76e) into use_new_splitter (42e4397) will decrease coverage by 0.42%.
The diff coverage is 100.00%.

@@                 Coverage Diff                  @@
##           use_new_splitter    #1341      +/-   ##
====================================================
- Coverage             88.82%   88.40%   -0.43%     
====================================================
  Files                   140      140              
  Lines                 11912    11245     -667     
====================================================
- Hits                  10581     9941     -640     
+ Misses                 1331     1304      -27     
Impacted Files                                         Coverage Δ
autosklearn/automl.py                                  86.98% <100.00%> (+0.11%) ⬆️
autosklearn/estimators.py                              93.45% <100.00%> (+0.03%) ⬆️
autosklearn/experimental/askl2.py                      83.33% <100.00%> (ø)
autosklearn/util/data.py                               82.08% <100.00%> (+4.73%) ⬆️
...sklearn/pipeline/components/regression/__init__.py  83.52% <0.00%> (-6.31%) ⬇️
...arn/pipeline/components/classification/__init__.py  84.94% <0.00%> (-5.90%) ⬇️
autosklearn/pipeline/components/base.py                78.78% <0.00%> (-3.33%) ⬇️
autosklearn/evaluation/__init__.py                     85.47% <0.00%> (-2.91%) ⬇️
...ata_preprocessing/categorical_encoding/__init__.py  85.71% <0.00%> (-2.80%) ⬇️
...eline/components/feature_preprocessing/__init__.py  89.33% <0.00%> (-2.26%) ⬇️
... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@eddiebergman eddiebergman changed the base branch from development to use_new_splitter December 9, 2021 11:23
@eddiebergman eddiebergman merged commit 9bcb210 into use_new_splitter Dec 17, 2021
eddiebergman added a commit that referenced this pull request Feb 1, 2022
…ures (#1250)

* Moved to new splitter, moved to util file

* flake8'd

* Fixed errors, added test specifically for CustomStratifiedShuffleSplit

* flake8'd

* Updated docstring

* Updated types in docstring

* reduce_dataset_size_if_too_large supports more types

* flake8'd

* flake8'd

* Updated docstring

* Separated out the data subsampling into individual functions

* Improved typing from Automl.fit to reduce_dataset_size_if_too_large

* flake8'd

* subsample tested

* Finished testing and flake8'd

* Cleaned up transform function that was touched

* ^

* Removed double typing

* Cleaned up typing of convert_if_sparse

* Cleaned up splitters and added size test

* Cleanup doc in data

* rogue line added was removed

* Test fix

* flake8'd

* Typo fix

* Fixed ordering of things

* Fixed typing and tests of target_validator fit, transform, inv_transform

* Updated doc

* Updated Type return

* Removed elif guard

* removed extraneous overload

* Updated return type of feature validator

* Type fixes for target validator fit

* flake8'd

* Fixed err message str and automl sparse y tests

* flake8'd

* Fix sort indices

* list type to List

* Remove unneeded comment

* Updated comment to make it more clear

* Comment update

* Fixed warning message for reduce_dataset_if_too_large

* Fix test

* Added check for error message in tests

* Test Updates

* Fix error msg

* reinclude csr y to test

* Reintroduced explicit subsample values test

* flaked

* Missed an uncomment

* Update the comment for test of splitters

* Updated warning message in CustomSplitter

* Update comment in test

* Update tests

* Removed overloads

* Narrowed type of subsample

* Removed overload import

* Fix `todense` giving np.matrix, using `toarray`

* Made subsampling a little less aggressive

* Changed multiplier back to 10

* Allow argument to specify how auto-sklearn handles compressing dataset size (#1341)

* Added dataset_compression parameter and validation

* Fix docstring

* Updated docstring for `resampling_strategy`

* Updated param def and memory_allocation can now be absolute

* insert newline

* Fix params into one line

* fix indentation in docs

* fix import breaks

* Allow absolute memory_allocation

* Tests

* Update test for precision omitted from methods

* Update test for asklearn2 with same args

* Update to use TypedDict for better Mypy parsing

* Added arg to asklearn2

* Updated tests to remove some warnings

* flaked

* Fix broken link?

* Remove TypedDict as it's not supported in Python3.7

* Missing import

* Review changes

* Fix magic mock for python < 3.9

* Fixed bad merge
mfeurer pushed a commit that referenced this pull request Feb 3, 2022
… and #1250 (#1386)

* Add: Doc for `dataset_compression`

* Fix: Shorten line

* Doc: Make more clear that the argument None still provides defaults
github-actions bot pushed a commit that referenced this pull request Feb 3, 2022
@eddiebergman eddiebergman deleted the allow_resampling_arg branch February 9, 2022 14:27
eddiebergman added a commit that referenced this pull request Aug 18, 2022
…ures (#1250)

eddiebergman added a commit that referenced this pull request Aug 18, 2022
… and #1250 (#1386)
