
Dataset size reduction fixed, updated TargetValidator to match signatures #1250

Merged — 99 commits, Feb 1, 2022

Conversation

@eddiebergman (Contributor) commented Sep 15, 2021

This PR's main goal was to have reduce_dataset_size_if_too_large use the new splitter, but it ended up being easier to correctly type, fix, and test as separate functions.

Part of this was confirming the types it would receive by going through and fixing the types from the entry point AutoML.fit() through to where reduce_dataset_size_if_too_large is called. This led to also fixing the typing in InputValidator, which runs before reduce_dataset_size_if_too_large and had some oddities with type casting.

Major changes

  • Autosklearn now subsamples less aggressively than before. The previous version didn't account for the new dtype byte sizes once precision was reduced.
  • Autosklearn now subsamples using a CustomStratifiedShuffleSplit that accounts for unique labels.
  • TargetValidator no longer accepts sparse y, in line with the rest of autosklearn.
  • FeatureValidator now correctly returns the types it specifies from fit and transform
    • fit() returns specifically a FeatureValidator and not generically a BaseEstimator
    • transform now correctly specifies the types it can return, also changing list -> DataFrame to list -> np.ndarray
  • TargetValidator now correctly returns the types it specifies from fit, _fit, transform, and inverse_transform, improving typing in automl.py.
    • fit() and _fit() return specifically a TargetValidator and not generically a BaseEstimator
    • transform and inverse_transform no longer accept y of type spmatrix and always return an ndarray
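To illustrate the first point: once precision is reduced, a memory estimate must use the new dtype's itemsize, not the original one. A minimal sketch of the idea (estimate_memory_mb is a hypothetical helper for illustration, not auto-sklearn's API):

```python
import numpy as np

def estimate_memory_mb(X: np.ndarray) -> float:
    # nbytes already reflects the array's current dtype itemsize
    return X.nbytes / (2 ** 20)

X = np.ones((1000, 100), dtype=np.float64)
before = estimate_memory_mb(X)           # 8 bytes per element

# Precision reduction halves the per-element byte size
X_reduced = X.astype(np.float32)
after = estimate_memory_mb(X_reduced)    # 4 bytes per element
```

Subsampling based on the pre-cast estimate would be twice as aggressive as necessary, which is the behavior the fix addresses.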

Minor changes

  • AutoML.subsample_if_too_large -> autosklearn.util.data.reduce_dataset_size_if_too_large
  • This function was also split into two separate functions, subsample and reduce_precision
  • CustomStratifiedShuffleSplit was fixed to respect the training size
  • Added a specific test for CustomStratifiedShuffleSplit checking for unique labels and that train_size is respected
  • Type fixes from the start of AutoML.fit() up until the call to reduce_dataset_size_if_too_large
    • Also removed the typing prefix, e.g. typing.List -> List, to prevent clutter in long signatures and docs
  • Updated tests for TargetValidator in line with its cleaned-up typing and fixes.
  • Use @overload in InputValidator.transform so mypy correctly knows when to expect one return value vs two
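The @overload pattern from the last point can be sketched as follows (Validator here is a toy stand-in, not InputValidator's actual signature):

```python
from typing import Optional, Tuple, overload

import numpy as np

class Validator:
    @overload
    def transform(self, X: np.ndarray) -> np.ndarray: ...

    @overload
    def transform(self, X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]: ...

    def transform(self, X: np.ndarray, y: Optional[np.ndarray] = None):
        # One return value without y, two with it; the overloads let
        # mypy pick the right return type at each call site.
        if y is None:
            return X
        return X, y
```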

Notes:

The types in general still have to be updated and confirmed, as pointed out in issue #1264.

New functions

# autosklearn/util/data.py

def reduce_dataset_size_if_too_large(
    X: Union[np.ndarray, spmatrix],
    y: np.ndarray,
    memory_limit: int,
    is_classification: bool,
    random_state: Union[int, np.random.RandomState] = None,
    operations: List[str] = ['precision', 'subsample'],
    multiplier: Union[float, int] = 10,
) -> Tuple[Union[np.ndarray, spmatrix], np.ndarray]:
    ...

def reduce_precision(
    X: Union[np.ndarray, spmatrix]
) -> Tuple[Union[np.ndarray, spmatrix], Type]:
    ...

def subsample(
    X: SUPPORTED_FEAT_TYPES,
    y: Union[List, np.ndarray, pd.DataFrame, pd.Series],
    is_classification: bool,
    sample_size: Union[float, int],
) -> Tuple[SUPPORTED_FEAT_TYPES, Union[List, np.ndarray, pd.DataFrame, pd.Series]]:
    ...

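For classification, subsample keeps the label distribution intact. Its behavior can be approximated with scikit-learn's train_test_split and its stratify parameter (the real implementation uses the CustomStratifiedShuffleSplit described above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)  # balanced binary labels

# Keep 20 of the 50 samples; stratify preserves class proportions
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=20, stratify=y, random_state=0
)
```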
@codecov bot commented Sep 15, 2021

Codecov Report

Merging #1250 (a1cc277) into development (887bb10) will increase coverage by 0.29%.
The diff coverage is 96.35%.


@@               Coverage Diff               @@
##           development    #1250      +/-   ##
===============================================
+ Coverage        88.02%   88.31%   +0.29%     
===============================================
  Files              140      140              
  Lines            10973    11070      +97     
===============================================
+ Hits              9659     9777     +118     
+ Misses            1314     1293      -21     
Impacted Files Coverage Δ
autosklearn/evaluation/splitter.py 94.38% <90.00%> (+34.62%) ⬆️
autosklearn/data/validation.py 97.77% <91.66%> (+0.05%) ⬆️
autosklearn/data/target_validator.py 96.66% <94.87%> (-0.42%) ⬇️
autosklearn/automl.py 86.75% <95.45%> (-0.12%) ⬇️
autosklearn/util/data.py 82.08% <98.00%> (+41.54%) ⬆️
autosklearn/data/feature_validator.py 97.50% <100.00%> (ø)
autosklearn/estimators.py 93.45% <100.00%> (+0.03%) ⬆️
autosklearn/experimental/askl2.py 83.33% <100.00%> (ø)
autosklearn/util/logging_.py 88.96% <0.00%> (-1.38%) ⬇️
autosklearn/ensemble_builder.py 85.82% <0.00%> (-1.02%) ⬇️
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mfeurer (Contributor) commented:

Two more questions :(

Also, do you know what's going on with the unit tests?

@eddiebergman (Contributor, Author) replied:

> Two more questions :(
>
> Also, do you know what's going on with the unit tests?

Hmm, not sure why this didn't fail before, but spmatrix.todense gives back an np.matrix. It should have been spmatrix.toarray, which gives an np.ndarray.
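The difference is easy to reproduce with a small random sparse matrix:

```python
import numpy as np
from scipy import sparse

X = sparse.random(5, 5, density=0.2, format="csr", random_state=0)

dense_matrix = X.todense()  # np.matrix: 2-D only, overloads * as matmul
dense_array = X.toarray()   # np.ndarray: the type expected downstream
```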

@mfeurer (Contributor) commented:

Just one conflict away from being able to merge this :)

@eddiebergman (Contributor, Author) commented:

Conflict resolved, just waiting on tests and analysis.

…t size (#1341)

* Added dataset_compression parameter and validation

* Fix docstring

* Updated docstring for `resampling_strategy`

* Updated param def and memory_allocation can now be absolute

* insert newline

* Fix params into one line

* fix indentation in docs

* fix import breaks

* Allow absolute memory_allocation

* Tests

* Update test on for precision omitted from methods

* Update test for akslearn2 with same args

* Update to use TypedDict for better Mypy parsing

* Added arg to asklearn2

* Updated tests to remove some warnings

* flaked

* Fix broken link?

* Remove TypedDict as it's not supported in Python3.7

* Missing import

* Review changes

* Fix magic mock for python < 3.9
@mfeurer (Contributor) commented:

Just two more minor questions, then we're good to go :)

Resolved review threads: autosklearn/util/data.py, test/test_util/test_data.py
@eddiebergman eddiebergman merged commit bdebeca into development Feb 1, 2022
@eddiebergman eddiebergman deleted the use_new_splitter branch February 1, 2022 14:29
github-actions bot pushed a commit that referenced this pull request Feb 1, 2022
github-actions bot pushed a commit that referenced this pull request Feb 2, 2022
mfeurer pushed a commit that referenced this pull request Feb 3, 2022
… and #1250 (#1386)

* Add: Doc for `dataset_compression`

* Fix: Shorten line

* Doc: Make more clear that the argument None still provides defaults
github-actions bot pushed a commit that referenced this pull request Feb 3, 2022
eddiebergman added a commit that referenced this pull request Aug 18, 2022
…ures (#1250)

* Moved to new splitter, moved to util file

* flake8'd

* Fixed errors, added test specifically for CustomStratifiedShuffleSplit

* flake8'd

* Updated docstring

* Updated types in docstring

* reduce_dataset_size_if_too_large supports more types

* flake8'd

* flake8'd

* Updated docstring

* Seperated out the data subsampling into individual functions

* Improved typing from Automl.fit to reduce_dataset_size_if_too_large

* flak8'd

* subsample tested

* Finished testing and flake8'd

* Cleaned up transform function that was touched

* ^

* Removed double typing

* Cleaned up typing of convert_if_sparse

* Cleaned up splitters and added size test

* Cleanup doc in data

* rogue line added was removed

* Test fix

* flake8'd

* Typo fix

* Fixed ordering of things

* Fixed typing and tests of target_validator fit, transform, inv_transform

* Updated doc

* Updated Type return

* Removed elif gaurd

* removed extraneuous overload

* Updated return type of feature validator

* Type fixes for target validator fit

* flake8'd

* Fixed err message str and automl sparse y tests

* Flak8'd

* Fix sort indices

* list type to List

* Remove uneeded comment

* Updated comment to make it more clear

* Comment update

* Fixed warning message for reduce_dataset_if_too_large

* Fix test

* Added check for error message in tests

* Test Updates

* Fix error msg

* reinclude csr y to test

* Reintroduced explicit subsample values test

* flaked

* Missed an uncomment

* Update the comment for test of splitters

* Updated warning message in CustomSplitter

* Update comment in test

* Update tests

* Removed overloads

* Narrowed type of subsample

* Removed overload import

* Fix `todense` giving np.matrix, using `toarray`

* Made subsampling a little less aggresive

* Changed multiplier back to 10

* Allow argument to specfiy how auto-sklearn handles compressing dataset size  (#1341)

* Fixed bad merge
eddiebergman added a commit that referenced this pull request Aug 18, 2022