
Dataset size reduction fixed, updated TargetValidator to match signatures #1250

Merged — 99 commits, Feb 1, 2022

Conversation

@eddiebergman (Contributor) commented Sep 15, 2021

This PR's main goal was to have reduce_dataset_size_if_too_large use the new splitter, but it ended up being easier to correctly type, fix, and test as separate functions.

Part of this was confirming the types it would receive by going through and fixing the types from the entry point AutoML.fit() through to where reduce_dataset_size_if_too_large is called. This led to also fixing the typing in InputValidator, which runs before reduce_dataset_size_if_too_large and had some oddities with type casting.

Major changes

  • Autosklearn now subsamples less aggressively than before. The previous version didn't account for the new dtype byte sizes once precision was reduced.
  • Autosklearn now subsamples using a CustomStratifiedShuffleSplit that accounts for unique labels.
  • TargetValidator no longer accepts sparse y, in line with the rest of autosklearn.
  • FeatureValidator now correctly returns the types it specifies from fit and transform
    • fit() returns specifically a FeatureValidator and not generically a BaseEstimator
    • transform now correctly specifies the types it can return, also changing list -> DataFrame to list -> np.ndarray
  • TargetValidator now correctly returns the types it specifies from fit, _fit, transform, and inverse_transform, improving typing in automl.py.
    • fit() and _fit() return specifically a TargetValidator and not generically a BaseEstimator
    • transform and inverse_transform no longer accept y of type spmatrix and always return an ndarray
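To illustrate the first point: once precision is reduced, a memory estimate must use the new dtype's itemsize, not the original one. A minimal sketch of the idea (estimate_memory_mb is a hypothetical helper for illustration, not auto-sklearn's API):

```python
import numpy as np

def estimate_memory_mb(X: np.ndarray) -> float:
    # nbytes already reflects the array's current dtype itemsize
    return X.nbytes / (2 ** 20)

X = np.ones((1000, 100), dtype=np.float64)
before = estimate_memory_mb(X)           # 8 bytes per element

# Precision reduction halves the per-element byte size
X_reduced = X.astype(np.float32)
after = estimate_memory_mb(X_reduced)    # 4 bytes per element
```

Subsampling based on the pre-cast estimate would be twice as aggressive as necessary, which is the behavior the fix addresses.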

Minor changes

  • AutoML.subsample_if_too_large -> autosklearn.util.data.reduce_dataset_size_if_too_large
  • This function was also split into two separate functions, subsample and reduce_precision
  • CustomStratifiedShuffleSplit was fixed to respect the training size
  • Added a specific test for CustomStratifiedShuffleSplit checking for unique labels and that train_size is respected
  • Type fixes from the start of AutoML.fit() up until the call to reduce_dataset_size_if_too_large
    • Also removed the typing prefix, e.g. typing.List -> List, to prevent clutter in long signatures and docs
  • Updated tests for TargetValidator in line with its cleaned-up typing and fixes.
  • Use @overload in InputValidator.transform so mypy correctly knows when to expect one return value vs two
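The @overload pattern from the last point can be sketched as follows (Validator here is a toy stand-in, not InputValidator's actual signature):

```python
from typing import Optional, Tuple, overload

import numpy as np

class Validator:
    @overload
    def transform(self, X: np.ndarray) -> np.ndarray: ...

    @overload
    def transform(self, X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]: ...

    def transform(self, X: np.ndarray, y: Optional[np.ndarray] = None):
        # One return value without y, two with it; the overloads let
        # mypy pick the right return type at each call site.
        if y is None:
            return X
        return X, y
```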

Notes:

The types in general still have to be updated and confirmed, as pointed out in issue #1264.

New functions

# autosklearn/util/data.py

def reduce_dataset_size_if_too_large(
    X: Union[np.ndarray, spmatrix],
    y: np.ndarray,
    memory_limit: int,
    is_classification: bool,
    random_state: Union[int, np.random.RandomState] = None,
    operations: List[str] = ['precision', 'subsample'],
    multiplier: Union[float, int] = 10,
) -> Tuple[Union[np.ndarray, spmatrix], np.ndarray]:
    ...

def reduce_precision(
    X: Union[np.ndarray, spmatrix]
) -> Tuple[Union[np.ndarray, spmatrix], Type]:
    ...

def subsample(
    X: SUPPORTED_FEAT_TYPES,
    y: Union[List, np.ndarray, pd.DataFrame, pd.Series],
    is_classification: bool,
    sample_size: Union[float, int],
) -> Tuple[SUPPORTED_FEAT_TYPES, Union[List, np.ndarray, pd.DataFrame, pd.Series]]:
    ...

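For classification, subsample keeps the label distribution intact. Its behavior can be approximated with scikit-learn's train_test_split and its stratify parameter (the real implementation uses the CustomStratifiedShuffleSplit described above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)  # balanced binary labels

# Keep 20 of the 50 samples; stratify preserves class proportions
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=20, stratify=y, random_state=0
)
```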
@codecov bot commented Sep 15, 2021

Codecov Report

Merging #1250 (a1cc277) into development (887bb10) will increase coverage by 0.29%.
The diff coverage is 96.35%.


@@               Coverage Diff               @@
##           development    #1250      +/-   ##
===============================================
+ Coverage        88.02%   88.31%   +0.29%     
===============================================
  Files              140      140              
  Lines            10973    11070      +97     
===============================================
+ Hits              9659     9777     +118     
+ Misses            1314     1293      -21     
Impacted Files Coverage Δ
autosklearn/evaluation/splitter.py 94.38% <90.00%> (+34.62%) ⬆️
autosklearn/data/validation.py 97.77% <91.66%> (+0.05%) ⬆️
autosklearn/data/target_validator.py 96.66% <94.87%> (-0.42%) ⬇️
autosklearn/automl.py 86.75% <95.45%> (-0.12%) ⬇️
autosklearn/util/data.py 82.08% <98.00%> (+41.54%) ⬆️
autosklearn/data/feature_validator.py 97.50% <100.00%> (ø)
autosklearn/estimators.py 93.45% <100.00%> (+0.03%) ⬆️
autosklearn/experimental/askl2.py 83.33% <100.00%> (ø)
autosklearn/util/logging_.py 88.96% <0.00%> (-1.38%) ⬇️
autosklearn/ensemble_builder.py 85.82% <0.00%> (-1.02%) ⬇️
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@mfeurer (Contributor) commented:

Two more questions :(

Also, do you know what's going on with the unit tests?

@eddiebergman (Contributor, Author) replied:

> Two more questions :(
>
> Also, do you know what's going on with the unit tests?

Hmm, not sure why this didn't fail before, but spmatrix.todense gives back an np.matrix. It should have been spmatrix.toarray, which gives an np.ndarray.
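The difference is easy to reproduce with a small random sparse matrix:

```python
import numpy as np
from scipy import sparse

X = sparse.random(5, 5, density=0.2, format="csr", random_state=0)

dense_matrix = X.todense()  # np.matrix: 2-D only, overloads * as matmul
dense_array = X.toarray()   # np.ndarray: the type expected downstream
```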

@mfeurer (Contributor) commented:

Just one conflict away from being able to merge this :)

@eddiebergman (Contributor, Author) commented:

Conflict resolved, just waiting on tests and analysis.

…t size (#1341)

* Added dataset_compression parameter and validation

* Fix docstring

* Updated docstring for `resampling_strategy`

* Updated param def and memory_allocation can now be absolute

* insert newline

* Fix params into one line

* fix indentation in docs

* fix import breaks

* Allow absolute memory_allocation

* Tests

* Update test on for precision omitted from methods

* Update test for akslearn2 with same args

* Update to use TypedDict for better Mypy parsing

* Added arg to asklearn2

* Updated tests to remove some warnings

* flaked

* Fix broken link?

* Remove TypedDict as it's not supported in Python3.7

* Missing import

* Review changes

* Fix magic mock for python < 3.9
@mfeurer (Contributor) commented:

Just two more minor questions, then we're good to go :)

Resolved review threads: autosklearn/util/data.py, test/test_util/test_data.py
@eddiebergman eddiebergman merged commit bdebeca into development Feb 1, 2022
@eddiebergman eddiebergman deleted the use_new_splitter branch February 1, 2022 14:29
github-actions bot pushed a commit that referenced this pull request Feb 1, 2022
github-actions bot pushed a commit that referenced this pull request Feb 2, 2022
mfeurer pushed a commit that referenced this pull request Feb 3, 2022
… and #1250 (#1386)

* Add: Doc for `dataset_compression`

* Fix: Shorten line

* Doc: Make more clear that the argument None still provides defaults
github-actions bot pushed a commit that referenced this pull request Feb 3, 2022
eddiebergman added a commit that referenced this pull request Aug 18, 2022
…ures (#1250)

* Moved to new splitter, moved to util file

* flake8'd

* Fixed errors, added test specifically for CustomStratifiedShuffleSplit

* flake8'd

* Updated docstring

* Updated types in docstring

* reduce_dataset_size_if_too_large supports more types

* flake8'd

* flake8'd

* Updated docstring

* Seperated out the data subsampling into individual functions

* Improved typing from Automl.fit to reduce_dataset_size_if_too_large

* flak8'd

* subsample tested

* Finished testing and flake8'd

* Cleaned up transform function that was touched

* ^

* Removed double typing

* Cleaned up typing of convert_if_sparse

* Cleaned up splitters and added size test

* Cleanup doc in data

* rogue line added was removed

* Test fix

* flake8'd

* Typo fix

* Fixed ordering of things

* Fixed typing and tests of target_validator fit, transform, inv_transform

* Updated doc

* Updated Type return

* Removed elif gaurd

* removed extraneuous overload

* Updated return type of feature validator

* Type fixes for target validator fit

* flake8'd

* Fixed err message str and automl sparse y tests

* Flak8'd

* Fix sort indices

* list type to List

* Remove uneeded comment

* Updated comment to make it more clear

* Comment update

* Fixed warning message for reduce_dataset_if_too_large

* Fix test

* Added check for error message in tests

* Test Updates

* Fix error msg

* reinclude csr y to test

* Reintroduced explicit subsample values test

* flaked

* Missed an uncomment

* Update the comment for test of splitters

* Updated warning message in CustomSplitter

* Update comment in test

* Update tests

* Removed overloads

* Narrowed type of subsample

* Removed overload import

* Fix `todense` giving np.matrix, using `toarray`

* Made subsampling a little less aggresive

* Changed multiplier back to 10

* Allow argument to specfiy how auto-sklearn handles compressing dataset size  (#1341)

* Fixed bad merge
eddiebergman added a commit that referenced this pull request Aug 18, 2022