Dataset size reduction fixed, updated TargetValidator to match signatures #1250
Codecov Report
@@             Coverage Diff              @@
##           development   #1250    +/-   ##
===============================================
+ Coverage      88.02%    88.31%   +0.29%
===============================================
  Files            140       140
  Lines          10973     11070      +97
===============================================
+ Hits            9659      9777     +118
+ Misses          1314      1293      -21
Two more questions :(
Also, do you know what's going on with the unit tests?
Hmm, not sure why this didn't fail before but |
Just one conflict away from being able to merge this :)
Conflict resolved, just waiting on tests and analysis
…t size (#1341)
* Added dataset_compression parameter and validation
* Fix docstring
* Updated docstring for `resampling_strategy`
* Updated param def and memory_allocation can now be absolute
* insert newline
* Fix params into one line
* fix indentation in docs
* fix import breaks
* Allow absolute memory_allocation
* Tests
* Update test on for precision omitted from methods
* Update test for akslearn2 with same args
* Update to use TypedDict for better Mypy parsing
* Added arg to asklearn2
* Updated tests to remove some warnings
* flaked
* Fix broken link?
* Remove TypedDict as it's not supported in Python3.7
* Missing import
* Review changes
* Fix magic mock for python < 3.9
Just two more minor questions, then we're good to go :)
…to match signatures (#1250)
…ures (#1250)
* Moved to new splitter, moved to util file
* flake8'd
* Fixed errors, added test specifically for CustomStratifiedShuffleSplit
* flake8'd
* Updated docstring
* Updated types in docstring
* reduce_dataset_size_if_too_large supports more types
* flake8'd
* flake8'd
* Updated docstring
* Seperated out the data subsampling into individual functions
* Improved typing from Automl.fit to reduce_dataset_size_if_too_large
* flak8'd
* subsample tested
* Finished testing and flake8'd
* Cleaned up transform function that was touched
* ^
* Removed double typing
* Cleaned up typing of convert_if_sparse
* Cleaned up splitters and added size test
* Cleanup doc in data
* rogue line added was removed
* Test fix
* flake8'd
* Typo fix
* Fixed ordering of things
* Fixed typing and tests of target_validator fit, transform, inv_transform
* Updated doc
* Updated Type return
* Removed elif gaurd
* removed extraneuous overload
* Updated return type of feature validator
* Type fixes for target validator fit
* flake8'd
* Fixed err message str and automl sparse y tests
* Flak8'd
* Fix sort indices
* list type to List
* Remove uneeded comment
* Updated comment to make it more clear
* Comment update
* Fixed warning message for reduce_dataset_if_too_large
* Fix test
* Added check for error message in tests
* Test Updates
* Fix error msg
* reinclude csr y to test
* Reintroduced explicit subsample values test
* flaked
* Missed an uncomment
* Update the comment for test of splitters
* Updated warning message in CustomSplitter
* Update comment in test
* Update tests
* Removed overloads
* Narrowed type of subsample
* Removed overload import
* Fix `todense` giving np.matrix, using `toarray`
* Made subsampling a little less aggresive
* Changed multiplier back to 10
* Allow argument to specfiy how auto-sklearn handles compressing dataset size (#1341)
* Added dataset_compression parameter and validation
* Fix docstring
* Updated docstring for `resampling_strategy`
* Updated param def and memory_allocation can now be absolute
* insert newline
* Fix params into one line
* fix indentation in docs
* fix import breaks
* Allow absolute memory_allocation
* Tests
* Update test on for precision omitted from methods
* Update test for akslearn2 with same args
* Update to use TypedDict for better Mypy parsing
* Added arg to asklearn2
* Updated tests to remove some warnings
* flaked
* Fix broken link?
* Remove TypedDict as it's not supported in Python3.7
* Missing import
* Review changes
* Fix magic mock for python < 3.9
* Fixed bad merge
This PR's main goal was to have `reduce_dataset_size_if_too_large` use the new splitter, but this ended up being easier to correctly type, fix and test as separate functions.
Part of this was confirming the types it would receive by going through and fixing the types from the entry point `AutoML.fit()` through to where it calls `reduce_dataset_size_if_too_large`. This led to having to fix typing in `InputValidator`, as it happens before `reduce_dataset_size_if_too_large` and had some oddities with type casting.
Major changes
* New splitter `CustomStratifiedShuffleSplit` that accounts for unique labels.
* `TargetValidator` no longer accepts sparse y, in line with the rest of autosklearn.
* `FeatureValidator` now correctly returns the types it specifies from `fit` and `transform`:
  * `fit()` returns specifically a `FeatureValidator` and not generically a `BaseEstimator`.
  * `transform` now correctly specifies the types it can return, also changing `list -> DataFrame` to `list -> np.ndarray`.
* `TargetValidator` now correctly returns the types it specifies from `fit`, `_fit`, `transform` and `inverse_transform`, improving typing in `automl.py`:
  * `fit()` and `_fit()` return specifically a `TargetValidator` and not generically a `BaseEstimator`.
  * `transform` and `inverse_transform` no longer accept y of type `spmatrix` and always return an `ndarray`.
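To illustrate what "accounts for unique labels" can mean when subsampling, here is a minimal standard-library sketch. It is not the PR's `CustomStratifiedShuffleSplit`: the function name `subsample_keeping_all_labels`, its parameters, and the two-step strategy are all assumptions made for illustration. It reserves one index per label first, so rare classes survive, and then fills the rest of the quota with a uniform draw, which only roughly preserves class proportions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def subsample_keeping_all_labels(
    y: Sequence[Hashable], train_size: int, seed: int = 0
) -> List[int]:
    """Return `train_size` sorted indices into `y`, at least one per unique label.

    Hypothetical helper for illustration; not auto-sklearn's implementation.
    """
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(y):
        by_label[label].append(index)

    if not len(by_label) <= train_size <= len(y):
        raise ValueError("train_size must lie between n_unique_labels and len(y)")

    rng = random.Random(seed)
    # Step 1: guarantee coverage by reserving one random index per label.
    chosen = [rng.choice(indices) for indices in by_label.values()]

    # Step 2: fill the remaining quota from a shuffled pool of the other indices.
    # A uniform draw keeps class proportions roughly (not exactly) intact.
    reserved = set(chosen)
    pool = [i for i in range(len(y)) if i not in reserved]
    rng.shuffle(pool)
    chosen.extend(pool[: train_size - len(chosen)])
    return sorted(chosen)


# A label with a single member ("c") still ends up in the subsample.
y = ["a"] * 50 + ["b"] * 49 + ["c"]
indices = subsample_keeping_all_labels(y, train_size=10, seed=1)
print(len(indices), sorted({y[i] for i in indices}))  # 10 ['a', 'b', 'c']
```

Reserving one index per label before sampling is the simplest way to avoid dropping singleton classes; a faithful stratified splitter would additionally allocate the remaining quota per class.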
Minor changes
* `AutoML.subsample_if_too_large` -> `autosklearn.util.data.reduce_dataset_size_if_too_large`
  * Separated into `subsample` and `reduce_precision`
* `CustomStratifiedShuffleSplit` was fixed to respect training size
* Added a test for `CustomStratifiedShuffleSplit` to check for unique labels and respecting `training_size`
* Improved typing from `Automl.fit():start` up until `Automl.fit():reduce_dataset_size_if_too_large`
* Removed the `typing` prefix, e.g. `typing.List -> List`, to prevent clutter on long signatures and docs
* Updated tests for `TargetValidator` in line with its cleaned typing and fixes.
* `@overload` in `InputValidator.transform` so mypy correctly knows when to expect one return value vs. two
Notes:
The types in general have to be updated and confirmed as pointed out in issue #1264
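The two typing ideas in the change lists, `fit()` returning its own class and `@overload` telling mypy whether to expect one return value or two, can be sketched as follows. `ValidatorSketch` is a hypothetical stand-in that works on plain lists; it is not auto-sklearn's `InputValidator`, and the commit log also records overloads being removed later, so treat this as an illustration of the idea rather than the merged code.

```python
from typing import List, Optional, Tuple, Union, overload


class ValidatorSketch:
    """Hypothetical stand-in for a validator; names and types are assumptions."""

    def fit(self, X: List[float]) -> "ValidatorSketch":
        # Returning the concrete class (not a generic base class) lets callers
        # chain `.fit(...).transform(...)` without casts.
        return self

    @overload
    def transform(self, X: List[float], y: None = None) -> List[float]: ...

    @overload
    def transform(
        self, X: List[float], y: List[int]
    ) -> Tuple[List[float], List[int]]: ...

    def transform(
        self, X: List[float], y: Optional[List[int]] = None
    ) -> Union[List[float], Tuple[List[float], List[int]]]:
        # One value back when only X is given, a pair when y is given too.
        X_out = [float(v) for v in X]
        if y is None:
            return X_out
        return X_out, [int(v) for v in y]


validator = ValidatorSketch().fit([1, 2])
X_only = validator.transform([1, 2])           # type checker sees List[float]
X_t, y_t = validator.transform([1, 2], [0, 1])  # type checker sees a 2-tuple
```

Without the overloads, a type checker only sees the `Union` return type, so every call site that unpacks two values needs a cast or an `isinstance` check.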
New funcs