Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle new oversampler nullable type incompatibility in X #4068

Merged
merged 4 commits into from
Mar 16, 2023

Conversation

tamargrey
Copy link
Contributor

closes #3974

This incompatibility only shows up when string categorical columns are present with the nullable pandas dtypes, but I chose not to add a special _handle_nullable_types implementation since it's dataset dependent vs the time series imputer that needed a different handling but that would be applied to any dataset. This is a bit of a gray area, though, so I'm open to changing it.

@codecov
Copy link

codecov bot commented Mar 10, 2023

Codecov Report

Merging #4068 (19e53a2) into main (f692e0d) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #4068     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        349     349             
  Lines      37559   37583     +24     
=======================================
+ Hits       37441   37465     +24     
  Misses       118     118             
Impacted Files Coverage Δ
...es/components/transformers/samplers/oversampler.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_oversampler.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@tamargrey tamargrey force-pushed the add-oversampler-X-incompatibility branch from dccb304 to 219d78c Compare March 10, 2023 18:05
),
],
)
def test_oversampler_category_dtype_incompatibility(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually the same bug as the nullable type incompatibility, and it's why we convert categorical columns to "object" in base_sampler. There is a fix in the works scikit-learn/scikit-learn#25814 (because of the wonderful @thomasjpfan who has been a key contributor to our ability to support nullable types so widely), so I wanted to add this xfailing test to remind us to also remove that logic at the same time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay converting caegory to object was a bug fix of yours from a while back, so I just want to make sure you get a chance to take a look at this test in case I'm missing something here.

@tamargrey
Copy link
Contributor Author

Note - this doesn't need to get in for the upcoming release

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a solid approach to me. Thanks for being so thorough in the documentation of the tests and what we need to undo when it succeeds.

@tamargrey tamargrey force-pushed the add-oversampler-X-incompatibility branch 3 times, most recently from 38671f8 to 70b7935 Compare March 16, 2023 13:01
@tamargrey tamargrey force-pushed the add-oversampler-X-incompatibility branch from 70b7935 to 19e53a2 Compare March 16, 2023 15:07
@tamargrey tamargrey enabled auto-merge (squash) March 16, 2023 15:26
@tamargrey tamargrey merged commit 405655f into main Mar 16, 2023
@tamargrey tamargrey deleted the add-oversampler-X-incompatibility branch March 16, 2023 15:43
@chukarsten chukarsten mentioned this pull request Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Oversampler: Add nullable type handling for nullable y
3 participants