Partial dependence errors with column with string and NaN values #2475

Closed
jeremyliweishih opened this issue Jul 8, 2021 · 2 comments · Fixed by #2834
Labels
bug Issues tracking problems with existing features. priority

Comments

@jeremyliweishih
Collaborator

jeremyliweishih commented Jul 8, 2021

Partial dependence errors when a column contains both string values and NaN. Grid generation seems to fail because `np.unique()` cannot run on mixed values.

import pandas as pd
from evalml import AutoMLSearch
from evalml.model_understanding import partial_dependence

df = pd.read_csv('1625078186889-mushroom_subset.csv')
y_train = df['class']
X_train = df.drop('class', axis=1)

aml = AutoMLSearch(X_train, y_train, 'binary')
aml.search()

pipeline = aml.best_pipeline

holdout = pd.read_csv('mushroom_holdout.csv')
partial_dependence(pipeline, holdout, 3)

[Image: Screen Shot 2021-07-08 at 11 24 07 AM]

Notebook and datasets:
string_nan_ex.zip
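
To illustrate the suspected cause without the attached datasets, here is a minimal sketch. It assumes the failing column is an object-dtype column holding Python strings and float NaN; the column values are made up for the example and are not taken from the mushroom data. `np.unique` sorts its input, and sorting has to compare a `str` with a `float`, which Python does not allow.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a column that mixes strings and NaN.
column = pd.Series(["x", "y", np.nan, "x"], dtype=object)

try:
    # np.unique sorts the values, which forces a str-vs-float comparison.
    np.unique(column.to_numpy())
    raised = False
except TypeError as exc:
    raised = True
    print(f"np.unique failed: {exc}")

# Dropping NaN first leaves only strings, which sort without trouble.
non_null = np.unique(column.dropna().to_numpy())
print(non_null)
```

Whether dropping NaN is the right handling for partial dependence depends on how the pipeline imputes missing values, so treat the last step as one option rather than the fix.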

@jeremyliweishih jeremyliweishih added the bug Issues tracking problems with existing features. label Jul 8, 2021
@jeremyliweishih jeremyliweishih changed the title from "Make sure we infer logical types properly in LG" to "Partial dependence errors with column with string and NaN values" Jul 8, 2021
@chukarsten chukarsten self-assigned this Jul 20, 2021
@freddyaboulton
Contributor

I'm not sure there's much we can do here until we compute the grid for partial dependence ourselves for categoricals, which I think is what #2502 is tracking. Alternatively, we can wait for sklearn to let us provide the grid ourselves: scikit-learn/scikit-learn#20890

@freddyaboulton
Contributor

Chatted with @dsherry and the team at OH. We agree that in order to fix this bug we need to be able to compute our own grid for partial dependence values. That is what #2502 is tracking.

FYI @chukarsten since you weren't at OH. This might affect the current sprint scope.
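
For readers wondering what "computing our own grid" could look like for a categorical feature, here is a rough sketch. It is not the evalml or scikit-learn implementation and not necessarily what #2502 ended up doing; `categorical_grid`, `manual_partial_dependence`, and `toy_predict` are hypothetical names, and the toy data stands in for a fitted pipeline and holdout set.

```python
import numpy as np
import pandas as pd


def categorical_grid(column: pd.Series) -> list:
    """Return the distinct non-null categories in first-seen order."""
    # dropna() removes NaN before anything is compared or sorted,
    # so the str-vs-float ordering problem never comes up.
    return column.dropna().unique().tolist()


def manual_partial_dependence(predict_fn, X: pd.DataFrame, feature: str) -> pd.Series:
    """Average prediction when `feature` is forced to each grid value."""
    averages = {}
    for value in categorical_grid(X[feature]):
        X_mod = X.copy()
        X_mod[feature] = value  # overwrite the whole column with one category
        averages[value] = float(np.mean(predict_fn(X_mod)))
    return pd.Series(averages, name="partial_dependence")


def toy_predict(X: pd.DataFrame) -> np.ndarray:
    # Hypothetical model: predicts 1.0 whenever cap is "flat".
    return (X["cap"] == "flat").astype(float).to_numpy()


X = pd.DataFrame({"cap": ["flat", np.nan, "bell", "flat"], "size": [1, 2, 3, 4]})
pd_values = manual_partial_dependence(toy_predict, X, "cap")
print(pd_values)
```

The point of the sketch is only that the grid is built from the observed non-null categories rather than from a sorted `np.unique` call, so mixed string and NaN columns no longer block the computation.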
