
PipelineBase.compute_estimator_features fails for Undersampler #2272

Merged (4 commits merged into main on May 17, 2021)

Conversation

@chukarsten (Contributor) commented May 14, 2021

Background
Our ComponentGraph (first added in Dec 2020) was not originally intended to support transformer components which can alter the supervised target data.

The undersampler was recently added as a component in the graph (#2030). The undersampler component is the first component we've had that alters the target data. The way the undersampler currently works is:

  • First, we assume that in the ComponentGraph, each transformer's fit_transform is called only during fit, whereas transform is called only during predict.
  • Define the undersampler's fit_transform to apply undersampling to the input data and target at the same time.
  • Define the undersampler's transform to be a no-op that passes both X and y through from input to output, because we assume this only happens at pipeline predict-time (see the sketch below).
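
To make that contract concrete, here is a minimal, illustrative sketch of a sampler that follows those rules. The class name and the naive 1:1 resampling logic are simplified stand-ins, not evalml's actual BaseSampler/Undersampler code:

import pandas as pd

class SketchUndersampler:
    """Toy sampler illustrating the fit_transform/transform split described above."""

    def fit_transform(self, X, y):
        # Training path: resample the features and the target together.
        majority = y.value_counts().idxmax()
        minority_idx = y[y != majority].index
        # Keep all minority rows plus an equal number of majority rows (naive 1:1 undersample).
        majority_keep = y[y == majority].sample(n=len(minority_idx), random_state=0).index
        keep = minority_idx.union(majority_keep)
        return X.loc[keep], y.loc[keep]

    def transform(self, X, y=None):
        # Prediction path: no resampling; pass the features through unchanged.
        # Before this PR the real sampler returned (X, y); after the fix it returns (X, None).
        return X, None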

When we added the undersampler in #2030 and related PRs, we added test coverage for evaluating the component, and for using the component in a pipeline and calling pipeline fit/predict. That all works just great. What we didn't add coverage for is calling the method PipelineBase.compute_estimator_features on a pipeline with the undersampler in it! This is used by permutation importance, which is how we first noticed the bug.

Symptoms
Calling PipelineBase.compute_estimator_features on a pipeline which uses the undersampler results in a stack trace. Reproducer at the bottom.

Cause
Because undersampler transform returns both X and y, our current ComponentGraph impl will save the unmodified target returned from that method under the key "Undersampler.y" in the component features dict during evaluation. This doesn't cause problems during pipeline fit and predict, but when features are computed for each transformer in _fit_transform_features_helper, that code erroneously grabs the "Undersampler.y" output and appends it to the input of the next component, which in automl and in our reproducer happens to be the estimator. This is why one of the stack traces at the bottom comes from the estimator and mentions too many features being present.
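
As a toy illustration of that failure mode (this is not the actual _fit_transform_features_helper code, just a sketch of what naively concatenating every component output would do):

import pandas as pd

# Hypothetical per-component output dict after evaluating the undersampler:
outputs = {
    "Undersampler.x": pd.DataFrame({"f0": [1, 2], "f1": [3, 4]}),   # the real features
    "Undersampler.y": pd.Series([0, 1], name="Undersampler.y"),     # the passthrough target
}

# Naively treating every output as an input feature for the next component
# yields one column too many: the estimator sees 3 columns instead of 2
# (21 instead of 20 in the repro below).
estimator_input = pd.concat(list(outputs.values()), axis=1)
print(estimator_input.shape)  # (2, 3)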

Solution
This PR has a quick fix to change BaseSampler.transform to return None instead of y. This means that when the features are computed for each transformer, the resulting component output dict will no longer contain a key "Undersampler.y".

Long-term: we need to add test coverage to ensure all pipeline methods work properly for components which modify the target like the undersampler.

@dsherry had a spike which cleans up the component graph impl and would fix this issue, #2110 . And another spike #2210 which simplifies the sampler API, although that wouldn't have fixed the issue here directly.

Repro
The following script

import pandas as pd
import evalml
from sklearn.datasets import make_classification
X, y = make_classification()
X = pd.DataFrame(X)
y = pd.Series(y)
pipeline = evalml.pipelines.BinaryClassificationPipeline(component_graph=['Undersampler', 'Elastic Net Classifier'])
pipeline.fit(X, y)
pipeline.predict(X)
evalml.model_understanding.calculate_permutation_importance(pipeline, X, y, objective='Log Loss Binary')

runs fine until the call to calculate_permutation_importance, which produces this stack trace

<ipython-input-13-98cd846c0e1d> in <module>
----> 1 evalml.model_understanding.calculate_permutation_importance(pipeline, X, y, objective='Log Loss Binary')
~/development/evalml/evalml/model_understanding/graphs.py in calculate_permutation_importance(pipeline, X, y, objective, n_repeats, n_jobs, random_seed)
--> 385         perm_importance = _fast_permutation_importance(pipeline, X, y, objective, n_repeats=n_repeats, n_jobs=n_jobs,
~/development/evalml/evalml/model_understanding/graphs.py in _fast_permutation_importance(pipeline, X, y, objective, n_repeats, n_jobs, random_seed)
--> 350     baseline_score = scorer(pipeline, precomputed_features, y, objective)
~/development/evalml/evalml/model_understanding/graphs.py in scorer(pipeline, features, y, objective)
--> 342             preds = pipeline.estimator.predict_proba(features)
~/development/evalml/evalml/pipelines/components/component_base_meta.py in _check_for_fit(self, X, y)
---> 27                 return method(self, X)
~/development/evalml/evalml/pipelines/components/estimators/estimator.py in predict_proba(self, X)
---> 79             pred_proba = self._component_obj.predict_proba(X)
... (and then more frames in sklearn linear estimator)

ValueError: X has 21 features per sample; expecting 20

If you run the above but add a name to the pd Series for the target, you get a slightly different stack trace ending with

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

because the component graph somehow ends up returning all-NaNs for the target passed from the undersampler to the estimator.

codecov bot commented May 14, 2021

Codecov Report

Merging #2272 (88ed0c6) into main (016c831) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2272     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         280      280             
  Lines       24369    24382     +13     
=========================================
+ Hits        24347    24360     +13     
  Misses         22       22             
Impacted Files Coverage Δ
...s/components/transformers/samplers/base_sampler.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_oversamplers.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_undersampler.py 100.0% <100.0%> (ø)
...understanding_tests/test_permutation_importance.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@angela97lin (Contributor) left a comment

Thanks 👍 !!

is there an issue to track investigating how to fix the underlying bug?

@chukarsten (Contributor, Author) replied
@angela97lin not yet, no. I'm still enumerating some of the issues in the hotfix. There are 6 broken tests and this only fixed one, so I'll need re-review.

@dsherry changed the title from "0.24.0 hotfix" to "PipelineBase.compute_estimator_features fails for Undersampler" on May 14, 2021
@@ -57,7 +57,7 @@ def transform(self, X, y=None):
         X = infer_feature_types(X)
         if y is not None:
             y = infer_feature_types(y)
-        return X, y
+        return X, None

I wrote up an explanation for why this fixes the issue in the PR description. Copying here for visibility:

Because undersampler transform returns both X and y, our current ComponentGraph impl will save the unmodified target returned from that method under the key "Undersampler.y" in the component features dict during evaluation. This doesn't cause problems during pipeline fit and predict, but when features are computed for each transformer in _fit_transform_features_helper, that code erroneously grabs the "Undersampler.y" output and appends it to the input of the next component, which in automl and in our reproducer happens to be the estimator. This is why one of the stack traces at the bottom comes from the estimator and mentions too many features being present.


Long-term, we'll need to update our component graph to be tolerant to this behavior. If a component transform happens to return a target, we should use that target. If it doesn't we should just use the last modified target. We have logic for this already in the main component graph method for fit/predict, but not yet in _fit_transform_features_helper. We can address this separately once this patch is merged and tested.
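
In other words, the helper's target-selection logic could look something like this hypothetical sketch (not the existing component graph code):

def _select_target(transform_output, last_target):
    # If the component's transform returned a target, adopt it going forward;
    # otherwise keep using the most recently modified target.
    X_t, y_t = transform_output
    return X_t, (y_t if y_t is not None else last_target)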

        assert isinstance(transform_output[0], ww.DataTable)
        assert isinstance(transform_output[1], ww.DataColumn)
    elif 'sampler' in component.name:
        assert isinstance(transform_output[0], ww.DataTable)
        assert transform_output[1] is None
👍

    if 'sampler' in transformer.name:
        X_t, y_t = transformer.transform(X, y)
        X_t = X_t.to_dataframe()
        assert y_t is None
👍

@dsherry (Contributor) left a comment

@chukarsten looks great!

Before merging, please add direct coverage for calling calculate_permutation_importance on a pipeline which contains the undersampler. We can worry about adding generalized coverage later. I think a good place to do that is in evalml/tests/model_understanding_tests/test_permutation_importance.py.
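
Something along these lines, as a hedged sketch of the requested coverage (the test name and exact assertion are assumptions, not the test that was ultimately added):

import pandas as pd
from sklearn.datasets import make_classification
import evalml

def test_permutation_importance_with_undersampler():
    X, y = make_classification(random_state=0)
    X, y = pd.DataFrame(X), pd.Series(y)
    pipeline = evalml.pipelines.BinaryClassificationPipeline(
        component_graph=['Undersampler', 'Elastic Net Classifier'])
    pipeline.fit(X, y)
    # Before this fix, this call raised a ValueError from the estimator.
    importance = evalml.model_understanding.calculate_permutation_importance(
        pipeline, X, y, objective='Log Loss Binary')
    # Assuming one importance row per input feature.
    assert len(importance) == X.shape[1]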

@freddyaboulton (Contributor) left a comment

Looks great @chukarsten !

@dsherry added the "blocker" (An issue blocking a release.) and "bug" (Issues tracking problems with existing features.) labels on May 14, 2021
@bchen1116 (Contributor) left a comment

LGTM!

@chukarsten merged commit 674c45b into main on May 17, 2021
@chukarsten mentioned this pull request on May 17, 2021
@freddyaboulton deleted the 0.24.0_hotfix branch on May 13, 2022