
PipelineBase.compute_estimator_features fails for Undersampler #2272

Merged (4 commits merged into main on May 17, 2021)

Conversation

@chukarsten (Contributor) commented May 14, 2021

Background
Our ComponentGraph (first added in Dec 2020) was not originally intended to support transformer components which can alter the supervised target data.

The undersampler was recently added as a component in the graph (#2030). The undersampler component is the first component we've had that alters the target data. The way the undersampler currently works is:

  • First, we assume that in the ComponentGraph, each transformer's fit_transform is called only during fit, whereas transform is called only during predict.
  • Define the undersampler's fit_transform to apply undersampling to the input data and target at the same time.
  • Define the undersampler's transform to be a no-op that passes both X and y through from input to output, because we assume this only happens at pipeline predict-time (see the sketch below).
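
To make that contract concrete, here is a minimal, illustrative sketch of a sampler that follows those rules. The class name and the naive 1:1 resampling logic are simplified stand-ins, not evalml's actual BaseSampler/Undersampler code:

import pandas as pd

class SketchUndersampler:
    """Toy sampler illustrating the fit_transform/transform split described above."""

    def fit_transform(self, X, y):
        # Training path: resample the features and the target together.
        majority = y.value_counts().idxmax()
        minority_idx = y[y != majority].index
        # Keep all minority rows plus an equal number of majority rows (naive 1:1 undersample).
        majority_keep = y[y == majority].sample(n=len(minority_idx), random_state=0).index
        keep = minority_idx.union(majority_keep)
        return X.loc[keep], y.loc[keep]

    def transform(self, X, y=None):
        # Prediction path: no resampling; pass the features through unchanged.
        # Before this PR the real sampler returned (X, y); after the fix it returns (X, None).
        return X, None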

When we added the undersampler in #2030 and related PRs, we added test coverage for evaluating the component, and for using the component in a pipeline and calling pipeline fit/predict. That all works just great. What we didn't add coverage for is calling the method PipelineBase.compute_estimator_features on a pipeline with the undersampler in it! This is used by permutation importance, which is how we first noticed the bug.

Symptoms
Calling PipelineBase.compute_estimator_features on a pipeline which uses the undersampler results in a stack trace. Reproducer at the bottom.

Cause
Because undersampler transform returns both X and y, our current ComponentGraph impl will save the unmodified target returned from that method under the key "Undersampler.y" in the component features dict during evaluation. This doesn't cause problems during pipeline fit and predict, but when features are computed for each transformer in _fit_transform_features_helper, that code erroneously grabs the "Undersampler.y" output and appends it to the input of the next component, which in automl and in our reproducer happens to be the estimator. This is why one of the stack traces at the bottom comes from the estimator and mentions too many features being present.
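
As a toy illustration of that failure mode (this is not the actual _fit_transform_features_helper code, just a sketch of what naively concatenating every component output would do):

import pandas as pd

# Hypothetical per-component output dict after evaluating the undersampler:
outputs = {
    "Undersampler.x": pd.DataFrame({"f0": [1, 2], "f1": [3, 4]}),   # the real features
    "Undersampler.y": pd.Series([0, 1], name="Undersampler.y"),     # the passthrough target
}

# Naively treating every output as an input feature for the next component
# yields one column too many: the estimator sees 3 columns instead of 2
# (21 instead of 20 in the repro below).
estimator_input = pd.concat(list(outputs.values()), axis=1)
print(estimator_input.shape)  # (2, 3)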

Solution
This PR has a quick fix to change BaseSampler.transform to return None instead of y. This means that when the features are computed for each transformer, the resulting component output dict will no longer contain a key "Undersampler.y".

Long-term: we need to add test coverage to ensure all pipeline methods work properly for components which modify the target like the undersampler.

@dsherry had a spike which cleans up the component graph impl and would fix this issue, #2110 . And another spike #2210 which simplifies the sampler API, although that wouldn't have fixed the issue here directly.

Repro
The following script

import pandas as pd
import evalml
from sklearn.datasets import make_classification
X, y = make_classification()
X = pd.DataFrame(X)
y = pd.Series(y)
pipeline = evalml.pipelines.BinaryClassificationPipeline(component_graph=['Undersampler', 'Elastic Net Classifier'])
pipeline.fit(X, y)
pipeline.predict(X)
evalml.model_understanding.calculate_permutation_importance(pipeline, X, y, objective='Log Loss Binary')

runs fine until the call to calculate_permutation_importance, which produces this stack trace

<ipython-input-13-98cd846c0e1d> in <module>
----> 1 evalml.model_understanding.calculate_permutation_importance(pipeline, X, y, objective='Log Loss Binary')
~/development/evalml/evalml/model_understanding/graphs.py in calculate_permutation_importance(pipeline, X, y, objective, n_repeats, n_jobs, random_seed)
--> 385         perm_importance = _fast_permutation_importance(pipeline, X, y, objective, n_repeats=n_repeats, n_jobs=n_jobs,
~/development/evalml/evalml/model_understanding/graphs.py in _fast_permutation_importance(pipeline, X, y, objective, n_repeats, n_jobs, random_seed)
--> 350     baseline_score = scorer(pipeline, precomputed_features, y, objective)
~/development/evalml/evalml/model_understanding/graphs.py in scorer(pipeline, features, y, objective)
--> 342             preds = pipeline.estimator.predict_proba(features)
~/development/evalml/evalml/pipelines/components/component_base_meta.py in _check_for_fit(self, X, y)
---> 27                 return method(self, X)
~/development/evalml/evalml/pipelines/components/estimators/estimator.py in predict_proba(self, X)
---> 79             pred_proba = self._component_obj.predict_proba(X)
... (and then more frames in sklearn linear estimator)

ValueError: X has 21 features per sample; expecting 20

If you run the above but add a name to the pd Series for the target, you get a slightly different stack trace ending with

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

because the component graph somehow ends up returning all-NaNs for the target passed from the undersampler to the estimator.

codecov bot commented May 14, 2021

Codecov Report

Merging #2272 (88ed0c6) into main (016c831) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2272     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         280      280             
  Lines       24369    24382     +13     
=========================================
+ Hits        24347    24360     +13     
  Misses         22       22             
Impacted Files Coverage Δ
...s/components/transformers/samplers/base_sampler.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_oversamplers.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_undersampler.py 100.0% <100.0%> (ø)
...understanding_tests/test_permutation_importance.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@angela97lin (Contributor) left a comment

Thanks 👍 !!

is there an issue to track investigating how to fix the underlying bug?

@chukarsten (Contributor, Author) replied
@angela97lin not yet, no. I'm still enumerating some of the issues in the hotfix. There are 6 broken tests and this only fixed one, so I'll need re-review.

@dsherry changed the title from "0.24.0 hotfix" to "PipelineBase.compute_estimator_features fails for Undersampler" on May 14, 2021
@@ -57,7 +57,7 @@ def transform(self, X, y=None):
         X = infer_feature_types(X)
         if y is not None:
             y = infer_feature_types(y)
-        return X, y
+        return X, None

I wrote up an explanation for why this fixes the issue in the PR description. Copying here for visibility:

Because undersampler transform returns both X and y, our current ComponentGraph impl will save the unmodified target returned from that method under the key "Undersampler.y" in the component features dict during evaluation. This doesn't cause problems during pipeline fit and predict, but when features are computed for each transformer in _fit_transform_features_helper, that code erroneously grabs the "Undersampler.y" output and appends it to the input of the next component, which in automl and in our reproducer happens to be the estimator. This is why one of the stack traces at the bottom comes from the estimator and mentions too many features being present.


Long-term, we'll need to update our component graph to be tolerant to this behavior. If a component transform happens to return a target, we should use that target. If it doesn't we should just use the last modified target. We have logic for this already in the main component graph method for fit/predict, but not yet in _fit_transform_features_helper. We can address this separately once this patch is merged and tested.
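
In other words, the helper's target-selection logic could look something like this hypothetical sketch (not the existing component graph code):

def _select_target(transform_output, last_target):
    # If the component's transform returned a target, adopt it going forward;
    # otherwise keep using the most recently modified target.
    X_t, y_t = transform_output
    return X_t, (y_t if y_t is not None else last_target)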

        assert isinstance(transform_output[0], ww.DataTable)
        assert isinstance(transform_output[1], ww.DataColumn)
    elif 'sampler' in component.name:
        assert isinstance(transform_output[0], ww.DataTable)
        assert transform_output[1] is None
👍

    if 'sampler' in transformer.name:
        X_t, y_t = transformer.transform(X, y)
        X_t = X_t.to_dataframe()
        assert y_t is None
👍

@dsherry (Contributor) left a comment

@chukarsten looks great!

Before merging, please add direct coverage for calling calculate_permutation_importance on a pipeline which contains the undersampler. We can worry about adding generalized coverage later. I think a good place to do that is in evalml/tests/model_understanding_tests/test_permutation_importance.py.
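
Something along these lines, as a hedged sketch of the requested coverage (the test name and exact assertion are assumptions, not the test that was ultimately added):

import pandas as pd
from sklearn.datasets import make_classification
import evalml

def test_permutation_importance_with_undersampler():
    X, y = make_classification(random_state=0)
    X, y = pd.DataFrame(X), pd.Series(y)
    pipeline = evalml.pipelines.BinaryClassificationPipeline(
        component_graph=['Undersampler', 'Elastic Net Classifier'])
    pipeline.fit(X, y)
    # Before this fix, this call raised a ValueError from the estimator.
    importance = evalml.model_understanding.calculate_permutation_importance(
        pipeline, X, y, objective='Log Loss Binary')
    # Assuming one importance row per input feature.
    assert len(importance) == X.shape[1]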

@freddyaboulton (Contributor) left a comment

Looks great @chukarsten !

@dsherry added the "blocker" (An issue blocking a release.) and "bug" (Issues tracking problems with existing features.) labels on May 14, 2021
@bchen1116 (Contributor) left a comment

LGTM!

@chukarsten merged commit 674c45b into main on May 17, 2021
@chukarsten mentioned this pull request on May 17, 2021
@freddyaboulton deleted the 0.24.0_hotfix branch on May 13, 2022