
Update ComponentGraph to handle not calling transform during predict, and update samplers' transform methods s.t. fit_transform is equivalent to fit(X, y).transform(X, y) #2583

Merged · 37 commits · Aug 9, 2021

Conversation

@angela97lin (Contributor) commented Aug 3, 2021

  • Closes Handling of BaseSampler's Transform #2273
  • Removes the need to keep track of the most recent y value and to consolidate inputs. We should be able to just grab the parent output now that ComponentGraphs must be fully specified.
    • This is related to sampler behavior: we currently call "Undersampler.y" expecting that it'll grab the parent input. Since the undersampler no longer returns None, the component graph would otherwise take the undersampler's output, which is not what we want.
  • General cleanup of BaseSampler and subclasses

A little writeup of sampler and ComponentGraph behavior here: https://alteryx.atlassian.net/wiki/spaces/PS/pages/958890019/Samplers+and+Component+Graph
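
As a minimal sketch of the intended behavior (toy classes, not evalml's actual implementation), the two halves of this change fit together like this: a sampler's transform now really resamples, so fit_transform(X, y) is equivalent to fit(X, y).transform(X, y), and the component graph takes on the job of skipping samplers entirely at predict time:

```python
class ToySampler:
    """Stand-in for a real sampler; keeps every other row as fake resampling."""

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        # transform resamples, so fit(X, y).transform(X, y) == fit_transform(X, y)
        return X[::2], y[::2]

    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X, y)


def compute_features(component, X, y, fit=False):
    """Toy version of the graph's per-component step."""
    if fit:
        return component.fit_transform(X, y)
    if isinstance(component, ToySampler):
        # predict path: never resample inference data
        return X, y
    return component.transform(X, y)
```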

@angela97lin self-assigned this Aug 3, 2021
@codecov (bot) commented Aug 3, 2021

Codecov Report

Merging #2583 (dcec55c) into main (c85233e) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2583     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        295     295             
  Lines      26894   26896      +2     
=======================================
+ Hits       26848   26852      +4     
+ Misses        46      44      -2     
| Impacted Files | Coverage Δ |
|---|---|
| evalml/pipelines/pipeline_base.py | 98.3% <ø> (ø) |
| ...valml/preprocessing/data_splitters/sampler_base.py | 100.0% <ø> (ø) |
| ...understanding_tests/test_permutation_importance.py | 100.0% <ø> (ø) |
| evalml/pipelines/component_graph.py | 99.7% <100.0%> (+0.7%) ⬆️ |
| ...s/components/transformers/samplers/base_sampler.py | 100.0% <100.0%> (ø) |
| ...s/components/transformers/samplers/oversamplers.py | 100.0% <100.0%> (ø) |
| ...s/components/transformers/samplers/undersampler.py | 100.0% <100.0%> (ø) |
| evalml/tests/component_tests/test_components.py | 100.0% <100.0%> (ø) |
| evalml/tests/component_tests/test_oversamplers.py | 100.0% <100.0%> (ø) |
| evalml/tests/component_tests/test_undersampler.py | 100.0% <100.0%> (ø) |

... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c85233e...dcec55c.

@angela97lin changed the title from "Refactor and clean up ComponentGraph methods" to "Update ComponentGraph to handle not calling transform during predict, and update samplers' transform methods s.t. fit_transform is equivalent to fit(X, y).transform(X, y)" Aug 5, 2021
@@ -386,31 +381,6 @@ def _get_feature_provenance(self, input_feature_names):
if len(children)
}

@staticmethod
angela97lin (Contributor, Author):

Simpler logic means we no longer need this :)

@@ -926,7 +925,7 @@ def test_transformer_transform_output_type(X_y_binary):
assert transform_output[0].shape == X.shape
assert transform_output[1].shape[0] == X.shape[0]
assert len(transform_output[1].shape) == 1
elif "sampler" in component.name:
elif isinstance(component, BaseSampler):
angela97lin (Contributor, Author):

Using this as a way to check whether we're testing a sampler component, rather than relying on the component name :)


@patch("evalml.pipelines.components.estimators.LogisticRegressionClassifier.fit")
@pytest.mark.parametrize("sampler", ["Undersampler", "SMOTE Oversampler"])
def test_component_graph_compute_final_component_features_with_sampler(
angela97lin (Contributor, Author):

The original issue came up when using a sampler while calculating permutation importance. However, the issue wasn't with permutation importance specifically but rather with the pipeline calling compute_estimator_features, which called compute_final_component_features. Hence, testing this here :)
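
Building on the ToySampler/compute_features sketch above (again hypothetical, not the test that actually landed), the regression reduces to asserting that the predict-time feature path leaves row counts alone while the fit path resamples:

```python
def test_final_component_features_skip_sampler_at_predict():
    X = list(range(10))
    y = [0, 1] * 5
    sampler = ToySampler().fit(X, y)

    # predict path (fit=False): no rows dropped or duplicated
    X_out, y_out = compute_features(sampler, X, y, fit=False)
    assert len(X_out) == len(X)

    # fit path: the sampler actually resamples
    X_fit, y_fit = compute_features(sampler, X, y, fit=True)
    assert len(X_fit) < len(X)
```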

super().fit(X, y)
self._initialize_oversampler(X, y, self.sampler)

def _initialize_oversampler(self, X, y, sampler_class):
angela97lin (Contributor, Author):

Renamed this to _initialize_sampler (also renamed _initialize_undersampler) so I could consolidate the fit method for both subclasses in the BaseSampler class.

Contributor:

Nice. I think we should make _initialize_sampler an abstractmethod of BaseSampler in this case then!

angela97lin (Contributor, Author):

Agreed and done 😁
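
A rough sketch of the shape this discussion lands on (simplified and hypothetical, not the merged evalml code): BaseSampler owns the consolidated fit, and each subclass is forced to supply its own _initialize_sampler:

```python
from abc import ABC, abstractmethod


class BaseSampler(ABC):
    def fit(self, X, y):
        # the fit logic shared by both subclasses lives here, once
        self._initialize_sampler(X, y)
        return self

    @abstractmethod
    def _initialize_sampler(self, X, y):
        """Construct the underlying sampler; each subclass implements this."""


class Undersampler(BaseSampler):
    def _initialize_sampler(self, X, y):
        self._sampler = ...  # placeholder: build the underlying undersampler


class Oversampler(BaseSampler):
    def _initialize_sampler(self, X, y):
        self._sampler = ...  # placeholder: build the underlying oversampler (e.g. SMOTE)
```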

@chukarsten (Contributor) left a review:

Solid PR. Thank you very much for picking up after us. As always, appreciate the cleanup in addition to what you set out to do. Just ultra-nitted off your hyper-nits a few times, but nothing blocking.

@freddyaboulton (Contributor) left a review:

@angela97lin Thank you for this! I appreciate the clean up in the component graph code. Left a couple of small comments but nothing blocking.


     if isinstance(component_instance, Transformer):
         if fit:
-            output = component_instance.fit_transform(input_x, input_y)
+            output = component_instance.fit_transform(x_inputs, y_input)
         elif isinstance(component_instance, BaseSampler):
Contributor:

If we wanted to skip an additional transformer during fit=False in the future that wasn't a sampler, this is the only line we would have to modify, right?

angela97lin (Contributor, Author):

Yupperino! 😁
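
To make that concrete, a hedged sketch (hypothetical helper, reusing the BaseSampler name from the sketch above rather than evalml's real method) of why this branch is the single extension point:

```python
# Hypothetical: widen this tuple to skip additional transformer types
# at predict time without touching the rest of the traversal.
SKIP_DURING_PREDICT = (BaseSampler,)


def _transform_step(component, X, y, fit):
    if fit:
        return component.fit_transform(X, y)
    if isinstance(component, SKIP_DURING_PREDICT):
        return X, y  # pass inputs through untouched at predict time
    return component.transform(X, y)
```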

-    X (pd.DataFrame or np.ndarray): Data of shape [n_samples, n_features]
     y (pd.Series, np.ndarray): True labels of length [n_samples]
     objectives (list): Non-empty list of objectives to score on
+    X (pd.DataFrame or np.ndarray): Data of shape [n_samples, n_features].
Contributor:

@angela97lin do you want to pick up #878 😂

Joking, not joking though. I stopped working on it because, as it stood last year, pydocstyle would let some false negatives through. That being said, it may be able to standardize our docstrings to like "80%" of a reasonable format. Or there could be a better tool out there by now!

angela97lin (Contributor, Author):

LOL oo interesting, thanks for pointing me to this! Maybe I will pick this up in my spare time, since I'm usually the annoying one running around updating docstrings with periods and capitalizing random things 😂

-    assert len(new_X) == sum(num_samples)
-    assert len(new_y) == sum(num_samples)
-    value_counts = new_y.value_counts()
+    assert len(fit_transformed_X) == sum(num_samples)
Contributor:

I think we should also check fit(X, y).transform(X, y) == fit_transform(X, y) directly. My reasoning is that this test checks that the classes are balanced after calling fit and transform, but that doesn't necessarily guarantee that fit(X, y).transform(X, y) == fit_transform(X, y).

angela97lin (Contributor, Author):

You're right! I'm just going to remove the second set of assertions and check that fit(X, y).transform(X, y) == fit_transform(X, y); if the first set of assertions checks that fit_transform(X, y) leads to balanced classes, and fit(X, y).transform(X, y) == fit_transform(X, y), then fit(X, y).transform(X, y) should also lead to balanced classes :)
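
A hedged sketch of that direct equivalence check (the sampler_class fixture, the data, and a fixed seed parameter, here random_seed, for determinism are all assumptions; pandas is used for the comparison):

```python
import pandas as pd


def test_fit_transform_matches_fit_then_transform(sampler_class, X, y):
    # same seed for both instances, so any difference between the two call
    # paths is a real behavioral mismatch rather than sampling noise
    X_ft, y_ft = sampler_class(random_seed=0).fit_transform(X, y)
    X_t, y_t = sampler_class(random_seed=0).fit(X, y).transform(X, y)
    pd.testing.assert_frame_equal(X_ft, X_t)
    pd.testing.assert_series_equal(y_ft, y_t)
```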

Labels: None yet
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Handling of BaseSampler's Transform
3 participants