
Update ComponentGraph to handle not calling transform during predict, and update samplers' transform methods s.t. fit_transform is equivalent to fit(X, y).transform(X, y) #2583

Merged · 37 commits · Aug 9, 2021

Conversation

@angela97lin (Contributor) commented Aug 3, 2021

  • Closes Handling of BaseSampler's Transform #2273
  • Removes the need to keep track of the most recent y value and to consolidate inputs. We should be able to just grab the parent output now that ComponentGraphs must be fully specified.
    • This is related to sampler behavior: we currently call "Undersampler.y" expecting that it'll grab the parent input. Since the undersampler no longer returns None, the component graph would otherwise take the undersampler's output, which is not what we want.
  • General cleanup of BaseSampler and subclasses

A little writeup of sampler and ComponentGraph behavior here: https://alteryx.atlassian.net/wiki/spaces/PS/pages/958890019/Samplers+and+Component+Graph
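
As a minimal sketch of the intended behavior (toy classes, not evalml's actual implementation), the two halves of this change fit together like this: a sampler's transform now really resamples, so fit_transform(X, y) is equivalent to fit(X, y).transform(X, y), and the component graph takes on the job of skipping samplers entirely at predict time:

```python
class ToySampler:
    """Stand-in for a real sampler; keeps every other row as fake resampling."""

    def fit(self, X, y):
        return self

    def transform(self, X, y):
        # transform resamples, so fit(X, y).transform(X, y) == fit_transform(X, y)
        return X[::2], y[::2]

    def fit_transform(self, X, y):
        return self.fit(X, y).transform(X, y)


def compute_features(component, X, y, fit=False):
    """Toy version of the graph's per-component step."""
    if fit:
        return component.fit_transform(X, y)
    if isinstance(component, ToySampler):
        # predict path: never resample inference data
        return X, y
    return component.transform(X, y)
```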

@angela97lin self-assigned this Aug 3, 2021
@codecov (bot) commented Aug 3, 2021

Codecov Report

Merging #2583 (dcec55c) into main (c85233e) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2583     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        295     295             
  Lines      26894   26896      +2     
=======================================
+ Hits       26848   26852      +4     
+ Misses        46      44      -2     
| Impacted Files | Coverage Δ |
|---|---|
| evalml/pipelines/pipeline_base.py | 98.3% <ø> (ø) |
| ...valml/preprocessing/data_splitters/sampler_base.py | 100.0% <ø> (ø) |
| ...understanding_tests/test_permutation_importance.py | 100.0% <ø> (ø) |
| evalml/pipelines/component_graph.py | 99.7% <100.0%> (+0.7%) ⬆️ |
| ...s/components/transformers/samplers/base_sampler.py | 100.0% <100.0%> (ø) |
| ...s/components/transformers/samplers/oversamplers.py | 100.0% <100.0%> (ø) |
| ...s/components/transformers/samplers/undersampler.py | 100.0% <100.0%> (ø) |
| evalml/tests/component_tests/test_components.py | 100.0% <100.0%> (ø) |
| evalml/tests/component_tests/test_oversamplers.py | 100.0% <100.0%> (ø) |
| evalml/tests/component_tests/test_undersampler.py | 100.0% <100.0%> (ø) |

... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c85233e...dcec55c.

@angela97lin changed the title from "Refactor and clean up ComponentGraph methods" to "Update ComponentGraph to handle not calling transform during predict, and update samplers' transform methods s.t. fit_transform is equivalent to fit(X, y).transform(X, y)" Aug 5, 2021
@@ -386,31 +381,6 @@ def _get_feature_provenance(self, input_feature_names):
if len(children)
}

@staticmethod
angela97lin (Contributor, Author):

Simpler logic means we no longer need this :)

@@ -926,7 +925,7 @@ def test_transformer_transform_output_type(X_y_binary):
assert transform_output[0].shape == X.shape
assert transform_output[1].shape[0] == X.shape[0]
assert len(transform_output[1].shape) == 1
elif "sampler" in component.name:
elif isinstance(component, BaseSampler):
angela97lin (Contributor, Author):

Using this as a way to check whether we're testing a sampler component, rather than relying on the component name :)


@patch("evalml.pipelines.components.estimators.LogisticRegressionClassifier.fit")
@pytest.mark.parametrize("sampler", ["Undersampler", "SMOTE Oversampler"])
def test_component_graph_compute_final_component_features_with_sampler(
angela97lin (Contributor, Author):

The original issue came up when using a sampler while calculating permutation importance. However, the issue wasn't with permutation importance specifically but rather with the pipeline calling compute_estimator_features, which called compute_final_component_features. Hence, testing this here :)
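
Building on the ToySampler/compute_features sketch above (again hypothetical, not the test that actually landed), the regression reduces to asserting that the predict-time feature path leaves row counts alone while the fit path resamples:

```python
def test_final_component_features_skip_sampler_at_predict():
    X = list(range(10))
    y = [0, 1] * 5
    sampler = ToySampler().fit(X, y)

    # predict path (fit=False): no rows dropped or duplicated
    X_out, y_out = compute_features(sampler, X, y, fit=False)
    assert len(X_out) == len(X)

    # fit path: the sampler actually resamples
    X_fit, y_fit = compute_features(sampler, X, y, fit=True)
    assert len(X_fit) < len(X)
```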

super().fit(X, y)
self._initialize_oversampler(X, y, self.sampler)

def _initialize_oversampler(self, X, y, sampler_class):
angela97lin (Contributor, Author):

Renamed this to _initialize_sampler (also renamed _initialize_undersampler) so I could consolidate the fit method for both subclasses in the BaseSampler class.

Contributor:

Nice. I think we should make _initialize_sampler an abstractmethod of BaseSampler in this case then!

angela97lin (Contributor, Author):

Agreed and done 😁
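
A rough sketch of the shape this discussion lands on (simplified and hypothetical, not the merged evalml code): BaseSampler owns the consolidated fit, and each subclass is forced to supply its own _initialize_sampler:

```python
from abc import ABC, abstractmethod


class BaseSampler(ABC):
    def fit(self, X, y):
        # the fit logic shared by both subclasses lives here, once
        self._initialize_sampler(X, y)
        return self

    @abstractmethod
    def _initialize_sampler(self, X, y):
        """Construct the underlying sampler; each subclass implements this."""


class Undersampler(BaseSampler):
    def _initialize_sampler(self, X, y):
        self._sampler = ...  # placeholder: build the underlying undersampler


class Oversampler(BaseSampler):
    def _initialize_sampler(self, X, y):
        self._sampler = ...  # placeholder: build the underlying oversampler (e.g. SMOTE)
```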

@chukarsten (Contributor) left a review:

Solid PR. Thank you very much for picking up after us. As always, appreciate the cleanup in addition to what you set out to do. Just ultra-nitted off your hyper-nits a few times, but nothing blocking.

@freddyaboulton (Contributor) left a review:

@angela97lin Thank you for this! I appreciate the clean up in the component graph code. Left a couple of small comments but nothing blocking.


     if isinstance(component_instance, Transformer):
         if fit:
-            output = component_instance.fit_transform(input_x, input_y)
+            output = component_instance.fit_transform(x_inputs, y_input)
         elif isinstance(component_instance, BaseSampler):
Contributor:

If we wanted to skip an additional transformer during fit=False in the future that wasn't a sampler, this is the only line we would have to modify, right?

angela97lin (Contributor, Author):

Yupperino! 😁
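
To make that concrete, a hedged sketch (hypothetical helper, reusing the BaseSampler name from the sketch above rather than evalml's real method) of why this branch is the single extension point:

```python
# Hypothetical: widen this tuple to skip additional transformer types
# at predict time without touching the rest of the traversal.
SKIP_DURING_PREDICT = (BaseSampler,)


def _transform_step(component, X, y, fit):
    if fit:
        return component.fit_transform(X, y)
    if isinstance(component, SKIP_DURING_PREDICT):
        return X, y  # pass inputs through untouched at predict time
    return component.transform(X, y)
```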

-    X (pd.DataFrame or np.ndarray): Data of shape [n_samples, n_features]
     y (pd.Series, np.ndarray): True labels of length [n_samples]
     objectives (list): Non-empty list of objectives to score on
+    X (pd.DataFrame or np.ndarray): Data of shape [n_samples, n_features].
Contributor:

@angela97lin do you want to pick up #878 😂

Joking, not joking though. I stopped working on it because, as it stood last year, pydocstyle would let some false negatives through. That being said, it may be able to standardize our docstrings to like "80%" of a reasonable format. Or there could be a better tool out there by now!

angela97lin (Contributor, Author):

LOL oo interesting, thanks for pointing me to this! Maybe I will pick this up in my spare time, since I'm usually the annoying one running around updating docstrings with periods and capitalizing random things 😂

-    assert len(new_X) == sum(num_samples)
-    assert len(new_y) == sum(num_samples)
-    value_counts = new_y.value_counts()
+    assert len(fit_transformed_X) == sum(num_samples)
Contributor:

I think we should also check fit(X, y).transform(X, y) == fit_transform(X, y) directly. My reasoning is that this test checks that the classes are balanced after calling fit and transform, but that doesn't necessarily guarantee that fit(X, y).transform(X, y) == fit_transform(X, y).

angela97lin (Contributor, Author):

You're right! I'm just going to remove the second set of assertions and check that fit(X, y).transform(X, y) == fit_transform(X, y); if the first set of assertions checks that fit_transform(X, y) leads to balanced classes, and fit(X, y).transform(X, y) == fit_transform(X, y), then fit(X, y).transform(X, y) should also lead to balanced classes :)
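
A hedged sketch of that direct equivalence check (the sampler_class fixture, the data, and a fixed seed parameter, here random_seed, for determinism are all assumptions; pandas is used for the comparison):

```python
import pandas as pd


def test_fit_transform_matches_fit_then_transform(sampler_class, X, y):
    # same seed for both instances, so any difference between the two call
    # paths is a real behavioral mismatch rather than sampling noise
    X_ft, y_ft = sampler_class(random_seed=0).fit_transform(X, y)
    X_t, y_t = sampler_class(random_seed=0).fit(X, y).transform(X, y)
    pd.testing.assert_frame_equal(X_ft, X_t)
    pd.testing.assert_series_equal(y_ft, y_t)
```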

Labels: None yet
Projects: None yet
Development

Successfully merging this pull request may close these issues.

Handling of BaseSampler's Transform
3 participants