Update ComponentGraph to handle not calling transform during predict, and update samplers' transform methods s.t. fit_transform is equivalent to fit(X, y).transform(X, y) #2583
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main   #2583   +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%
=======================================
  Files        295     295
  Lines      26894   26896      +2
=======================================
+ Hits       26848   26852      +4
+ Misses        46      44      -2
@@ -386,31 +381,6 @@ def _get_feature_provenance(self, input_feature_names):
            if len(children)
        }

    @staticmethod
Simpler logic means we no longer need this :)
@@ -926,7 +925,7 @@ def test_transformer_transform_output_type(X_y_binary):
        assert transform_output[0].shape == X.shape
        assert transform_output[1].shape[0] == X.shape[0]
        assert len(transform_output[1].shape) == 1
-    elif "sampler" in component.name:
+    elif isinstance(component, BaseSampler):
Using this as a way to check whether we're testing a sampler component, rather than relying on the name :)
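A toy sketch of the contrast (the class names mirror evalml's, but these are stand-ins, not the real components):

```python
class BaseSampler:
    """Toy stand-in for evalml's BaseSampler."""

class Undersampler(BaseSampler):
    """Toy stand-in; only the name attribute matters here."""
    name = "Undersampler"

component = Undersampler()

# Name-based check: brittle -- it breaks if a component is ever renamed
assert "sampler" in component.name

# Type-based check: robust -- tied to the class hierarchy, not the label
assert isinstance(component, BaseSampler)
```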
@patch("evalml.pipelines.components.estimators.LogisticRegressionClassifier.fit")
@pytest.mark.parametrize("sampler", ["Undersampler", "SMOTE Oversampler"])
def test_component_graph_compute_final_component_features_with_sampler(
The original issue came up when using a sampler while calculating permutation importance. However, the issue wasn't with permutation importance specifically, but rather with the pipeline calling compute_estimator_features, which called compute_final_component_features. Hence, testing this here :)
        super().fit(X, y)
        self._initialize_oversampler(X, y, self.sampler)

    def _initialize_oversampler(self, X, y, sampler_class):
Renamed _initialize_sampler (also renamed _initialize_undersampler) so I could consolidate the fit method for both subclasses in the BaseSampler class.
Nice. I think we should make _initialize_sampler an abstractmethod of BaseSampler in this case then!
Agreed and done 😁
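A minimal sketch of the consolidation being discussed (hypothetical simplified classes, not evalml's actual implementations):

```python
from abc import ABC, abstractmethod

class BaseSampler(ABC):
    """Simplified stand-in for evalml's BaseSampler."""

    def fit(self, X, y):
        # Shared fit logic lives once in the base class; subclasses only
        # supply their sampler-specific setup.
        self._initialize_sampler(X, y)
        return self

    @abstractmethod
    def _initialize_sampler(self, X, y):
        """Each subclass configures its own underlying sampler."""

class Oversampler(BaseSampler):
    def _initialize_sampler(self, X, y):
        self._sampler = "oversample"  # placeholder for real setup

class Undersampler(BaseSampler):
    def _initialize_sampler(self, X, y):
        self._sampler = "undersample"  # placeholder for real setup
```

With `_initialize_sampler` declared abstract, instantiating `BaseSampler` directly raises a `TypeError`, so every concrete sampler is forced to provide its own setup.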
Solid PR. Thank you very much for picking up after us. As always - appreciate the clean up in addition to what you set out to do. Just ultra-nitted off your hyper-nits a few times, but nothing blocking.
@angela97lin Thank you for this! I appreciate the clean up in the component graph code. Left a couple of small comments but nothing blocking.
if isinstance(component_instance, Transformer):
    if fit:
-        output = component_instance.fit_transform(input_x, input_y)
+        output = component_instance.fit_transform(x_inputs, y_input)
    elif isinstance(component_instance, BaseSampler):
If we wanted to skip an additional transformer during fit=False that wasn't a sampler in the future, this is the only line we would have to modify, right?
Yupperino! 😁
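For context, the dispatch under discussion can be sketched roughly like this (a hypothetical compute_output helper with toy components, not evalml's ComponentGraph code):

```python
class Transformer:
    """Toy transformer: doubles every feature value."""

    def transform(self, X, y):
        return [x * 2 for x in X], y

    def fit_transform(self, X, y):
        return self.transform(X, y)

class BaseSampler(Transformer):
    """Toy sampler: keeps only the first two rows."""

    def transform(self, X, y):
        return X[:2], y[:2]

def compute_output(component, X, y, fit=True):
    # Samplers only run while fitting; at predict time (fit=False) they
    # pass the data through untouched. This isinstance branch is the one
    # place to extend if other transformer types should also be skipped.
    if fit:
        return component.fit_transform(X, y)
    if isinstance(component, BaseSampler):
        return X, y
    return component.transform(X, y)
```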
X (pd.DataFrame or np.ndarray): Data of shape [n_samples, n_features]
y (pd.Series, np.ndarray): True labels of length [n_samples]
objectives (list): Non-empty list of objectives to score on
X (pd.DataFrame or np.ndarray): Data of shape [n_samples, n_features].
@angela97lin do you want to pick up #878 😂 Joking not joking though. I stopped working on it because, as it stood last year, pydocstyle would let some false negatives through. That being said, it may be able to standardize our docstrings to like "80%" of a reasonable format. Or there could be a better tool out there by now!
LOL oo interesting, thanks for pointing me to this! Maybe I will pick this up in my spare time, since I'm usually the annoying one running around updating docstrings with periods and capitalizing random things 😂
assert len(new_X) == sum(num_samples)
assert len(new_y) == sum(num_samples)
value_counts = new_y.value_counts()
assert len(fit_transformed_X) == sum(num_samples)
I think we should also check fit(X, y).transform(X, y) == fit_transform(X, y) directly. My reasoning is that this test checks that the classes are balanced after calling fit and transform, but that doesn't necessarily guarantee that fit(X, y).transform(X, y) == fit_transform(X, y)?
You're right! I'm just going to remove the second set of assertions and check that fit(X, y).transform(X, y) == fit_transform(X, y); if the first set of assertions checks that fit_transform(X, y) leads to balanced classes and fit(X, y).transform(X, y) == fit_transform(X, y) then fit(X, y).transform(X, y) should also lead to balanced classes :)
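The equivalence check itself could look something like this (ToySampler is a hypothetical stand-in used only to illustrate the assertions; the real test would exercise the actual sampler components):

```python
import pandas as pd

class ToySampler:
    """Hypothetical undersampler for illustrating the equivalence check."""

    def fit(self, X, y):
        self._n = min(y.value_counts())  # minority-class count
        return self

    def transform(self, X, y):
        idx = y.groupby(y).head(self._n).index  # naive undersampling
        return X.loc[idx], y.loc[idx]

    def fit_transform(self, X, y):
        # Defined so fit_transform(X, y) == fit(X, y).transform(X, y)
        return self.fit(X, y).transform(X, y)

X = pd.DataFrame({"a": range(6)})
y = pd.Series([0, 0, 0, 0, 1, 1])

Xt1, yt1 = ToySampler().fit_transform(X, y)
Xt2, yt2 = ToySampler().fit(X, y).transform(X, y)

# The direct equivalence assertions suggested above
pd.testing.assert_frame_equal(Xt1, Xt2)
pd.testing.assert_series_equal(yt1, yt2)
```

If the first set of assertions shows fit_transform yields balanced classes and this equivalence holds, balance after fit-then-transform follows for free.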
BaseSampler and subclasses

A little writeup of sampler and ComponentGraph behavior here: https://alteryx.atlassian.net/wiki/spaces/PS/pages/958890019/Samplers+and+Component+Graph