-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to support Woodwork 0.5.1 #2610
Conversation
b641db8
to
0d98715
Compare
Codecov Report
@@ Coverage Diff @@
## main #2610 +/- ##
=======================================
+ Coverage 99.9% 99.9% +0.1%
=======================================
Files 297 297
Lines 27033 27071 +38
=======================================
+ Hits 26989 27027 +38
Misses 44 44
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was a bit curious about the updates required for WW so took a quick look at this. Noticed a couple things that I thought I'd mention. Feel free to ignore or act upon as you see fit.
e18f41b
to
004f752
Compare
@@ -76,7 +76,10 @@ def fit(self, X, y): | |||
""" | |||
if y is None: | |||
return self | |||
y = infer_feature_types(y).to_frame() | |||
y = infer_feature_types(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the exception for the target imputer to fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, but couldn't you do
y = infer_feature_types(y).to_frame()
if all(y.isnull()):
raise TypeError("Provided target full of nulls.")
just to shorten/simplify slightly?
@@ -973,7 +973,7 @@ def fit(self, X, y): | |||
return self | |||
|
|||
def predict(self, X): | |||
series = pd.Series() | |||
series = pd.Series(dtype="string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change was to accommodate the way empty series are now inferred. Woodwork complains if you don't do this.
@@ -72,15 +72,17 @@ def test_some_missing_col_names(text_df, caplog): | |||
} | |||
|
|||
|
|||
def test_lsa_empty_text_column(): | |||
X = pd.DataFrame({"col_1": []}) | |||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to expand the coverage here to cover other commonly "empty" columns. Also, the TextFeaturizer and LSA component have code that seems to expect tolerance of an empty column.
e.g.
def fit(self, X, y=None):
X = infer_feature_types(X)
self._text_columns = self._get_text_columns(X)
if len(self._text_columns) == 0:
return self
I thought this was strange that we would pass through an empty column and expect SKlearn to return the original ValueError, but also have code here that seemingly accounts for the behavior of what LSA (and TextFeaturizer) should do upon receiving an empty column. I wonder if, perhaps, the original intent was to account for two distincy cases of 1.) empty columns whose type is known as a string or natural language and 2.) an empty column whose type is unknown. Be happy to hear additional input here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think the purpose of X = infer_feature_types(X, {"col_1": "NaturalLanguage"})
was to verify that even an empty column whose type is NaturalLanguage
will be identified by self._get_text_columns(X)
and will not return self
immediately. And since transform
looks for that as well
if len(self._text_columns) == 0:
return X_ww
sklearn
would be called to transform the features and raise an error. If X = infer_feature_types(X, {"col_1": "NaturalLanguage"})
is removed, no text features are recognized and self is returned immediately.
Can't speak to the original reasoning but it looks like those are the cases being presented.
@@ -64,8 +64,8 @@ def _get_test_data_from_configuration( | |||
"[email protected]", | |||
"[email protected]", | |||
"[email protected]", | |||
"$titanic_data%&@hotmail.com", | |||
"foo*[email protected]", | |||
"[email protected]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I submitted this issue to Woodwork to cover these email addresses which slipped through the WW EmailAddress inference. @davesque since I saw you did the Email inference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chukarsten Yeah, never seen email addresses like that before :). I think it's safe to delete them from test data to accommodate the woodwork update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I think alteryx/woodwork#1080 will help cover against users passing impossible email values by manually specifying the email type.
if "email" in column_names and input_type == "ww" | ||
else [] | ||
) | ||
email_featurizer = [EmailFeaturizer] if "email" in column_names else [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the change required for Email inference in WW.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to go, thanks Karsten! Agreed w/ Freddy on the double DropColumn, I think we're set on AutoML side but it's maybe worth discussing outside this context why we have AutoML do this in the first place :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! I left a few questions/nits, but agreed with @freddyaboulton that we should try not to append 2 DropColumn
components to the pipeline, especially since it's likely we only set 1 of them.
@@ -76,7 +76,10 @@ def fit(self, X, y): | |||
""" | |||
if y is None: | |||
return self | |||
y = infer_feature_types(y).to_frame() | |||
y = infer_feature_types(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, but couldn't you do
y = infer_feature_types(y).to_frame()
if all(y.isnull()):
raise TypeError("Provided target full of nulls.")
just to shorten/simplify slightly?
evalml/utils/woodwork_utils.py
Outdated
return ww.init_series(data, logical_type=feature_types) | ||
else: | ||
ww_data = data.copy() | ||
# Revert the inference of all nulls to the unknown type and change it back to double. | ||
all_null_cols = ww_data.columns[ww_data.isnull().all(0)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I handled that specific case. Feel free to check it out. I was considering maybe adding a similar test for the target imputer with the y series being pre-inited.
15f5c1e
to
6da3e03
Compare
…ll columns as null columns are now inferred to Unknown type.
… for text featurizers to realize they have empty columns and adopt that behavior.
… to accomodate the new check for Unknown in get_pp_components. Made Email get treated properly in testing as WW should infer it properly now. Made infer_feature_types replace all pd.NA with np.nan for series as well as dataframes.
…n infer_feature_types.
…ction in infer_feature_types.
6da3e03
to
f757ba4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work on this @chukarsten !!
@@ -64,8 +64,8 @@ def _get_test_data_from_configuration( | |||
"[email protected]", | |||
"[email protected]", | |||
"[email protected]", | |||
"$titanic_data%&@hotmail.com", | |||
"foo*[email protected]", | |||
"[email protected]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I think alteryx/woodwork#1080 will help cover against users passing impossible email values by manually specifying the email type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like the new test! left one comment on a typo, but LGTM
), | ||
) | ||
def test_infer_feature_types_NA_to_nan(null_col, already_inited): | ||
"""A short test to make sure that columnds with all null values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
classic :)
Fixes #2543