-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to support Woodwork 0.5.1 #2610
Changes from 19 commits
7544dde
84ab5dd
9726e23
23d0579
1f2f699
371c58e
4f36996
2cb0c12
ad3270b
d60b1b1
5e0ced2
b51a46c
17010a5
aed9b6f
71d0fdc
2399fe5
f757ba4
13c589b
2543a72
dc0f82f
115fb6b
4a4af75
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -973,7 +973,7 @@ def fit(self, X, y): | |
return self | ||
|
||
def predict(self, X): | ||
series = pd.Series() | ||
series = pd.Series(dtype="string") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this change was to accommodate the way empty series are now inferred. Woodwork complains if you don't do this. |
||
series.ww.init() | ||
return series | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -64,8 +64,8 @@ def _get_test_data_from_configuration( | |
"[email protected]", | ||
"[email protected]", | ||
"[email protected]", | ||
"$titanic_data%&@hotmail.com", | ||
"foo*EMAIL@email.org", | ||
"[email protected]", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I submitted this issue to Woodwork to cover these email addresses which slipped through the WW EmailAddress inference. @davesque since I saw you did the Email inference. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @chukarsten Yeah, never seen email addresses like that before :). I think it's safe to delete them from test data to accommodate the woodwork update. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed. I think alteryx/woodwork#1080 will help cover against users passing impossible email values by manually specifying the email type. |
||
"fooEMAIL@email.org", | ||
"[email protected]", | ||
"[email protected]", | ||
], | ||
|
@@ -197,11 +197,7 @@ def test_make_pipeline( | |
if "text" in column_names and input_type == "ww" | ||
else [] | ||
) | ||
email_featurizer = ( | ||
[EmailFeaturizer] | ||
if "email" in column_names and input_type == "ww" | ||
else [] | ||
) | ||
email_featurizer = [EmailFeaturizer] if "email" in column_names else [] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the change required for Email inference in WW. |
||
url_featurizer = ( | ||
[URLFeaturizer] if "url" in column_names and input_type == "ww" else [] | ||
) | ||
|
@@ -213,7 +209,7 @@ def test_make_pipeline( | |
) | ||
drop_col = ( | ||
[DropColumns] | ||
if any(ltype in column_names for ltype in ["url", "email", "text"]) | ||
if any(ltype in column_names for ltype in ["url", "text"]) | ||
and input_type == "pd" | ||
else [] | ||
) | ||
|
@@ -223,8 +219,8 @@ def test_make_pipeline( | |
+ url_featurizer | ||
+ drop_null | ||
+ text_featurizer | ||
+ imputer | ||
+ drop_col | ||
+ imputer | ||
+ datetime | ||
+ delayed_features | ||
+ ohe | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,8 +1,16 @@ | ||
from itertools import product | ||
|
||
import numpy as np | ||
import pandas as pd | ||
import pytest | ||
import woodwork as ww | ||
from woodwork.logical_types import Categorical, Datetime, Ordinal | ||
from woodwork.logical_types import ( | ||
Categorical, | ||
Datetime, | ||
Double, | ||
Ordinal, | ||
Unknown, | ||
) | ||
|
||
from evalml.utils import ( | ||
_convert_numeric_dataset_pandas, | ||
|
@@ -257,3 +265,49 @@ def test_ordinal_retains_order_min(): | |
) | ||
ltypes = cat_subset.ww.logical_types | ||
assert not hasattr(ltypes["categorical"], "encoding") | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"null_col,already_inited", | ||
product( | ||
[ | ||
[None, None, None], | ||
[np.nan, np.nan, np.nan], | ||
[pd.NA, pd.NA, pd.NA], | ||
["ax23n9ck23l", "1,28&*_%*&&xejc", "xnmvz@@Dcmeods-0"], | ||
], | ||
[True, False], | ||
), | ||
) | ||
def test_infer_feature_types_NA_to_nan(null_col, already_inited): | ||
"""A short test to make sure that columnds with all null values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. typo: columns There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. classic :) |
||
get converted from woodwork Unknown logical type with string | ||
physical type back to the original Double logical type with | ||
float physical type. Other Unknown columns should remain unchanged.""" | ||
|
||
df = pd.DataFrame( | ||
{ | ||
"unknown": null_col, | ||
}, | ||
) | ||
|
||
# Check that all null columns are inferred as Unknown type. | ||
df.ww.init() | ||
assert isinstance(df.ww.logical_types["unknown"], Unknown) | ||
if all(df["unknown"].isnull()): | ||
assert all([isinstance(x, type(pd.NA)) for x in df["unknown"]]) | ||
else: | ||
assert all([isinstance(x, str) for x in df["unknown"]]) | ||
|
||
# Use infer_feature_types() to init the WW accessor and verify that all | ||
# null columns are now Double logical types backed by np.nan and that other | ||
# columns inferred as Unknown remain untouched. | ||
del df.ww | ||
if already_inited: | ||
df.ww.init() | ||
inferred_df = infer_feature_types(df) | ||
if all(df["unknown"].isnull()): | ||
assert isinstance(inferred_df.ww.logical_types["unknown"], Double) | ||
assert all([isinstance(x, type(np.nan)) for x in inferred_df["unknown"]]) | ||
else: | ||
assert all([isinstance(x, str) for x in df["unknown"]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the exception for the target imputer to fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, but couldn't you do
just to shorten/simplify slightly?