Updates to support Woodwork 0.5.1 #2610

chukarsten · 2021-08-09T18:33:53Z

codecov · 2021-08-09T19:06:59Z

Codecov Report

Merging #2610 (4a4af75) into main (4eee441) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2610     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        297     297             
  Lines      27033   27071     +38     
=======================================
+ Hits       26989   27027     +38     
  Misses        44      44

Impacted Files	Coverage Δ
evalml/data_checks/invalid_targets_data_check.py	`100.0% <ø> (ø)`
...ta_checks_tests/test_invalid_targets_data_check.py	`100.0% <ø> (ø)`
...components/transformers/imputers/target_imputer.py	`100.0% <100.0%> (ø)`
evalml/pipelines/utils.py	`99.2% <100.0%> (ø)`
evalml/tests/component_tests/test_components.py	`100.0% <100.0%> (ø)`
...valml/tests/component_tests/test_simple_imputer.py	`100.0% <100.0%> (ø)`
...valml/tests/component_tests/test_target_imputer.py	`100.0% <100.0%> (ø)`
evalml/tests/pipeline_tests/test_pipeline_utils.py	`100.0% <100.0%> (ø)`
evalml/tests/utils_tests/test_woodwork_utils.py	`100.0% <100.0%> (ø)`
evalml/utils/woodwork_utils.py	`100.0% <100.0%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4eee441...4a4af75.

thehomebrewnerd

Was a bit curious about the updates required for WW so took a quick look at this. Noticed a couple things that I thought I'd mention. Feel free to ignore or act upon as you see fit.

evalml/pipelines/utils.py

chukarsten · 2021-08-10T19:42:21Z

evalml/pipelines/components/transformers/imputers/target_imputer.py

@@ -76,7 +76,10 @@ def fit(self, X, y):
        """
        if y is None:
            return self
-        y = infer_feature_types(y).to_frame()
+        y = infer_feature_types(y)


Moved the exception for the target imputer to fit.

chukarsten · 2021-08-10T19:45:04Z

evalml/tests/component_tests/test_components.py

@@ -973,7 +973,7 @@ def fit(self, X, y):
            return self

        def predict(self, X):
-            series = pd.Series()
+            series = pd.Series(dtype="string")


I think this change was to accommodate the way empty series are now inferred. Woodwork complains if you don't do this.
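
A minimal sketch of the pandas side of this change, assuming only that pandas is installed. The helper name `empty_string_series` is hypothetical and Woodwork itself is not called here; the point is only that an explicit dtype leaves nothing to be inferred from an empty series.

```python
import pandas as pd


def empty_string_series() -> pd.Series:
    """Hypothetical helper: build an empty series with an explicit string dtype."""
    # Passing dtype explicitly means the dtype does not depend on the (absent) values.
    return pd.Series(dtype="string")


series = empty_string_series()
print(len(series), series.dtype)
```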

chukarsten · 2021-08-10T19:50:19Z

evalml/tests/component_tests/test_lsa.py

@@ -72,15 +72,17 @@ def test_some_missing_col_names(text_df, caplog):
    }


-def test_lsa_empty_text_column():
-    X = pd.DataFrame({"col_1": []})
+@pytest.mark.parametrize(


Tried to expand the coverage here to cover other commonly "empty" columns. Also, the TextFeaturizer and LSA components have code that seems to expect tolerance of an empty column.

e.g.

    def fit(self, X, y=None):
        X = infer_feature_types(X)
        self._text_columns = self._get_text_columns(X)
        if len(self._text_columns) == 0:
            return self

I thought it was strange that we would pass through an empty column and expect sklearn to raise the original ValueError, while also having code here that seemingly accounts for what LSA (and TextFeaturizer) should do upon receiving an empty column. I wonder if, perhaps, the original intent was to account for two distinct cases: 1.) an empty column whose type is known to be a string or natural language and 2.) an empty column whose type is unknown. I'd be happy to hear additional input here.

Yeah I think the purpose of X = infer_feature_types(X, {"col_1": "NaturalLanguage"}) was to verify that even an empty column whose type is NaturalLanguage will be identified by self._get_text_columns(X) and will not return self immediately. And since transform looks for that as well

if len(self._text_columns) == 0: return X_ww

sklearn would be called to transform the features and raise an error. If X = infer_feature_types(X, {"col_1": "NaturalLanguage"}) is removed, no text features are recognized and self is returned immediately.

Can't speak to the original reasoning but it looks like those are the cases being presented.
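
The two cases can be sketched with a toy stand-in, under stated assumptions: `ToyTextComponent`, its `logical_types` dict, and the `fitted_on_text` flag are all hypothetical and only mimic the early-return logic quoted above. This is not the real LSA or TextFeaturizer code, and Woodwork is replaced by a plain dict of type names.

```python
import pandas as pd


class ToyTextComponent:
    """Hypothetical stand-in that mimics the early return in fit()."""

    def __init__(self, logical_types=None):
        # Maps column name -> type name; stands in for Woodwork typing information.
        self.logical_types = logical_types or {}
        self._text_columns = []
        self.fitted_on_text = False

    def _get_text_columns(self, X):
        return [
            col for col in X.columns
            if self.logical_types.get(col) == "NaturalLanguage"
        ]

    def fit(self, X, y=None):
        self._text_columns = self._get_text_columns(X)
        if len(self._text_columns) == 0:
            # Case 2: no column is typed as text, so fit is a no-op.
            return self
        # Case 1: a (possibly empty) text column exists; the real component
        # would hand it to sklearn here, which may then raise.
        self.fitted_on_text = True
        return self


X = pd.DataFrame({"col_1": []})
untyped = ToyTextComponent().fit(X)
typed = ToyTextComponent({"col_1": "NaturalLanguage"}).fit(X)
print(untyped.fitted_on_text, typed.fitted_on_text)
```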

chukarsten · 2021-08-10T19:52:08Z

evalml/tests/pipeline_tests/test_pipeline_utils.py

@@ -64,8 +64,8 @@ def _get_test_data_from_configuration(
                    "[email protected]",
                    "[email protected]",
                    "[email protected]",
-                    "$titanic_data%&@hotmail.com",
-                    "foo*[email protected]",
+                    "[email protected]",


I submitted this issue to Woodwork to cover these email addresses which slipped through the WW EmailAddress inference. @davesque since I saw you did the Email inference.

@chukarsten Yeah, never seen email addresses like that before :). I think it's safe to delete them from test data to accommodate the woodwork update.

chukarsten · 2021-08-10T19:55:38Z

evalml/tests/pipeline_tests/test_pipeline_utils.py

-                if "email" in column_names and input_type == "ww"
-                else []
-            )
+            email_featurizer = [EmailFeaturizer] if "email" in column_names else []


This is the change required for Email inference in WW.

evalml/pipelines/utils.py

angela97lin

I think this is good to go, thanks Karsten! Agreed w/ Freddy on the double DropColumn, I think we're set on AutoML side but it's maybe worth discussing outside this context why we have AutoML do this in the first place :P

evalml/pipelines/utils.py

bchen1116

Looking good! I left a few questions/nits, but agreed with @freddyaboulton that we should try not to append 2 DropColumn components to the pipeline, especially since it's likely we only set 1 of them.

evalml/data_checks/invalid_targets_data_check.py

bchen1116 · 2021-08-11T15:12:51Z

evalml/pipelines/components/transformers/imputers/target_imputer.py

@@ -76,7 +76,10 @@ def fit(self, X, y):
        """
        if y is None:
            return self
-        y = infer_feature_types(y).to_frame()
+        y = infer_feature_types(y)


nit, but couldn't you do

    y = infer_feature_types(y).to_frame()
    if all(y.isnull()):
        raise TypeError("Provided target full of nulls.")

just to shorten/simplify slightly?
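
A minimal sketch of an all-null guard in plain pandas, not the actual TargetImputer code; `check_target_not_all_null` is a hypothetical name. One caveat worth stating: the built-in `all(...)` applied to a DataFrame iterates over its column labels rather than its values, so this sketch checks the Series with `.isnull().all()` before converting it to a frame.

```python
import pandas as pd


def check_target_not_all_null(y: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: raise if every target value is null, else return a one-column frame."""
    # Series.isnull().all() reduces over the values, unlike built-in all() on a DataFrame.
    if y.isnull().all():
        raise TypeError("Provided target full of nulls.")
    return y.to_frame()


frame = check_target_not_all_null(pd.Series([1.0, None]))
print(frame.shape)
```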

evalml/pipelines/utils.py

freddyaboulton · 2021-08-11T15:47:19Z

evalml/utils/woodwork_utils.py

        return ww.init_series(data, logical_type=feature_types)
    else:
        ww_data = data.copy()
+        # Revert the inference of all nulls to the unknown type and change it back to double.
+        all_null_cols = ww_data.columns[ww_data.isnull().all(0)]


I don't think this will work if ww is initialized before being passed into one of our components?

I think this might also cause a problem with partial dependence but I have not verified. Down to talk about it after stand-up!

I handled that specific case. Feel free to check it out. I was considering maybe adding a similar test for the target imputer with the y series being pre-inited.
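
The reversion discussed in this thread can be sketched roughly as follows. This is an illustration in plain pandas and NumPy, not the code in `infer_feature_types`; the function name `revert_all_null_columns` is hypothetical, and the Woodwork typing step (including the pre-initialized case raised above) is left out entirely.

```python
import numpy as np
import pandas as pd


def revert_all_null_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn columns that are entirely null into float64 NaN columns."""
    out = df.copy()
    # Same selection as in the diff: columns where every row is null.
    all_null_cols = out.columns[out.isnull().all(axis=0)]
    for col in all_null_cols:
        # Assigning np.nan replaces pd.NA/None values and yields a float64 column.
        out[col] = np.nan
    return out


df = pd.DataFrame({"a": [pd.NA, pd.NA], "b": [1, 2]})
result = revert_all_null_columns(df)
print(result.dtypes)
```

Note that on a frame with zero rows, `isnull().all(axis=0)` is vacuously true for every column, so a real implementation would need to decide how empty frames should be treated.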

… to accommodate the new check for Unknown in get_pp_components. Made Email get treated properly in testing as WW should infer it properly now. Made infer_feature_types replace all pd.NA with np.nan for series as well as dataframes.

freddyaboulton

Thank you for your work on this @chukarsten !!

evalml/utils/woodwork_utils.py

evalml/tests/data_checks_tests/test_invalid_targets_data_check.py

freddyaboulton · 2021-08-12T14:58:19Z

evalml/tests/pipeline_tests/test_pipeline_utils.py

@@ -64,8 +64,8 @@ def _get_test_data_from_configuration(
                    "[email protected]",
                    "[email protected]",
                    "[email protected]",
-                    "$titanic_data%&@hotmail.com",
-                    "foo*[email protected]",
+                    "[email protected]",


Agreed. I think alteryx/woodwork#1080 will help cover against users passing impossible email values by manually specifying the email type.

bchen1116

Nice! I like the new test! left one comment on a typo, but LGTM

bchen1116 · 2021-08-12T18:41:37Z

evalml/tests/utils_tests/test_woodwork_utils.py

+    ),
+)
+def test_infer_feature_types_NA_to_nan(null_col, already_inited):
+    """A short test to make sure that columnds with all null values


typo: columns

chukarsten force-pushed the ww_051_updates branch from b641db8 to 0d98715 Compare August 9, 2021 19:02

chukarsten changed the title ~~Ww 051 updates~~ Updates to support Woodwork 0.5.1 Aug 9, 2021

thehomebrewnerd reviewed Aug 10, 2021

chukarsten force-pushed the ww_051_updates branch from e18f41b to 004f752 Compare August 10, 2021 16:40

chukarsten marked this pull request as ready for review August 10, 2021 18:35

auto-assign bot assigned chukarsten Aug 10, 2021

chukarsten commented Aug 10, 2021

chukarsten requested review from angela97lin, freddyaboulton, dsherry, bchen1116, christopherbunn, eccabay, jeremyliweishih and ParthivNaresh August 10, 2021 20:31

freddyaboulton reviewed Aug 10, 2021

angela97lin approved these changes Aug 11, 2021

bchen1116 requested changes Aug 11, 2021

freddyaboulton suggested changes Aug 11, 2021

chukarsten force-pushed the ww_051_updates branch from 15f5c1e to 6da3e03 Compare August 12, 2021 04:31

chukarsten added 9 commits August 12, 2021 00:56

Fixed target imputer.

7544dde

Fixed test_components.py

84ab5dd

Fixed the invalidtarget datacheck to allow for the new Unknown type.

9726e23

Updated the preprocessing components to not attempt to double drop Null columns as null columns are now inferred to Unknown type.

23d0579

Fill pd.NA with np.nan in reversion to numerical nans. Modified tests for text featurizers to realize they have empty columns and adopt that behavior.

1f2f699

Lint.

371c58e

Release.

4f36996

Changed the target imputer to just look for all nulls in the target and raise.

ad3270b

chukarsten added 8 commits August 12, 2021 00:56

Set lower limit on WW to 0.5.1.

d60b1b1

Pinned WW to 0.5.1

5e0ced2

Bumped the min core reqs to ww 0.5.1.

b51a46c

Addressed Nate's comments.

17010a5

Added a test to address Freddy's concern and modified the reversion in infer_feature_types.

aed9b6f

Refactored the reversion of all-null Unknown columns into its own function in infer_feature_types.

71d0fdc

Reverted the text featurizer and LSA tests.

2399fe5

Reverted the additional DropColumn transformer change.

f757ba4

chukarsten force-pushed the ww_051_updates branch from 6da3e03 to f757ba4 Compare August 12, 2021 04:56

Fixed invalid target datacheck.

13c589b

freddyaboulton approved these changes Aug 12, 2021

Update for explicit testing of infer_feature_types.

2543a72

bchen1116 approved these changes Aug 12, 2021

chukarsten added 3 commits August 12, 2021 14:55

Addressed comments.

dc0f82f

Trigger build.

115fb6b

Updated latest_dep_versions.txt

4a4af75

chukarsten merged commit 78833f0 into main Aug 12, 2021

chukarsten mentioned this pull request Aug 12, 2021

Release v0.30.1 #2623

Closed

freddyaboulton mentioned this pull request Aug 16, 2021

Spike: Investigate increase in fit time for KDDCup dataset #2642

Closed

freddyaboulton deleted the ww_051_updates branch May 13, 2022 15:20
