Upgrade Woodwork to 0.6.0 #2690

Merged: 45 commits merged into main from woodwork-upgrade-0.6.0 on Aug 31, 2021

Commits
03e129d
release notes
ParthivNaresh Aug 24, 2021
0dac5a0
Upgrade woodwork versions
ParthivNaresh Aug 24, 2021
cdcea2c
extend length of columns to identify as categorical
ParthivNaresh Aug 25, 2021
8ef3541
Merge branch 'main' into woodwork-upgrade-0.6.0
ParthivNaresh Aug 25, 2021
ab4d80e
data checks updated
ParthivNaresh Aug 25, 2021
d2d045e
more component tests
ParthivNaresh Aug 25, 2021
7f1f92a
lgbm updates
ParthivNaresh Aug 25, 2021
eba1d4b
no message
ParthivNaresh Aug 26, 2021
9359556
Merge branch 'main' into woodwork-upgrade-0.6.0
ParthivNaresh Aug 26, 2021
22c79c1
model understanding updates
ParthivNaresh Aug 26, 2021
7db5787
imputer fixes
ParthivNaresh Aug 26, 2021
df4edc7
Merge branch 'main' into woodwork-upgrade-0.6.0
ParthivNaresh Aug 26, 2021
f3d4279
Merge branch 'main' into woodwork-upgrade-0.6.0
ParthivNaresh Aug 27, 2021
240655d
test target encoder and invalid target data checks
ParthivNaresh Aug 27, 2021
708ea7b
one hot encoder updates
ParthivNaresh Aug 27, 2021
63eb23a
more ohe
ParthivNaresh Aug 27, 2021
86ca452
segmentation fault
ParthivNaresh Aug 27, 2021
7871c74
lgbm, per column, simple imputer
ParthivNaresh Aug 27, 2021
c336f9d
imputer and partial dependence
ParthivNaresh Aug 30, 2021
1fd6231
pip install scikit-learn
ParthivNaresh Aug 30, 2021
b086a16
install woodwork
ParthivNaresh Aug 30, 2021
370f337
no message
ParthivNaresh Aug 30, 2021
64868af
test_explainers
ParthivNaresh Aug 30, 2021
6b33064
plotly update
ParthivNaresh Aug 30, 2021
d56a5ca
partial dependence
ParthivNaresh Aug 30, 2021
77374a9
lint fixes
ParthivNaresh Aug 30, 2021
644fa65
lgbm, partial dep, permutation importance
ParthivNaresh Aug 30, 2021
a3c5766
lint fixes
ParthivNaresh Aug 30, 2021
b9cdc83
delayed features
ParthivNaresh Aug 30, 2021
9294376
Merge branch 'main' into woodwork-upgrade-0.6.0
ParthivNaresh Aug 30, 2021
eb0cca3
email featurizer fix
ParthivNaresh Aug 30, 2021
829c169
Merge branch 'main' into woodwork-upgrade-0.6.0
ParthivNaresh Aug 30, 2021
3fb872e
per column imputer
ParthivNaresh Aug 30, 2021
d423bf3
Merge branch 'main' into woodwork-upgrade-0.6.0
ParthivNaresh Aug 31, 2021
8130180
change fraud100
ParthivNaresh Aug 31, 2021
3b68cab
permutation importance
ParthivNaresh Aug 31, 2021
9128d9c
model_understanding docs update
ParthivNaresh Aug 31, 2021
6fcf205
data check update
ParthivNaresh Aug 31, 2021
caefd12
update objectives
ParthivNaresh Aug 31, 2021
a562c67
test updates
ParthivNaresh Aug 31, 2021
482a0d6
more updates
ParthivNaresh Aug 31, 2021
e0ddd56
Merge branch 'main' into woodwork-upgrade-0.6.0
chukarsten Aug 31, 2021
3a353ea
featuretools upgrade
ParthivNaresh Aug 31, 2021
209a020
Merge branch 'woodwork-upgrade-0.6.0' of https://github.com/alteryx/e…
ParthivNaresh Aug 31, 2021
61c91ee
lint fix
ParthivNaresh Aug 31, 2021
2 changes: 1 addition & 1 deletion core-requirements.txt
@@ -12,7 +12,7 @@ psutil>=5.6.6
requirements-parser>=0.2.0
shap>=0.36.0
texttable>=1.6.2
woodwork==0.5.1
woodwork==0.6.0
dask>=2.12.0
featuretools>=0.26.1
nlp-primitives>=1.1.0
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -6,6 +6,7 @@ Release Notes
* Integrated ``DefaultAlgorithm`` into ``AutoMLSearch`` :pr:`2634`
* Removed SVM "linear" and "precomputed" kernel hyperparameter options, and improved default parameters :pr:`2651`
* Updated ``ComponentGraph`` initialization to raise ``ValueError`` when user attempts to use ``.y`` for a component that does not produce a tuple output :pr:`2662`
* Updated to support Woodwork 0.6.0 :pr:`2690`
* Updated pipeline ``graph()`` to distinguish X and y edges :pr:`2654`
* Added ``DropRowsTransformer`` component :pr:`2692`
* Added ``DROP_ROWS`` to ``_make_component_list_from_actions`` and clean up metadata :pr:`2694`
3 changes: 2 additions & 1 deletion docs/source/user_guide/model_understanding.ipynb
@@ -154,7 +154,8 @@
"outputs": [],
"source": [
"X_fraud, y_fraud = evalml.demos.load_fraud(100, verbose=False)\n",
"X_fraud.ww.init(logical_types={\"provider\": \"Categorical\", 'region': \"Categorical\"})\n",
"X_fraud.ww.init(logical_types={\"provider\": \"Categorical\", 'region': \"Categorical\",\n",
" \"currency\": \"Categorical\", \"expiration_date\": \"Categorical\"})\n",
"\n",
"fraud_pipeline = BinaryClassificationPipeline([\"DateTime Featurization Component\",\"One Hot Encoder\", \"Random Forest Classifier\"])\n",
"fraud_pipeline.fit(X_fraud, y_fraud)\n",
3 changes: 2 additions & 1 deletion docs/source/user_guide/objectives.ipynb
@@ -69,7 +69,8 @@
"from evalml.objectives import F1\n",
"\n",
"X, y = load_fraud(n_rows=100)\n",
"X.ww.init(logical_types={\"provider\": \"Categorical\", \"region\": \"Categorical\"})\n",
"X.ww.init(logical_types={\"provider\": \"Categorical\", \"region\": \"Categorical\",\n",
" \"currency\": \"Categorical\", \"expiration_date\": \"Categorical\"})\n",
"objective = F1()\n",
"pipeline = BinaryClassificationPipeline(component_graph=['Simple Imputer', 'DateTime Featurization Component', 'One Hot Encoder', 'Random Forest Classifier'])\n",
"pipeline.fit(X, y)\n",
1 change: 1 addition & 0 deletions evalml/model_understanding/permutation_importance.py
@@ -293,6 +293,7 @@ def _shuffle_and_score_helper(
col = X_permuted.iloc[shuffling_idx, col_idx]
col.index = X_permuted.index
X_permuted.iloc[:, col_idx] = col
X_permuted.ww.init(schema=X_features.ww.schema)
Comment on lines 295 to +296

Contributor: Is the schema being invalidated by setting the col? Should we file a WW issue for this?

Contributor: Modifying the dataframe outside of Woodwork, as is done with X_permuted.iloc[:, col_idx] = col, always carries the risk of invalidating the schema. That said, I don't think it is currently possible to do this type of assignment through Woodwork, i.e. X_permuted.ww.iloc[:, col_idx] = col. I'm not sure how easy it would be to implement, but we could look into it if needed.

Contributor: By the way, based on my limited testing, as long as the new values don't change the column dtype, the schema should not be invalidated by this type of assignment outside of WW.

ParthivNaresh marked this conversation as resolved.
if is_fast:
feature_score = scorer(pipeline, X_permuted, X_features, y, objective)
else:
@@ -31,7 +31,7 @@ def _check_input_for_columns(self, X):

missing_cols = set(cols) - set(column_names)
if missing_cols:
raise ValueError("Columns of type {column_types} not found in input data.")
raise ValueError(f"Columns of type {missing_cols} not found in input data.")

@abstractmethod
def _modify_columns(self, cols, X, y=None):
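A minimal illustration of the string-formatting bug fixed above: the old message was a plain string, so its placeholder was emitted verbatim, and it also named a variable (column_types) other than the one in scope (missing_cols). The sample value below is invented.

```python
# Stand-in for the value computed just before the raise in the diff.
missing_cols = {"Categorical"}

# Old: no f-prefix, so the braces are printed literally.
broken = "Columns of type {column_types} not found in input data."

# New: f-string interpolating the variable that actually exists here.
fixed = f"Columns of type {missing_cols} not found in input data."

print(broken)  # placeholder emitted verbatim
print(fixed)   # actual missing columns interpolated
```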
@@ -72,20 +72,13 @@ def transform(self, X, y=None):

es = self._make_entity_set(X_ww)
features = ft.calculate_feature_matrix(features=self._features, entityset=es)

features.set_index(X_ww.index, inplace=True)

X_ww = X_ww.ww.drop(self._columns)
features.ww.init(logical_types={col_: "categorical" for col_ in features})
Contributor: So you're taking all features generated from DFS and calculate_feature_matrix, and making them categorical?

Contributor: Yes, but currently this only applies to URL and Email features. It was the old behavior, but now the problem is that ww 0.6.1 infers features created by ft as Unknown.

Contributor (Author): Yes, after discussion with @dsherry and @freddyaboulton, this is what we decided on to address the test issues for this transformer.

for col in features:
X_ww.ww[col] = features[col]

all_created_columns = self._get_feature_provenance().values()
to_categorical = {
col: "Categorical"
for feature_list in all_created_columns
for col in feature_list
}
X_ww.ww.set_types(to_categorical)
return X_ww

@staticmethod
@@ -118,6 +118,7 @@ def test_column_transformer_transform(class_to_test, checking_functions):

if class_to_test is SelectByType:
transformer = class_to_test(column_types=["categorical", "Boolean", "Integer"])
X.ww.init(logical_types={"one": "categorical"})
else:
transformer = class_to_test(columns=list(X.columns))
assert check4(X, transformer.transform(X))
@@ -175,6 +176,7 @@ def test_column_transformer_fit_transform(class_to_test, checking_functions):
assert check2(X, class_to_test(columns=["one"]).fit_transform(X))

if class_to_test is SelectByType:
X.ww.init(logical_types={"one": "categorical"})
assert check3(
X,
class_to_test(
@@ -254,6 +256,7 @@ def test_typeortag_column_transformer_ww_logical_and_semantic_types():
"four": [4.0, 2.3, 6.5, 2.6],
}
)
X.ww.init(logical_types={"one": "categorical"})

transformer = SelectByType(column_types=[ww.logical_types.Age])
with pytest.raises(ValueError, match="not found in input data"):
31 changes: 20 additions & 11 deletions evalml/tests/component_tests/test_delayed_features_transformer.py
@@ -86,6 +86,8 @@ def test_delayed_feature_extractor_maxdelay3_gap1(
answer["feature"] = X.feature.astype("int64")
if not encode_y_as_str:
answer["target_delay_0"] = y_answer.astype("int64")
else:
y = y.astype("category")

assert_frame_equal(
answer, DelayedFeatureTransformer(max_delay=3, gap=1).fit_transform(X=X, y=y)
@@ -130,6 +132,8 @@ def test_delayed_feature_extractor_maxdelay5_gap1(
"target_delay_5": y_answer.shift(5),
}
)
if encode_y_as_str:
y = y.astype("category")
if not encode_X_as_str:
answer["feature"] = X.feature.astype("int64")
assert_frame_equal(
@@ -173,6 +177,8 @@ def test_delayed_feature_extractor_maxdelay3_gap7(
"target_delay_3": y_answer.shift(3),
}
)
if encode_y_as_str:
y = y.astype("category")
if not encode_X_as_str:
answer["feature"] = X.feature.astype("int64")
assert_frame_equal(
@@ -193,15 +199,9 @@
)


@pytest.mark.parametrize("encode_X_as_str", [True, False])
ParthivNaresh marked this conversation as resolved.
@pytest.mark.parametrize("encode_y_as_str", [True, False])
def test_delayed_feature_extractor_numpy(
encode_X_as_str, encode_y_as_str, delayed_features_data
):
def test_delayed_feature_extractor_numpy(delayed_features_data):
X, y = delayed_features_data
X, X_answer, y, y_answer = encode_X_y_as_strings(
X, y, encode_X_as_str, encode_y_as_str
)
X, X_answer, y, y_answer = encode_X_y_as_strings(X, y, False, False)
X_np = X.values
y_np = y.values
answer = pd.DataFrame(
@@ -216,8 +216,7 @@ def test_delayed_feature_extractor_numpy(
"target_delay_3": y_answer.shift(3),
}
)
if not encode_X_as_str:
answer[0] = X.feature.astype("int64")

assert_frame_equal(
answer, DelayedFeatureTransformer(max_delay=3, gap=7).fit_transform(X_np, y_np)
)
@@ -264,6 +263,8 @@ def test_lagged_feature_extractor_delay_features_delay_target(
"target_delay_3": y_answer.shift(3),
}
)
if encode_y_as_str:
y = y.astype("category")
if not encode_X_as_str:
all_delays["feature"] = X.feature.astype("int64")
if not delay_features:
@@ -307,7 +308,8 @@ def test_lagged_feature_extractor_delay_target(
"target_delay_3": y_answer.shift(3),
}
)

if encode_y_as_str:
y = y.astype("category")
transformer = DelayedFeatureTransformer(
max_delay=3, gap=1, delay_features=delay_features, delay_target=delay_target
)
@@ -372,6 +374,8 @@ def test_delay_feature_transformer_supports_custom_index(

X = make_data_type(data_type, X)
y = make_data_type(data_type, y)
if encode_y_as_str:
y = y.astype("category")

assert_frame_equal(
answer, DelayedFeatureTransformer(max_delay=3, gap=7).fit_transform(X, y)
@@ -407,6 +411,7 @@ def test_delay_feature_transformer_multiple_categorical_columns(delayed_features
"target_delay_1": y_answer.shift(1),
}
)
y = y.astype("category")
assert_frame_equal(
answer, DelayedFeatureTransformer(max_delay=1, gap=11).fit_transform(X, y)
)
@@ -469,9 +474,13 @@ def test_delay_feature_transformer_woodwork_custom_overrides_returned_by_compone
dft.fit(X, y)
transformed = dft.transform(X, y)
assert isinstance(transformed, pd.DataFrame)

if logical_type == Boolean:
transformed.ww.init(logical_types={"0_delay_1": "categorical"})
transformed_logical_types = {
k: type(v) for k, v in transformed.ww.logical_types.items()
}

if logical_type in [Integer, Double, Categorical]:
assert transformed_logical_types == {
0: logical_type,