Updated NoVarianceDataCheck to return only Warnings (#3506)
* updated NoVarianceDataCheck

* updated release notes

* fix tests

* Trigger Build

* edited docs
ParthivNaresh authored May 10, 2022
1 parent d272705 commit 7ab846d
Showing 6 changed files with 24 additions and 87 deletions.
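To illustrate the behavioral change in this commit: columns (or a target) with zero or one unique value now produce `"warning"`-level messages rather than `"error"`-level ones. Below is a minimal, hypothetical sketch of the check's core logic (`check_no_variance` is an illustrative stand-in, not evalml's actual implementation):

```python
import pandas as pd

def check_no_variance(X):
    """Hypothetical sketch: flag zero/one-unique-value columns as warnings."""
    messages = []
    for col in X.columns:
        n_unique = X[col].nunique()  # NaNs are excluded from the count by default
        if n_unique <= 1:
            noun = "unique value" if n_unique == 1 else "unique values"
            messages.append({
                "message": f"'{col}' has {n_unique} {noun}.",
                "data_check_name": "NoVarianceDataCheck",
                # previously "error"; after this commit, always "warning"
                "level": "warning",
                "code": "NO_VARIANCE" if n_unique == 1 else "NO_VARIANCE_ZERO_UNIQUE",
            })
    return messages

X = pd.DataFrame({"no_var": [0, 0, 0], "ok": [1, 2, 3]})
for msg in check_no_variance(X):
    print(msg["level"], "-", msg["message"])  # → warning - 'no_var' has 1 unique value.
```

Callers that filtered the validate output on `message['level'] == 'error'` (as the docs did before this commit) will no longer see these items; filtering on `'warning'` picks them up instead.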
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -4,6 +4,7 @@
* Enhancements
* Fixes
* Changes
+ * Changed ``NoVarianceDataCheck`` to only output warnings :pr:`3506`
* Updated ``roc_curve()`` and ``conf_matrix()`` to work with IntegerNullable and BooleanNullable types. :pr:`3465`
* Changed ``ComponentGraph._transform_features`` to raise a ``PipelineError`` instead of a ``ValueError``. This is not a breaking change because ``PipelineError`` is a subclass of ``ValueError``. :pr:`3497`
* Documentation Changes
52 changes: 1 addition & 51 deletions docs/source/user_guide/data_check_actions.ipynb
@@ -99,9 +99,6 @@
"metadata": {},
"outputs": [],
"source": [
- "# let's copy the datetime at row 1 for future use\n",
- "date = X_train.iloc[1]['datetime']\n",
- "\n",
"# make row 1 all nan values\n",
"X_train.iloc[1] = [None] * X_train.shape[1]\n",
"\n",
@@ -271,15 +268,9 @@
"\n",
"# We address the errors by looking at the resulting dictionary errors listed\n",
"\n",
- "# first, let's address the `TARGET_HAS_NULL` error\n",
+ "# let's address the `TARGET_HAS_NULL` error\n",
"y_train_no_errors.fillna(False, inplace=True)\n",
"\n",
- "# here, we address the `NO_VARIANCE` error \n",
- "X_train_no_errors.drop(\"no_variance\", axis=1, inplace=True)\n",
- "\n",
- "# lastly, we address the `DATETIME_HAS_NAN` error with the date we had saved earlier\n",
- "X_train_no_errors.iloc[1, 2] = date\n",
- "\n",
"# let's reinitialize the Woodwork DataTable\n",
"X_train_no_errors.ww.init()\n",
"X_train_no_errors.head()"
@@ -301,47 +292,6 @@
"results_no_errors = search_iterative(X_train_no_errors, y_train_no_errors, problem_type='binary')\n",
"results_no_errors"
]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Comparing removing only errors versus removing both warnings and errors\n",
- "Let's see the differences in model performance when we remove only errors versus remove both warnings and errors. To do this, we compare the performance of the best pipelines on the validation data. Remember that in the search where we only address errors, we still have the `mostly_nulls` column present in the data, so we leave that column in the validation data for its respective search. We drop the other `no_variance` column from both searches.\n",
- "\n",
- "Additionally, we do some logical type setting since we had added additional noise to just the training data. This allows the data to be of the same types in both training and validation."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# drop the no_variance column\n",
- "X_valid.drop(\"no_variance\", axis=1, inplace=True)\n",
- "\n",
- "# logical type management\n",
- "X_valid.ww.init(logical_types={\"customer_present\": \"Categorical\"})\n",
- "y_valid = ww.init_series(y_valid, logical_type=\"Categorical\")\n",
- "\n",
- "best_pipeline_no_errors = results_no_errors[0].best_pipeline\n",
- "print(\"Only dropping errors:\", best_pipeline_no_errors.score(X_valid, y_valid, [\"Log Loss Binary\"]), \"\\n\")\n",
- "\n",
- "# drop the mostly_nulls column and reinitialize the DataTable\n",
- "X_valid.drop(\"mostly_nulls\", axis=1, inplace=True)\n",
- "X_valid.ww.init()\n",
- "\n",
- "best_pipeline_clean = results_cleaned[0].best_pipeline\n",
- "print(\"Addressing all actions:\", best_pipeline_clean.score(X_valid, y_valid, [\"Log Loss Binary\"]), \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can compare the differences in model performance when we address all action items (warnings and errors) in comparison to when we only address errors. While it isn't guaranteed that addressing all actions will always have better performance, we do recommend doing so since we only raise these issues when we believe the features have problems that could negatively impact or not benefit the search."
- ]
}
],
"metadata": {
20 changes: 4 additions & 16 deletions docs/source/user_guide/data_checks.ipynb
@@ -100,14 +100,10 @@
"no_variance_data_check = NoVarianceDataCheck()\n",
"messages = no_variance_data_check.validate(X, y)\n",
"\n",
- "errors = [message for message in messages if message['level'] == 'error']\n",
"warnings = [message for message in messages if message['level'] == 'warning']\n",
"\n",
"for warning in warnings:\n",
- "    print(\"Warning:\", warning['message'])\n",
- "\n",
- "for error in errors:\n",
- "    print(\"Error:\", error['message'])"
+ "    print(\"Warning:\", warning['message'])"
]
},
{
@@ -133,14 +129,10 @@
"no_variance_data_check = NoVarianceDataCheck(count_nan_as_value=True)\n",
"messages = no_variance_data_check.validate(X, y)\n",
"\n",
- "errors = [message for message in messages if message['level'] == 'error']\n",
"warnings = [message for message in messages if message['level'] == 'warning']\n",
"\n",
"for warning in warnings:\n",
- "    print(\"Warning:\", warning['message'])\n",
- "\n",
- "for error in errors:\n",
- "    print(\"Error:\", error['message'])"
+ "    print(\"Warning:\", warning['message'])"
]
},
{
@@ -643,7 +635,7 @@
"metadata": {},
"outputs": [],
"source": [
- "from evalml.data_checks import NoVarianceDataCheck, DataCheckError, DataCheckWarning\n",
+ "from evalml.data_checks import NoVarianceDataCheck, DataCheckWarning\n",
"\n",
"X = pd.DataFrame({\"no var col\": [0, 0, 0],\n",
" \"no var col with nan\": [1, np.nan, 1],\n",
@@ -653,14 +645,10 @@
"no_variance_data_check = NoVarianceDataCheck(count_nan_as_value=True)\n",
"messages = no_variance_data_check.validate(X, y)\n",
"\n",
- "errors = [message for message in messages if message['level'] == 'error']\n",
"warnings = [message for message in messages if message['level'] == 'warning']\n",
"\n",
"for warning in warnings:\n",
- "    print(\"Warning:\", warning['message'])\n",
- "\n",
- "for error in errors:\n",
- "    print(\"Error:\", error['message'])"
+ "    print(\"Warning:\", warning['message'])"
]
},
{
19 changes: 9 additions & 10 deletions evalml/data_checks/no_variance_data_check.py
@@ -3,7 +3,6 @@
DataCheck,
DataCheckActionCode,
DataCheckActionOption,
- DataCheckError,
DataCheckMessageCode,
DataCheckWarning,
)
@@ -46,7 +45,7 @@ def validate(self, X, y=None):
... {
... "message": "'First_Column' has 1 unique value.",
... "data_check_name": "NoVarianceDataCheck",
- ... "level": "error",
+ ... "level": "warning",
... "details": {"columns": ["First_Column"], "rows": None},
... "code": "NO_VARIANCE",
... "action_options": [
@@ -61,7 +60,7 @@
... {
... "message": "Y has 1 unique value.",
... "data_check_name": "NoVarianceDataCheck",
- ... "level": "error",
+ ... "level": "warning",
... "details": {"columns": ["Y"], "rows": None},
... "code": "NO_VARIANCE",
... "action_options": []
@@ -81,7 +80,7 @@
... {
... "message": "Y has 0 unique values.",
... "data_check_name": "NoVarianceDataCheck",
- ... "level": "error",
+ ... "level": "warning",
... "details": {"columns": ["Y"], "rows": None},
... "code": "NO_VARIANCE_ZERO_UNIQUE",
... "action_options":[]
@@ -96,7 +95,7 @@
... {
... "message": "'First_Column' has 1 unique value.",
... "data_check_name": "NoVarianceDataCheck",
- ... "level": "error",
+ ... "level": "warning",
... "details": {"columns": ["First_Column"], "rows": None},
... "code": "NO_VARIANCE",
... "action_options": [
@@ -111,7 +110,7 @@
... {
... "message": "Y has 1 unique value.",
... "data_check_name": "NoVarianceDataCheck",
- ... "level": "error",
+ ... "level": "warning",
... "details": {"columns": ["Y"], "rows": None},
... "code": "NO_VARIANCE",
... "action_options": []
@@ -176,7 +175,7 @@ def validate(self, X, y=None):
two_unique_with_null_message = "{} has two unique values including nulls. Consider encoding the nulls for this column to be useful for machine learning."
if zero_unique:
messages.append(
- DataCheckError(
+ DataCheckWarning(
message=zero_unique_message.format(
(", ").join(["'{}'".format(str(col)) for col in zero_unique]),
),
@@ -194,7 +193,7 @@
)
if one_unique:
messages.append(
- DataCheckError(
+ DataCheckWarning(
message=one_unique_message.format(
(", ").join(["'{}'".format(str(col)) for col in one_unique]),
),
@@ -244,7 +243,7 @@

if y_unique_count == 0:
messages.append(
- DataCheckError(
+ DataCheckWarning(
message=zero_unique_message.format(y_name),
data_check_name=self.name,
message_code=DataCheckMessageCode.NO_VARIANCE_ZERO_UNIQUE,
@@ -254,7 +253,7 @@

elif y_unique_count == 1:
messages.append(
- DataCheckError(
+ DataCheckWarning(
message=one_unique_message.format(y_name),
data_check_name=self.name,
message_code=DataCheckMessageCode.NO_VARIANCE,
8 changes: 4 additions & 4 deletions evalml/tests/data_checks_tests/test_data_checks.py
@@ -237,7 +237,7 @@ def get_expected_messages(problem_type):
)
],
).to_dict(),
- DataCheckError(
+ DataCheckWarning(
message="'all_null', 'also_all_null' has 0 unique values.",
data_check_name="NoVarianceDataCheck",
message_code=DataCheckMessageCode.NO_VARIANCE_ZERO_UNIQUE,
@@ -250,7 +250,7 @@
)
],
).to_dict(),
- DataCheckError(
+ DataCheckWarning(
message="'lots_of_null' has 1 unique value.",
data_check_name="NoVarianceDataCheck",
message_code=DataCheckMessageCode.NO_VARIANCE,
@@ -398,7 +398,7 @@ def test_default_data_checks_regression(input_type, data_checks_input_dataframe)
== expected[:3]
+ expected[4:6]
+ [
- DataCheckError(
+ DataCheckWarning(
message="Y has 1 unique value.",
data_check_name="NoVarianceDataCheck",
message_code=DataCheckMessageCode.NO_VARIANCE,
@@ -490,7 +490,7 @@ def __eq__(self, series_2):
)
],
).to_dict(),
- DataCheckError(
+ DataCheckWarning(
message="'all_null', 'also_all_null' has 0 unique values.",
data_check_name="NoVarianceDataCheck",
message_code=DataCheckMessageCode.NO_VARIANCE_ZERO_UNIQUE,
11 changes: 5 additions & 6 deletions evalml/tests/data_checks_tests/test_no_variance_data_check.py
@@ -6,7 +6,6 @@
from evalml.data_checks import (
DataCheckActionCode,
DataCheckActionOption,
- DataCheckError,
DataCheckMessageCode,
DataCheckWarning,
NoVarianceDataCheck,
@@ -42,27 +41,27 @@
data_check_name=no_variance_data_check_name,
metadata={"columns": ["feature"]},
)
- feature_0_unique = DataCheckError(
+ feature_0_unique = DataCheckWarning(
message="'feature' has 0 unique values.",
data_check_name=no_variance_data_check_name,
message_code=DataCheckMessageCode.NO_VARIANCE_ZERO_UNIQUE,
details={"columns": ["feature"]},
action_options=[drop_feature_action_option],
).to_dict()
- feature_1_unique = DataCheckError(
+ feature_1_unique = DataCheckWarning(
message="'feature' has 1 unique value.",
data_check_name=no_variance_data_check_name,
message_code=DataCheckMessageCode.NO_VARIANCE,
details={"columns": ["feature"]},
action_options=[drop_feature_action_option],
).to_dict()
- labels_0_unique = DataCheckError(
+ labels_0_unique = DataCheckWarning(
message="Y has 0 unique values.",
data_check_name=no_variance_data_check_name,
message_code=DataCheckMessageCode.NO_VARIANCE_ZERO_UNIQUE,
details={"columns": ["Y"]},
).to_dict()
- labels_1_unique = DataCheckError(
+ labels_1_unique = DataCheckWarning(
message="Y has 1 unique value.",
data_check_name=no_variance_data_check_name,
message_code=DataCheckMessageCode.NO_VARIANCE,
@@ -149,7 +148,7 @@
all_null_y_with_name,
False,
[
- DataCheckError(
+ DataCheckWarning(
message="Labels has 0 unique values.",
data_check_name=no_variance_data_check_name,
message_code=DataCheckMessageCode.NO_VARIANCE_ZERO_UNIQUE,
