SimpleImputer can raise TypeConversionError if mean or median strategy used with boolean data #4050

Open
tamargrey opened this issue Mar 6, 2023 · 1 comment

Comments

@tamargrey
Contributor

tamargrey commented Mar 6, 2023

The following code uses the mean and median strategies with boolean data. The SimpleImputer converts the values to floats and then imputes the mean or median of the data, which may well be a fractional floating point value that cannot be converted back to BooleanNullable, as the SimpleImputer currently attempts to do. Note that this is not currently reachable from AutoMLSearch, because the Imputer component keeps it from happening.

    import pandas as pd
    import pytest
    import woodwork as ww
    from evalml.pipelines.components import SimpleImputer

    for strategy in ["mean", "median"]:
        X_train = pd.DataFrame(
            {
                "fully_bool": pd.Series([True, False, True, True, True]),
                "one_nan": pd.Series([True, False, pd.NA, False, True]),
            },
        )
        X_train.ww.init(
            logical_types={
                "fully_bool": "Boolean",
                "one_nan": "BooleanNullable",
            },
        )

        imp = SimpleImputer(impute_strategy=strategy)
        imp.fit(X_train)
        # The mean and median of one_nan are both 0.5, which is not a valid boolean.
        with pytest.raises(
            ww.exceptions.TypeConversionError,
            match="Error converting datatype for one_nan from type object to type boolean.",
        ):
            imp.transform(X_train)
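
To make the failure mode concrete, here is a minimal plain-Python sketch that models the `one_nan` column as a list with `None` for the missing value. It does not use evalml or woodwork, and the helper names `impute_mean` and `to_bool_strict` are hypothetical; it only illustrates the arithmetic, not how the SimpleImputer is actually implemented.

```python
from typing import List, Optional


def impute_mean(column: List[Optional[bool]]) -> List[float]:
    """Cast booleans to floats and fill missing entries with the mean."""
    observed = [float(v) for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else float(v) for v in column]


def to_bool_strict(values: List[float]) -> List[bool]:
    """Convert floats back to booleans, rejecting anything but 0.0 or 1.0."""
    for v in values:
        if v not in (0.0, 1.0):
            raise ValueError(f"cannot convert {v!r} to boolean")
    return [v == 1.0 for v in values]


# Same values as the "one_nan" column in the reproduction above.
one_nan = [True, False, None, False, True]
imputed = impute_mean(one_nan)  # the missing entry is filled with 2 / 4

try:
    to_bool_strict(imputed)
    round_trip_ok = True
except ValueError:
    # The fractional fill value has no boolean equivalent.
    round_trip_ok = False
```

The round trip only succeeds when every observed value is the same, since that is the only case where the mean of a boolean column is exactly 0 or 1.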

We should handle this situation. We have several options for how to do this:

  • Explicitly disallow "mean" and "median" strategies for boolean values in the simple imputer - this would require adding logic that is, I assume, the reason we have a separate Imputer component in the first place.
  • Implicitly disallow "mean" and "median" strategies for boolean data in the simple imputer by noting the limitations in the docstring. This might also be a good time to make it clearer that this component expects all columns to be of the same type.
  • Change those columns' types to Double in the new_schema prior to initializing woodwork, like we do with IntegerNullable to Double. This doesn't make much sense to me, as it implies a continuous relationship between boolean values, but if there's a use case for this that I'm missing, we can consider it.
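
As a rough illustration of the first option, the check could look something like the sketch below. It is plain Python and independent of evalml: the function name `check_strategy`, the two constants, and the idea of passing logical types as a column-to-name mapping are all assumptions made for this example, not the component's actual API.

```python
from typing import Dict

# Hypothetical constants: strategies that only make sense for numeric data,
# and the logical type names that count as boolean.
NUMERIC_ONLY_STRATEGIES = {"mean", "median"}
BOOLEAN_TYPES = {"Boolean", "BooleanNullable"}


def check_strategy(impute_strategy: str, logical_types: Dict[str, str]) -> None:
    """Raise ValueError if a numeric-only strategy is requested for boolean columns."""
    if impute_strategy not in NUMERIC_ONLY_STRATEGIES:
        return
    bool_cols = sorted(
        col for col, ltype in logical_types.items() if ltype in BOOLEAN_TYPES
    )
    if bool_cols:
        raise ValueError(
            f"Cannot use {impute_strategy!r} strategy with boolean columns: {bool_cols}"
        )


types = {"fully_bool": "Boolean", "one_nan": "BooleanNullable"}
check_strategy("most_frequent", types)  # passes silently

try:
    check_strategy("mean", types)
    rejected = False
except ValueError:
    rejected = True
```

Failing early like this would replace the TypeConversionError raised during transform with an error that names the offending columns and strategy.
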
@tamargrey
Contributor Author

We should also think about this for the TargetImputer, which would have the same problem.
