Change treatment of generic column type object #1415

Merged
merged 9 commits into from
Mar 15, 2022
18 changes: 8 additions & 10 deletions autosklearn/data/feature_validator.py
@@ -1,6 +1,7 @@
from typing import Dict, List, Optional, Tuple, Union, cast

import logging
+import warnings

import numpy as np
import pandas as pd
@@ -304,16 +305,13 @@ def get_feat_type_from_columns(
# TypeError: data type not understood in certain pandas types
elif not is_numeric_dtype(X[column]):
if X[column].dtype.name == "object":
-raise ValueError(
-f"Input Column {column} has invalid type object. "
-"Cast it to a valid dtype before using it in Auto-Sklearn. "
-"Valid types are numerical, categorical or boolean. "
-"You can cast it to a valid dtype using "
-"pandas.Series.astype ."
-"If working with string objects, the following "
-"tutorial illustrates how to work with text data: "
-"https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html" # noqa: E501
+warnings.warn(
+f"Input Column {column} has generic type object. "
+f"Autosklearn will treat this column as string. "
+f"Please ensure that this setting is suitable for your task."
)
+X[column] = X[column].astype("string")
Contributor

Does this work for random objects? Could we have a test that the feature validator correctly handles random objects? In general, could you please extend the tests under test/test_data/test_feature_validator.py?

Collaborator Author

I will check the behavior and then update test/test_data/test_feature_validator.py.

Collaborator Author

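import pandas as pd

# Minimal arbitrary (non-string) object to store in DataFrame cells
class Dummy:
    def __init__(self, x):
        self.x = x
    def __call__(self):
        print(self.x)
    def dummy_func(self):
        for i in range(100):
            print("do something 100 times")

dummy = Dummy(1)
dummy_2 = Dummy(2)
dummy_3 = Dummy(3)
# One column holds a single object, the other a list of objects; both get dtype object
df = pd.DataFrame({"Test object": [dummy], "Test list of objects": [[dummy_2, dummy_3]]})
df.info()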
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Test object           1 non-null      object
 1   Test list of objects  1 non-null      object
dtypes: object(2)
memory usage: 144.0+ bytes
df = df.astype({"Test object": "string", "Test list of objects": "string"})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Test object           1 non-null      string
 1   Test list of objects  1 non-null      string
dtypes: string(2)
memory usage: 144.0 bytes
print(df)
                                 Test object                               Test list of objects
0  <__main__.Dummy object at 0x7fcfbcf880a0>  [<__main__.Dummy object at 0x7fcff7cb2ac0>, <_...
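
The behavior checked above can also be sketched as a standalone function that mirrors the changed branch of the diff: warn when a column has dtype object, then cast it to the pandas string dtype. This is only an illustration and not the Auto-Sklearn implementation; the helper name cast_object_columns_to_string, the shortened warning text, and the column names are assumptions made for the example.

```python
import warnings

import pandas as pd


def cast_object_columns_to_string(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: warn about object columns and cast them to "string"."""
    X = X.copy()  # leave the caller's DataFrame untouched
    for column in X.columns:
        if X[column].dtype.name == "object":
            warnings.warn(
                f"Input Column {column} has generic type object. "
                "It will be treated as string."
            )
            X[column] = X[column].astype("string")
    return X


class Dummy:
    """Arbitrary non-string object, as in the experiment above."""

    def __init__(self, x):
        self.x = x


df = pd.DataFrame({"obj": [Dummy(1)], "num": [1.5]})

# Record warnings so the example can inspect what was emitted
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    converted = cast_object_columns_to_string(df)

object_warnings = [
    str(w.message) for w in caught if "generic type object" in str(w.message)
]
```

Only the object column triggers a warning and is converted; the numeric column keeps its dtype. A test in test/test_data/test_feature_validator.py could assert the same two properties against the real validator.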

+feat_type[column] = "string"
elif pd.core.dtypes.common.is_datetime_or_timedelta_dtype(
X[column].dtype
):
@@ -327,7 +325,7 @@ def get_feat_type_from_columns(
else:
raise ValueError(
"Input Column {} has unsupported dtype {}. "
-"Supported column types are categorical/bool/numerical dtypes. "
+"Supported column types are categorical/bool/numerical/string dtypes. " # noqa: E501
Contributor

I don't think this is necessary. Please reformat the string so the lines fit within the line limit.

Collaborator Author

I will change it.
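
One way to follow the reviewer's suggestion is to split the long literal across more lines, relying on Python's implicit concatenation of adjacent string literals, so that no noqa marker is needed. The sketch below is only one possible layout, not necessarily the one that was merged; the values of column and dtype_name are made up for the example.

```python
# Hypothetical values, used only to render the message
column, dtype_name = "my_column", "complex128"

# Adjacent string literals are joined at compile time, so .format()
# applies to the whole concatenated message, not just the last piece.
message = (
    "Input Column {} has unsupported dtype {}. "
    "Supported column types are "
    "categorical/bool/numerical/string dtypes. "
    "Make sure your data is formatted in a correct way, "
    "before feeding it to Auto-Sklearn.".format(column, dtype_name)
)

print(message)
```

Each source line stays well under a typical line-length limit while the rendered message is unchanged.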

"Make sure your data is formatted in a correct way, "
"before feeding it to Auto-Sklearn.".format(
column,