Change treatment of generic column type `object` #1415

Louquinze · 2022-03-03T13:08:21Z

This PR changes the treatment of columns with dtype `object`. These columns will now be treated as `string`.

…ng/feature_reduction` to `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction`. Also rename the corresponding feature reduction class `FeatureReduction` to `TextFeatureReduction` (`auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction/truncated_svd.py:TextFeatureReduction`). This includes adapting all `*csv` and `*json` files participating in metalearning. The "real" changes are limited to (1) `truncated_svd.py` and (2) `feature_type_text.py`.

The `object` type will be treated as `string` in the future.
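As a rough illustration of what "treating `object` as `string`" means at the pandas level, here is a minimal sketch. The helper name `object_columns_to_string` is hypothetical and is not part of auto-sklearn; it only mirrors the `astype("string")` cast that appears in the diff under review.

```python
import pandas as pd


def object_columns_to_string(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with every `object` column cast to the `string` dtype.

    Hypothetical helper for illustration only; not auto-sklearn's API.
    """
    X = X.copy()
    for column in X.columns:
        if pd.api.types.is_object_dtype(X[column]):
            X[column] = X[column].astype("string")
    return X


# Build the text column with an explicit object dtype so the example does not
# depend on how a given pandas version infers the dtype of Python strings.
df = pd.DataFrame(
    {
        "review": pd.Series(["good", "bad"], dtype=object),
        "rating": [5, 1],
    }
)
converted = object_columns_to_string(df)
print(converted.dtypes)
```

Only the `object` column is touched; numerical columns keep their dtype, and the input frame is left unchanged because the helper works on a copy.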

codecov · 2022-03-03T14:47:07Z

Codecov Report

Merging #1415 (8d9e159) into development (457e50c) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@               Coverage Diff               @@
##           development    #1415      +/-   ##
===============================================
+ Coverage        84.51%   84.52%   +0.01%     
===============================================
  Files              146      146              
  Lines            11283    11285       +2     
  Branches          1929     1929              
===============================================
+ Hits              9536     9539       +3     
+ Misses            1232     1230       -2     
- Partials           515      516       +1


mfeurer · 2022-03-03T16:05:27Z

autosklearn/data/feature_validator.py

@@ -327,7 +325,7 @@ def get_feat_type_from_columns(
                else:
                    raise ValueError(
                        "Input Column {} has unsupported dtype {}. "
-                        "Supported column types are categorical/bool/numerical dtypes. "
+                        "Supported column types are categorical/bool/numerical/string dtypes. "  # noqa: E501


I don't think this is necessary. Please reformat the string so the lines fit within the line limit.

I'll change it.
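One way to follow the suggestion above without `# noqa: E501` is to split the message across adjacent string literals, which Python joins at compile time. This is only a sketch: the function `unsupported_dtype_message` is a hypothetical wrapper, and it reproduces just the two message lines visible in the diff (the remainder of the original message is not shown there and is left out).

```python
def unsupported_dtype_message(column: str, dtype: str) -> str:
    """Build the error text from short literals (hypothetical helper)."""
    # Adjacent string literals inside the parentheses are concatenated by the
    # compiler, so each source line stays within the line limit.
    return (
        "Input Column {} has unsupported dtype {}. "
        "Supported column types are "
        "categorical/bool/numerical/string dtypes."
    ).format(column, dtype)


print(unsupported_dtype_message("a", "object"))
```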

mfeurer · 2022-03-03T16:07:11Z

autosklearn/data/feature_validator.py

                    )
+                    X[column] = X[column].astype("string")


Does this work for random objects? Could we have a test that the feature validator correctly handles random objects? In general, could you please extend the tests under test/test_data/test_feature_validator.py?

I will check the behavior and then update `test/test_data/test_feature_validator.py`.

    import pandas as pd

    class Dummy:
        def __init__(self, x):
            self.x = x

        def __call__(self):
            print(self.x)

        def dummy_func(self):
            for i in range(100):
                print("do something 100 times")

    dummy = Dummy(1)
    dummy_2 = Dummy(2)
    dummy_3 = Dummy(3)

    df = pd.DataFrame(
        {"Test object": [dummy], "Test list of objects": [[dummy_2, dummy_3]]}
    )
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 2 columns):
     #   Column                Non-Null Count  Dtype
    ---  ------                --------------  -----
     0   Test object           1 non-null      object
     1   Test list of objects  1 non-null      object
    dtypes: object(2)
    memory usage: 144.0+ bytes

    df = df.astype({"Test object": "string", "Test list of objects": "string"})
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 2 columns):
     #   Column                Non-Null Count  Dtype
    ---  ------                --------------  -----
     0   Test object           1 non-null      string
     1   Test list of objects  1 non-null      string
    dtypes: string(2)
    memory usage: 144.0 bytes

    print(df)

                                     Test object                               Test list of objects
    0  <__main__.Dummy object at 0x7fcfbcf880a0>  [<__main__.Dummy object at 0x7fcff7cb2ac0>, <_...
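The check above could be turned into an automated test along the following lines. This is a pandas-only sketch: it does not call auto-sklearn's feature validator (whose interface is not shown in this thread), and the test name and column names are made up for illustration. A real test in `test/test_data/test_feature_validator.py` would presumably go through the validator itself.

```python
import pandas as pd


class Dummy:
    """Arbitrary non-string object standing in for 'random objects'."""

    def __init__(self, x):
        self.x = x


def test_object_columns_become_string():
    df = pd.DataFrame(
        {
            "obj": [Dummy(1), Dummy(2)],
            "objs": [[Dummy(3)], [Dummy(4)]],
        }
    )
    # Both columns start out with the generic object dtype.
    assert all(pd.api.types.is_object_dtype(df[c]) for c in df.columns)

    converted = df.astype("string")

    for column in converted.columns:
        # Every column now has the string dtype and holds plain str values.
        assert isinstance(converted[column].dtype, pd.StringDtype)
        assert all(isinstance(value, str) for value in converted[column])

    # Without a custom __repr__/__str__, the text is the default object repr.
    assert "Dummy object at" in converted["obj"].iloc[0]


test_object_columns_become_string()
```

The last assertion points at a practical consequence: objects without a meaningful `__str__` become strings containing memory addresses, which carry little signal for a text pipeline.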


eddiebergman

Looks good to me, include a docstring if you like, otherwise I'll merge :)

Louquinze · 2022-03-15T14:08:55Z

> Looks good to me, include a docstring if you like, otherwise I'll merge :)

Just merge; I will add the docstring in a new PR.

I'm currently on vacation, so I cannot merge the PR.

* rename `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/feature_reduction` to `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction`. Also rename the corresponding feature reduction class `FeatureReduction` to `TextFeatureReduction` (`auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction/truncated_svd.py:TextFeatureReduction`). This includes adapting all `*csv` and `*json` files participating in metalearning. The "real" changes are limited to (1) `truncated_svd.py` and (2) `feature_type_text.py`.
* change treatment of generic column dtype `object` for pandas dataframes. The `object` type will be treated as `string` in the future.
* change treatment of generic column dtype `object` for pandas dataframes. The `object` type will be treated as `string` in the future. add new test case to `test_feature_validator.py`

Louquinze and others added 4 commits March 2, 2022 12:23

change treatment of generic column dtype object for pandas dataframes.

0560ee7

The `object` type will be treated as `string` in the future.

Merge branch 'automl:development' into development

acca511

Louquinze requested a review from mfeurer March 3, 2022 13:14

change treatment of generic column dtype object for pandas dataframes.

da91a88

The `object` type will be treated as `string` in the future.

change treatment of generic column dtype object for pandas dataframes.

9431b99

The `object` type will be treated as `string` in the future.

mfeurer previously requested changes Mar 3, 2022


change treatment of generic column dtype object for pandas dataframes.

7587a7f

The `object` type will be treated as `string` in the future. add new test case to `test_feature_validator.py`

Louquinze requested a review from mfeurer March 4, 2022 10:58

Louquinze added 2 commits March 4, 2022 13:07

change treatment of generic column dtype object for pandas dataframes.

b00a250

The `object` type will be treated as `string` in the future. add new test case to `test_feature_validator.py`

change treatment of generic column dtype object for pandas dataframes.

8d9e159

The `object` type will be treated as `string` in the future. add new test case to `test_feature_validator.py`

Louquinze requested a review from eddiebergman March 15, 2022 13:09

eddiebergman approved these changes Mar 15, 2022


Louquinze removed the request for review from mfeurer March 15, 2022 14:12

eddiebergman merged commit d6b90f1 into automl:development Mar 15, 2022

github-actions bot pushed a commit that referenced this pull request Mar 15, 2022

Lukas Strack: Change treatment of generic column type object (#1415)

c1510bd
