Change treatment of generic column type object #1415

Merged: 9 commits, Mar 15, 2022

Louquinze
Collaborator

This PR changes the treatment of columns with dtype `object`. These columns will be treated as strings.

Louquinze and others added 4 commits March 2, 2022 12:23
…ng/feature_reduction` to `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction`.

also rename corresponding feature reduction class FeatureReduction to TextFeatureReduction.
`auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction/truncated_svd.py:TextFeatureReduction`

This includes adapting all *csv and *json participating in metalearning

The "real" changes are limited to
  1. truncated_svd.py
  2. feature_type_text.py
The `object` type will be treated as `string` in the future.
@Louquinze Louquinze requested a review from mfeurer March 3, 2022 13:14
@codecov

codecov bot commented Mar 3, 2022

Codecov Report

Merging #1415 (8d9e159) into development (457e50c) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@               Coverage Diff               @@
##           development    #1415      +/-   ##
===============================================
+ Coverage        84.51%   84.52%   +0.01%     
===============================================
  Files              146      146              
  Lines            11283    11285       +2     
  Branches          1929     1929              
===============================================
+ Hits              9536     9539       +3     
+ Misses            1232     1230       -2     
- Partials           515      516       +1     

mfeurer
mfeurer previously requested changes Mar 3, 2022
@@ -327,7 +325,7 @@ def get_feat_type_from_columns(
 else:
     raise ValueError(
         "Input Column {} has unsupported dtype {}. "
-        "Supported column types are categorical/bool/numerical dtypes. "
+        "Supported column types are categorical/bool/numerical/string dtypes. " # noqa: E501
Contributor

I don't think this is necessary. Please reformat the string so the lines fit within the line limit.

Collaborator Author

I changed it.
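
For reference, one way to keep the message within the line limit without `# noqa: E501` is to split the literal across adjacent string pieces, which Python concatenates at compile time. This is only a sketch: `unsupported_dtype_message` is a hypothetical helper used for illustration and is not part of auto-sklearn, and the wording covers only the part of the message visible in the hunk above.

```python
def unsupported_dtype_message(column: str, dtype: str) -> str:
    """Hypothetical helper: build the error text from short string pieces."""
    # Adjacent string literals inside parentheses are joined into one string,
    # so each source line stays short and no linter suppression is needed.
    return (
        "Input Column {} has unsupported dtype {}. "
        "Supported column types are "
        "categorical/bool/numerical/string dtypes. "
    ).format(column, dtype)


message = unsupported_dtype_message("col", "complex128")
```

The resulting string is identical to the single-line version; only the source layout changes.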

)
X[column] = X[column].astype("string")
Contributor

Does this work for random objects? Could we have a test that the feature validator correctly handles random objects? In general, could you please extend the tests under test/test_data/test_feature_validator.py?

Collaborator Author

I will check the behavior and then update test/test_data/test_feature_validator.py.

Collaborator Author

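import pandas as pd


class Dummy:
    """Arbitrary user-defined object with no special pandas support."""

    def __init__(self, x):
        self.x = x

    def __call__(self):
        print(self.x)

    def dummy_func(self):
        for i in range(100):
            print("do something 100 times")


dummy = Dummy(1)
dummy_2 = Dummy(2)
dummy_3 = Dummy(3)
# One column holds a single object, the other holds a list of objects.
df = pd.DataFrame({"Test object": [dummy], "Test list of objects": [[dummy_2, dummy_3]]})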
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Test object           1 non-null      object
 1   Test list of objects  1 non-null      object
dtypes: object(2)
memory usage: 144.0+ bytes
df = df.astype({"Test object": "string", "Test list of objects": "string"})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Test object           1 non-null      string
 1   Test list of objects  1 non-null      string
dtypes: string(2)
memory usage: 144.0 bytes
print(df)
                                 Test object                               Test list of objects
0  <__main__.Dummy object at 0x7fcfbcf880a0>  [<__main__.Dummy object at 0x7fcff7cb2ac0>, <_...
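
The experiment above could be turned into an automated check along these lines. This is a rough sketch under stated assumptions, not the test that was added to `test_feature_validator.py`: `cast_object_columns_to_string` is a hypothetical helper that only mimics the `X[column].astype("string")` step with plain pandas and does not call the auto-sklearn feature validator.

```python
import pandas as pd


class Dummy:
    """Arbitrary user-defined object, as in the experiment above."""

    def __init__(self, x):
        self.x = x


def cast_object_columns_to_string(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cast every object-dtype column to the string dtype."""
    out = df.copy()
    for column in out.columns:
        if pd.api.types.is_object_dtype(out[column]):
            # astype("string") stores the str() form of each non-string value.
            out[column] = out[column].astype("string")
    return out


dummy = Dummy(1)
df = pd.DataFrame({"obj": [dummy], "num": [1.5]})
converted = cast_object_columns_to_string(df)

# The object column becomes a string column holding str(dummy);
# the numeric column is left untouched.
assert isinstance(converted["obj"].dtype, pd.StringDtype)
assert converted["obj"].iloc[0] == str(dummy)
assert converted["num"].dtype == df["num"].dtype
```

Note that the stored text is the default `str()` of the object, including its memory address, so the resulting strings carry little information unless the class defines `__str__` or `__repr__`.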

The `object` type will be treated as `string` in the future.

add new test case to `test_feature_validator.py`
@Louquinze Louquinze requested a review from mfeurer March 4, 2022 10:58
@Louquinze Louquinze requested a review from eddiebergman March 15, 2022 13:09
Contributor

@eddiebergman eddiebergman left a comment

Looks good to me, include a docstring if you like, otherwise I'll merge :)

@Louquinze
Collaborator Author

> Looks good to me, include a docstring if you like, otherwise I'll merge :)

Just merge; I will add the docstring in a new PR.

@Louquinze Louquinze dismissed mfeurer’s stale review March 15, 2022 14:11

Currently on vacation, so I cannot merge the PR.

@Louquinze Louquinze removed the request for review from mfeurer March 15, 2022 14:12
@eddiebergman eddiebergman merged commit d6b90f1 into automl:development Mar 15, 2022
eddiebergman pushed a commit that referenced this pull request Aug 18, 2022
* rename `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/feature_reduction` to `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction`.

also rename corresponding feature reduction class FeatureReduction to TextFeatureReduction.
`auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction/truncated_svd.py:TextFeatureReduction`

This includes adapting all *csv and *json participating in metalearning

The "real" changes are limited to
  1. truncated_svd.py
  2. feature_type_text.py

* change treatment of generic column dtype `object` for pandas dataframes.
The `object` type will be treated as `string` in the future.

* change treatment of generic column dtype `object` for pandas dataframes.
The `object` type will be treated as `string` in the future.

add new test case to `test_feature_validator.py`