Change treatment of generic column type object
#1415
…ng/feature_reduction` to `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction`. also rename corresponding feature reduction class FeatureReduction to TextFeatureReduction. `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction/truncated_svd.py:TextFeatureReduction` This includes adapting all *csv and *json participating in metalearning The "real" changes are limited to 1. truncated_svd.py 2. feature_type_text.py
The `object` type will be treated as `string` in the future.
Codecov Report
@@             Coverage Diff              @@
##        development    #1415      +/-   ##
===============================================
+ Coverage     84.51%   84.52%   +0.01%
===============================================
  Files           146      146
  Lines         11283    11285       +2
  Branches       1929     1929
===============================================
+ Hits           9536     9539       +3
+ Misses         1232     1230       -2
- Partials        515      516       +1
@@ -327,7 +325,7 @@ def get_feat_type_from_columns(
         else:
             raise ValueError(
                 "Input Column {} has unsupported dtype {}. "
-                "Supported column types are categorical/bool/numerical dtypes. "
+                "Supported column types are categorical/bool/numerical/string dtypes. "  # noqa: E501
I don't think this is necessary. Please reformat the string so the lines fit within the line limit.
I changed it.
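For readers following the review: one common way to keep such a message within the line limit without `# noqa: E501` is to split it into adjacent string literals, which Python joins at compile time. The sketch below is only an illustration; `unsupported_dtype_message` is a hypothetical helper that is not part of auto-sklearn, and the message is cut off where the diff above stops showing it.

```python
def unsupported_dtype_message(column: str, dtype: str) -> str:
    """Hypothetical helper: build the error text from short adjacent literals."""
    # Adjacent string literals inside the parentheses are concatenated by the
    # compiler, so each source line stays short and no `# noqa` is required.
    return (
        "Input Column {} has unsupported dtype {}. "
        "Supported column types are "
        "categorical/bool/numerical/string dtypes."
    ).format(column, dtype)


if __name__ == "__main__":
    print(unsupported_dtype_message("a", "complex128"))
```

The resulting text is the same as with one long literal; only the source layout changes.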
)
X[column] = X[column].astype("string")
Does this work for random objects? Could we have a test that the feature validator correctly handles random objects? In general, could you please extend the tests under test/test_data/test_feature_validator.py?
I will check the behavior and then update test/test_data/test_feature_validator.py.
import pandas as pd

class Dummy:
    def __init__(self, x):
        self.x = x

    def __call__(self):
        print(self.x)

    def dummy_func(self):
        for i in range(100):
            print("do something 100 times")

dummy = Dummy(1)
dummy_2 = Dummy(2)
dummy_3 = Dummy(3)
# One column holds a single arbitrary object, the other a list of objects.
df = pd.DataFrame({"Test object": [dummy], "Test list of objects": [[dummy_2, dummy_3]]})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Test object 1 non-null object
1 Test list of objects 1 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes
df = df.astype({"Test object": "string", "Test list of objects": "string"})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Test object 1 non-null string
1 Test list of objects 1 non-null string
dtypes: string(2)
memory usage: 144.0 bytes
print(df)
Test object Test list of objects
0 <__main__.Dummy object at 0x7fcfbcf880a0> [<__main__.Dummy object at 0x7fcff7cb2ac0>, <_...
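The behavior shown above could be pinned down in a test along the following lines. This is a minimal sketch, not the test that was added to `test_feature_validator.py`: `object_columns_to_string` is a hypothetical stand-in for the conversion step in the feature validator, and the assertions only rely on `astype("string")` falling back to `str()` for non-string values, as the output above suggests.

```python
import pandas as pd


class Dummy:
    """Arbitrary non-string object, mirroring the class in the comment above."""

    def __init__(self, x):
        self.x = x


def object_columns_to_string(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cast every `object` column to the `string` dtype."""
    out = df.copy()
    for column in out.columns:
        if out[column].dtype == object:
            out[column] = out[column].astype("string")
    return out


def test_random_objects_become_strings():
    dummy, dummy_2, dummy_3 = Dummy(1), Dummy(2), Dummy(3)
    df = pd.DataFrame(
        {"Test object": [dummy], "Test list of objects": [[dummy_2, dummy_3]]}
    )
    converted = object_columns_to_string(df)

    # Both columns should now carry the pandas extension string dtype.
    for column in converted.columns:
        assert isinstance(converted[column].dtype, pd.StringDtype)

    # Each value is expected to equal the str() of the original Python object.
    assert converted["Test object"].iloc[0] == str(dummy)
    assert converted["Test list of objects"].iloc[0] == str([dummy_2, dummy_3])
```

Comparing against `str(dummy)` rather than a literal keeps the test independent of the memory address that appears in the default `repr`.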
The `object` type will be treated as `string` in the future. add new test case to `test_feature_validator.py`
Looks good to me, include a docstring if you like, otherwise I'll merge :)
Just merge; I will add the docstring in a new PR.
Currently on vacation, so I cannot merge the PR.
* rename `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/feature_reduction` to `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction`. also rename corresponding feature reduction class FeatureReduction to TextFeatureReduction. `auto-sklearn/autosklearn/pipeline/components/data_preprocessing/text_feature_reduction/truncated_svd.py:TextFeatureReduction` This includes adapting all *csv and *json participating in metalearning The "real" changes are limited to 1. truncated_svd.py 2. feature_type_text.py
* change treatment of generic column dtype `object` for pandas dataframes. The `object` type will be treated as `string` in the future.
* change treatment of generic column dtype `object` for pandas dataframes. The `object` type will be treated as `string` in the future. add new test case to `test_feature_validator.py`
This PR changes the treatment of columns with dtype `object`. These columns will be treated as `string`.