BUG: Fix Series.astype and Categorical.astype to update existing Categorical data #18710
Codecov Report
@@ Coverage Diff @@
## master #18710 +/- ##
==========================================
- Coverage 91.61% 91.59% -0.03%
==========================================
Files 153 153
Lines 51363 51359 -4
==========================================
- Hits 47058 47044 -14
- Misses 4305 4315 +10
pandas/core/categorical.py
Outdated
return self
# GH 18593: keep current categories if None (ordered can't be None)
if dtype.categories is None:
new_categories = self.categories
If you also set ordered=False here (for dtype.categories is None), and in the else branch take ordered from the dtype, then I believe you can remove 439-441 (you would also need to make 450 be dtype = CategoricalDtype(new_categories, ordered)).
The problem is that astype('category') should not change the ordered attribute (so not set it always to False), so you would need to take the ordered from self. But then, you are not really sure if the user specified the order of the CategoricalDtype specifically, or if the ordered=False came from the default value.
To summarize, I think it is easier to leave it as is and treat 'category' as a special case.
(we might want to check if we can't let the ordered keyword have a default of None to make it easier to deal with this, but that is another issue)
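A minimal sketch of the idea in the parenthetical above, assuming a hypothetical `ToyDtype` class (not the pandas `CategoricalDtype`) whose `ordered` defaults to None. With a None default, "the user did not say" becomes distinguishable from an explicit ordered=False, so no special case for the 'category' string would be needed:

```python
from typing import Optional, Sequence


class ToyDtype:
    """Hypothetical stand-in for a categorical dtype; NOT the pandas class."""

    def __init__(self, categories: Optional[Sequence[str]] = None,
                 ordered: Optional[bool] = None):
        self.categories = categories
        # None means "not specified by the user".
        self.ordered = ordered


def resolve_ordered(existing_ordered: bool, requested: ToyDtype) -> bool:
    """Keep the existing flag unless the request sets `ordered` explicitly."""
    if requested.ordered is None:
        return existing_ordered
    return requested.ordered


# An ordered categorical stays ordered when nothing was requested ...
unspecified = resolve_ordered(True, ToyDtype())
# ... and becomes unordered only when the user asks for it.
explicit = resolve_ordered(True, ToyDtype(ordered=False))
```

Whether pandas should adopt such a default is exactly the open question the comment defers to a separate issue.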
The problem is that astype('category') should not change the ordered attribute (so not set it always to False), so you would need to take the ordered from self. But then, you are not really sure if the user specified the order of the CategoricalDtype specifically, or if the ordered=False came from the default value.
Not sure this is true. .astype('category') is clearly == CategoricalDtype(), which by definition has ordered=False. I don't know how you can reach any other conclusion. Furthermore, if this is NOT the case, then we should fix it immediately, as a special case for this is monumentally confusing. The very fact that we have to have this discussion attests to this.
Because that would be changing the existing behaviour:
In [2]: pd.Categorical(['a', 'b'], ordered=True)
Out[2]:
[a, b]
Categories (2, object): [a < b]
In [3]: pd.Categorical(['a', 'b'], ordered=True).astype('category')
Out[3]:
[a, b]
Categories (2, object): [a < b]
I personally think the above is the logical behaviour, but I can also see the case for making the above ordered=False.
My main reason for liking the above is that the facts that 'category' == CategoricalDtype() and that CategoricalDtype has a default of ordered=False should be more of an implementation detail to the user.
But let's maybe open a new issue to discuss that?
And keep this PR just fixing the bug without changing existing behaviour.
Can you refactor this to a common method that you can use in #18677 (or on that PR is ok too)
Will do that over in #18677, since it seems to be closer to being complete. Or I can close the two individual PRs and create a new PR that combines both, if that would be preferable. I didn't realize the fixes would be so similar until the first PR was already in review.
pandas/core/categorical.py
Outdated
@@ -435,10 +436,24 @@ def astype(self, dtype, copy=True):
.. versionadded:: 0.19.0

"""
if isinstance(dtype, compat.string_types) and dtype == 'category':
see my next comment
kwargs.setdefault('categories', categories)
kwargs.setdefault('ordered', ordered)
return self.make_block(Categorical(self.values, **kwargs))
if is_categorical_dtype(self.values):
I don't think you need all of this logic; wouldn't
    values = self.values.astype(dtype, copy=copy)
    return self.make_block(values, dtype=dtype)
be enough? (If values is a Categorical already or dtype is a CDT, it will infer correctly, and if it's not, it will as well.)
I don't think that quite works, since self.values can be a different object depending on what self is: if self is already categorical, then self.values is a Categorical, otherwise self.values is a numpy array.
In the numpy case, self.values.astype raises TypeError: data type not understood when a CDT is passed as the dtype.
Likewise, self.make_block(Categorical(self.values, dtype=dtype)) also doesn't work by itself. In the Categorical case, the constructor ignores the dtype parameter when the input data is already Categorical, so no update occurs.
Seems like the two paths are necessary? Or am I overlooking something?
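For illustration, here is a small sketch of the user-facing behaviour the two paths are meant to produce, assuming a pandas release that includes this fix (the comment above describes the internals at the time of the PR, which may differ in later versions). Whether the Series starts out as plain object values or as an existing categorical, astype with a CategoricalDtype should yield the requested categories and ordered flag:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Target dtype: a subset of the values, with an explicit order.
new_dtype = CategoricalDtype(categories=['a', 'b'], ordered=True)

# Path 1: the underlying values are a plain (object) array.
plain = pd.Series(['a', 'b', 'c'])
from_plain = plain.astype(new_dtype)

# Path 2: the underlying values are already categorical.
existing = pd.Series(['a', 'b', 'c'], dtype='category')
from_existing = existing.astype(new_dtype)

# Both results carry the new categories; 'c' is not a category, so it is NaN.
print(from_plain.dtype == from_existing.dtype)
print(from_existing.isna().tolist())
```

Before the fix, the second path was the one reported as broken: the existing categories were kept instead of being updated.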
ok seems reasonable then
give this a rebase and use |
f9a1457 to 0fb9140
Rebased and used Will write up an issue in the next day or so to discuss the behavior of |
Looks good to me, just had some small comments on the tests
expected = np.array(cat, dtype=np.float)
tm.assert_numpy_array_equal(result, expected)

@pytest.mark.parametrize('copy', [True, False])
you are not really testing the effect of the copy keyword
@@ -99,10 +99,56 @@ def test_codes_dtypes(self):
result = result.remove_categories(['foo%05d' % i for i in range(300)])
assert result.codes.dtype == 'int8'

def test_astype_categorical(self):
@pytest.mark.parametrize('ordered', [True, False])
@pytest.mark.parametrize('copy', [True, False])
is this copy needed here? it is not doing anything in all those cases (they will copy anyway, so I think just using it once differently in the test is good enough)
pandas/tests/series/test_dtypes.py
Outdated
@@ -322,6 +322,45 @@ def cmp(a, b):
lambda x: x.astype('object').astype(Categorical)]:
pytest.raises(TypeError, lambda: invalid(s))

@pytest.mark.parametrize('copy', [True, False])
same here
0fb9140 to 6702f90
Updated to remove the unnecessary |
return self.copy()
return self
# GH 10696/18593
dtype = self.dtype._update_dtype(dtype)
We might want to add some explicit tests for _update_dtype at some point (separate PR).
added them in the other PR during the initial implementation actually:
pandas/pandas/tests/dtypes/test_dtypes.py
Lines 127 to 152 in 265e327
@pytest.mark.parametrize('dtype', [
    CategoricalDtype(list('abc'), False),
    CategoricalDtype(list('abc'), True)])
@pytest.mark.parametrize('new_dtype', [
    'category',
    CategoricalDtype(None, False),
    CategoricalDtype(None, True),
    CategoricalDtype(list('abc'), False),
    CategoricalDtype(list('abc'), True),
    CategoricalDtype(list('cba'), False),
    CategoricalDtype(list('cba'), True),
    CategoricalDtype(list('wxyz'), False),
    CategoricalDtype(list('wxyz'), True)])
def test_update_dtype(self, dtype, new_dtype):
    if isinstance(new_dtype, string_types) and new_dtype == 'category':
        expected_categories = dtype.categories
        expected_ordered = dtype.ordered
    else:
        expected_categories = new_dtype.categories
        if expected_categories is None:
            expected_categories = dtype.categories
        expected_ordered = new_dtype.ordered
    result = dtype._update_dtype(new_dtype)
    tm.assert_index_equal(result.categories, expected_categories)
    assert result.ordered is expected_ordered
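The rules this test encodes can be restated as a small pure-Python sketch. `ToyCategoricalDtype` and `update_dtype` below are hypothetical stand-ins written for illustration, not the pandas implementation of `_update_dtype`; they only mirror the expectations spelled out in the test:

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class ToyCategoricalDtype:
    """Hypothetical stand-in for CategoricalDtype; NOT the pandas class."""
    categories: Optional[Tuple[str, ...]] = None
    ordered: bool = False


def update_dtype(current: ToyCategoricalDtype,
                 new: Union[str, ToyCategoricalDtype]) -> ToyCategoricalDtype:
    """Combine `new` with `current` following the rules the test checks."""
    if isinstance(new, str):
        if new != 'category':
            raise ValueError("only the string 'category' is supported here")
        # The bare string keeps both the categories and the ordered flag.
        return current
    # A dtype without categories keeps the current categories ...
    categories = (new.categories if new.categories is not None
                  else current.categories)
    # ... but `ordered` always comes from the new dtype.
    return ToyCategoricalDtype(categories, new.ordered)


current = ToyCategoricalDtype(('a', 'b', 'c'), ordered=True)
kept = update_dtype(current, 'category')
partial = update_dtype(current, ToyCategoricalDtype(None, False))
replaced = update_dtype(current, ToyCategoricalDtype(('w', 'x', 'y', 'z'), True))
```

Here `kept` equals `current`, `partial` keeps the categories but drops the ordering, and `replaced` takes both from the new dtype.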
thanks @jschendel nice patches! keep em coming!
Change in pandas-dev/pandas#18710 caused a dask failure when reading CSV files, as our `.astype` relied on the old (broken) behavior. Closes dask#2996
* COMPAT: Pandas 0.22.0 astype for categorical dtypes
  Change in pandas-dev/pandas#18710 caused a dask failure when reading CSV files, as our `.astype` relied on the old (broken) behavior. Closes #2996
* Fix pandas version check
* Refactored
* update docs
* compat
* Simplify
* Simplify
* Update changelog.rst
git diff upstream/master -u -- "*.py" | flake8 --diff
Couldn't find an issue about it, but the same problem described with Series.astype in the linked issues was occurring with Categorical.astype. Put in a fix for that too, with some code very similar to what was done in #18677 for CategoricalIndex.astype. Could probably consolidate the two into a single helper function, potentially as part of #18704.