Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix Series.astype and Categorical.astype to update existing Categorical data #18710

Merged
merged 3 commits into from
Dec 13, 2017

Conversation

jschendel
Copy link
Member

Couldn't find an issue about it, but the same problem described with Series.astype in the linked issues was occurring with Categorical.astype. Put in a fix for that too with some code very similar to what was done in #18677 for CategoricalIndex.astype. Could probably consolidate the two into a single helper function, potentially as part of #18704.

@codecov
Copy link

codecov bot commented Dec 10, 2017

Codecov Report

Merging #18710 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18710      +/-   ##
==========================================
- Coverage   91.61%   91.59%   -0.03%     
==========================================
  Files         153      153              
  Lines       51363    51359       -4     
==========================================
- Hits        47058    47044      -14     
- Misses       4305     4315      +10
Flag Coverage Δ
#multiple 89.46% <100%> (-0.01%) ⬇️
#single 40.77% <84.61%> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.79% <100%> (+0.14%) ⬆️
pandas/core/internals.py 94.42% <100%> (-0.02%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/util/testing.py 82.32% <0%> (-0.2%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9705a48...6702f90. Read the comment docs.

return self
# GH 18593: keep current categories if None (ordered can't be None)
if dtype.categories is None:
new_categories = self.categories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you also set ordered=False here (for dtype.categories is None) and the else take the ordered from the dtype, then I believe you can remvoe 439-441 (also needt o make 450 be

dtype = CategoricalDtype(new_categories, ordered)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that astype('category') should not change the ordered attribute (so not set it always to False), so you would need to take the ordered from self. But then, you are not really sure if the user specified the order of the CategoricalDtype specifically, or if the ordered=False came from the default value.

To summarize, I think it is easier to leave it as is and treat 'category' as a special case.

(we might want to check if we can't let the ordered keyword have a default of None to make it easier to deal with this, but that is another issue)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is pretty much the same block of code as in #18677

this is way too special casey to exist in 2 places. Can you refator this to a common method that you can use in #18677 (or on that PR is ok too).

I'd like to do that before merging either one of these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that astype('category') should not change the ordered attribute (so not set it always to False), so you would need to take the ordered from self. But then, you are not really sure if the user specified the order of the CategoricalDtype specifically, or if the ordered=False came from the default value.

not sure this is True. .astype('category') is clearly == CategoricalDtype() which by-definition has ordered=False. I don't know how you can have any other conclusion. Furthermore if this is NOT the case. Then we should immediately fix this. As a special case for this is monumentally confusing. The very fact that we have to have this discussion attests to this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because that would be changing the existing behaviour:

In [2]: pd.Categorical(['a', 'b'], ordered=True)
Out[2]: 
[a, b]
Categories (2, object): [a < b]

In [3]: pd.Categorical(['a', 'b'], ordered=True).astype('category')
Out[3]: 
[a, b]
Categories (2, object): [a < b]

I personally think the above is the logical behaviour, but I can also see a point in to make the above ordered=False.
Main reason for liking the above is that 'category' == CategoricalDtype() and that CategoricalDtype has a default of ordered=False should be more an implementation detail to the user.

But let's maybe open a new issue to discuss that?
And keep this PR just fixing the bug without changing existing behaviour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you refactor this to a common method that you can use in #18677 (or on that PR is ok too)

Will do that over in #18677, since it seems to be closer to being complete. Or can close the two individual PR's and create a new PR that combines both, if that would be preferable. Didn't realize the fixes would be so similar until the first PR was already in review.

@@ -435,10 +436,24 @@ def astype(self, dtype, copy=True):
.. versionadded:: 0.19.0

"""
if isinstance(dtype, compat.string_types) and dtype == 'category':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my next comment

kwargs.setdefault('categories', categories)
kwargs.setdefault('ordered', ordered)
return self.make_block(Categorical(self.values, **kwargs))
if is_categorical_dtype(self.values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need all of this logic, wouldn't

values = self.values.astype(dtype, copy=copy)

return self.make_block(values, dtype=dtype)

be enough (if values is a Categorical already or dtype is a CDT, it will infer correctly, and if its not it will as well).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that quite works, since self.values can be a different object depending on what self is: if self is already categorical, then self.values is a Categorical, otherwise self.values is a numpy array.

In the numpy case, self.values.astype raises TypeError: data type not understood when a CDT is passed as the dtype.

Likewise, self.make_block(Categorical(self.values, dtype=dtype)) also doesn't work by itself. In the Categorical case, the constructor ignores the dtype parameter when the input data is already Categorical, so no update occurs.

Seems like the two paths are necessary? Or am I overlooking something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok seems reasonable then

@jreback jreback added Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Dec 10, 2017
@jreback
Copy link
Contributor

jreback commented Dec 11, 2017

give this a rebase and use _update_dtype

@jschendel
Copy link
Member Author

Rebased and used _update_dtype.

Will write up an issue in the next day or so to discuss the behavior of .astype('category') on existing categorical data, extending on the discussion that took place further up in the code review comments.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just had some small comments on the tests

expected = np.array(cat, dtype=np.float)
tm.assert_numpy_array_equal(result, expected)

@pytest.mark.parametrize('copy', [True, False])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are not really testing the effect of the copy keyword

@@ -99,10 +99,56 @@ def test_codes_dtypes(self):
result = result.remove_categories(['foo%05d' % i for i in range(300)])
assert result.codes.dtype == 'int8'

def test_astype_categorical(self):
@pytest.mark.parametrize('ordered', [True, False])
@pytest.mark.parametrize('copy', [True, False])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this copy needed here? it is not doing anything in all those cases (they will copy anyway, so I think just using it once differently in the test is good enough)

@@ -322,6 +322,45 @@ def cmp(a, b):
lambda x: x.astype('object').astype(Categorical)]:
pytest.raises(TypeError, lambda: invalid(s))

@pytest.mark.parametrize('copy', [True, False])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@jschendel
Copy link
Member Author

Updated to remove the unnecessary copy parametrize. Removed an additional instance of this that was introduced in #18677.

return self.copy()
return self
# GH 10696/18593
dtype = self.dtype._update_dtype(dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to add some explicit tests for _update_dtype at some point (separate PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added them in the other PR during the initial implementation actually:

@pytest.mark.parametrize('dtype', [
CategoricalDtype(list('abc'), False),
CategoricalDtype(list('abc'), True)])
@pytest.mark.parametrize('new_dtype', [
'category',
CategoricalDtype(None, False),
CategoricalDtype(None, True),
CategoricalDtype(list('abc'), False),
CategoricalDtype(list('abc'), True),
CategoricalDtype(list('cba'), False),
CategoricalDtype(list('cba'), True),
CategoricalDtype(list('wxyz'), False),
CategoricalDtype(list('wxyz'), True)])
def test_update_dtype(self, dtype, new_dtype):
if isinstance(new_dtype, string_types) and new_dtype == 'category':
expected_categories = dtype.categories
expected_ordered = dtype.ordered
else:
expected_categories = new_dtype.categories
if expected_categories is None:
expected_categories = dtype.categories
expected_ordered = new_dtype.ordered
result = dtype._update_dtype(new_dtype)
tm.assert_index_equal(result.categories, expected_categories)
assert result.ordered is expected_ordered

@jreback jreback added this to the 0.22.0 milestone Dec 13, 2017
@jreback jreback merged commit 73e73db into pandas-dev:master Dec 13, 2017
@jreback
Copy link
Contributor

jreback commented Dec 13, 2017

thanks @jschendel nice patches! keep em coming!

@jschendel jschendel deleted the cat-astype-cat branch December 13, 2017 19:34
TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Dec 14, 2017
Change in pandas-dev/pandas#18710 caused a dask failure
when reading CSV files, as our `.astype` relied on the old (broken) behavior.

Closes dask#2996
TomAugspurger added a commit to dask/dask that referenced this pull request Dec 14, 2017
* COMPAT: Pandas 0.22.0 astype for categorical dtypes

Change in pandas-dev/pandas#18710 caused a dask failure
when reading CSV files, as our `.astype` relied on the old (broken) behavior.

Closes #2996

* Fix pandas version check

* Refactored

* update docs

* compat

* Simplify

* Simplify

* Update changelog.rst
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
3 participants