DISC: Behavior of .astype('category') on existing categorical data #18790

jschendel · 2017-12-15T08:13:06Z

Background

Follow-up from this specific chain of comments: #18710 (comment)
And these PRs in general: #18677, #18710

Issue

For the context of this discussion, I'm only referring to data that is already categorical; I don't think there was any ambiguity with converting non-categorical data to categorical. This applies when using .astype('category') on Categorical, CategoricalIndex, and Series.

The crux of the issue comes down to whether .astype('category') should ever change data that is already categorical. An argument that it shouldn't is that .astype('category') doesn't explicitly specify any changes, so nothing should be changed, and it's the existing behavior.
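A minimal sketch of the existing behavior described above, assuming a reasonably recent pandas; the data and variable names are illustrative only and not taken from the PRs:

```python
import pandas as pd

# Illustrative ordered categorical data in the three containers mentioned above.
cat = pd.Categorical(["a", "b", "a"], categories=["a", "b"], ordered=True)
idx = pd.CategoricalIndex(cat)
ser = pd.Series(cat)

# The string 'category' specifies no categories and no ordering,
# so re-casting leaves the existing dtype alone.
cat_result = cat.astype("category")
idx_result = idx.astype("category")
ser_result = ser.astype("category")

for name, obj in [("Categorical", cat_result),
                  ("CategoricalIndex", idx_result),
                  ("Series", ser_result)]:
    print(name, obj.dtype)
```

In each case the printed dtype should still report the original categories and `ordered=True`.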

The other argument is that .astype('category') should be equivalent to .astype(CategoricalDtype()). Note that CategoricalDtype() is the same as CategoricalDtype(categories=None, ordered=False):

In [2]: CategoricalDtype()
Out[2]: CategoricalDtype(categories=None, ordered=False)

This means that if the existing categorical data is ordered, then .astype(CategoricalDtype()) would change the categorical data from having ordered=True to ordered=False, and so .astype('category') should do the same.
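To illustrate, here is a small sketch assuming pandas with `CategoricalDtype` imported from `pandas.api.types`. It spells out `ordered=False` explicitly rather than relying on the constructor default, since that default is exactly what this discussion is about; the data is made up for the example:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative ordered categorical Series.
s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b"], ordered=True))

# categories=None means "keep the existing categories";
# ordered=False is an explicit request for unordered data.
unordered = s.astype(CategoricalDtype(categories=None, ordered=False))

print(s.dtype.ordered)          # ordering flag before the cast
print(unordered.dtype.ordered)  # ordering flag after the cast
print(list(unordered.dtype.categories))
```

The categories and values are untouched; only the ordering flag is dropped.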

I don't think there are any scenarios where the categories themselves would change; the only potential thing that could change is ordered=True to ordered=False. See below for a summary of some potential options. Feel free to modify any of the pro/cons listed below, or suggest any other potential options.

Option 1: `.astype('category')` does not change anything

This would not require any additional code changes, as it's the current behavior.

Pros:

Maintains the current behavior of .astype('category')
Less likely to cause user confusion due to unforeseen changes
- At least in my mind, but I could be convinced otherwise
- Forces the user to be explicit when making potentially unintended changes

Cons:

Inconsistent with .astype(CategoricalDtype())
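Under Option 1, a user who does want to drop the ordering has to say so explicitly. One way to do that, sketched here with the existing `Series.cat.as_unordered()` accessor method (the choice of this method is my illustration, not something proposed in the PRs):

```python
import pandas as pd

# Illustrative ordered categorical Series.
s = pd.Series(pd.Categorical(["low", "high", "low"],
                             categories=["low", "high"], ordered=True))

# 'category' alone changes nothing...
same = s.astype("category")

# ...so dropping the ordering is an explicit, separate step.
explicit = s.cat.as_unordered()

print(same.dtype.ordered, explicit.dtype.ordered)
```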

Option 2: `.astype('category')` changes `ordered=True` to `ordered=False`

This would require some additional code changes, but they are relatively minor.

Pros:

Makes .astype('category') consistent with .astype(CategoricalDtype())
A bit cleaner/more maintainable in terms of code
- No special case checking for the string 'category'

Cons:

Changes current behavior of .astype('category')

Option 3: Allow `ordered=None` in `CategoricalDtype`

Basically, make CategoricalDtype() return CategoricalDtype(categories=None, ordered=None). I should preface this by saying that I have not scoped out the amount of code that would need to be changed for this, nor the potential ramifications. This may not be a good idea.

Pros:

Maintains the current behavior of .astype('category')
Makes .astype('category') consistent with .astype(CategoricalDtype())

Cons:

Changes the default behavior of CategoricalDtype
Could potentially involve a lot of code change and unseen ramifications
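A rough sketch of how Option 3 would look from the user's side, assuming a pandas version that accepts `ordered=None` in `CategoricalDtype`. `ordered=None` is passed explicitly here because what the bare `CategoricalDtype()` constructor defaults to may differ between versions:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# ordered=None is read as "ordering not specified", not as "unordered".
unspecified = CategoricalDtype(categories=None, ordered=None)

ordered_s = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b"], ordered=True))
unordered_s = pd.Series(pd.Categorical(["a", "b"], categories=["a", "b"], ordered=False))

# Casting with an unspecified dtype should leave each Series' ordering as it was.
ordered_result = ordered_s.astype(unspecified)
unordered_result = unordered_s.astype(unspecified)

print(ordered_result.dtype.ordered, unordered_result.dtype.ordered)
```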

jschendel · 2017-12-15T08:14:20Z

cc: @jreback, @jorisvandenbossche, @TomAugspurger

jreback · 2017-12-15T11:10:37Z

I like 3. Can you make a quick test to see if it's feasible?

TomAugspurger · 2017-12-15T11:48:25Z

If option 3 isn't too difficult, I think it'd be best. I want `'category'` to be equivalent to `CategoricalDtype()`. Currently that's true for creating a new categorical. It'd be nice if it were true for coercing existing categoricals.
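For reference, the "creating a new categorical" case can be checked with a short sketch like the following (illustrative data; assumes pandas with `CategoricalDtype` from `pandas.api.types`):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Non-categorical input: categories are inferred from the values.
raw = pd.Series(["a", "b", "a"])

from_string = raw.astype("category")
from_dtype = raw.astype(CategoricalDtype())

# Both spellings should give the same inferred, unordered dtype.
print(from_string.dtype == from_dtype.dtype)
```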

jschendel · 2017-12-20T01:22:42Z

Option 3 doesn't look to be that bad. Should have a PR within the next day or so, depending on my free time. There are only a couple of ambiguous points I've encountered:

Equality: How should comparisons with CDT(*, None) work?
Hashing: Should hashing CDT(*, None) produce a different hash?

Regarding equality, my current plan is to treat ordered=None as if it were ordered=False:

CDT(['a', 'b'], None) == CDT(['a', 'b'], False) --> True
CDT(['a', 'b'], None) == CDT(['b', 'a'], False) --> True
CDT(['a', 'b'], None) == CDT(['a', 'b'], True) --> False

This maintains existing comparison behavior when ordered is not specified:

CDT(['a', 'b'], False) == CDT(['a', 'b']) --> True
CDT(['a', 'b'], True) == CDT(['a', 'b']) --> False
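The comparisons above can be written as a runnable check. This is a sketch of the planned semantics, assuming a pandas version that accepts `ordered=None`; `CDT` is just a local alias for `CategoricalDtype`:

```python
from pandas.api.types import CategoricalDtype as CDT

comparisons = {
    # ordered=None compared against ordered=False / ordered=True
    "none_vs_false_same_order": CDT(["a", "b"], ordered=None) == CDT(["a", "b"], ordered=False),
    "none_vs_false_reordered": CDT(["a", "b"], ordered=None) == CDT(["b", "a"], ordered=False),
    "none_vs_true": CDT(["a", "b"], ordered=None) == CDT(["a", "b"], ordered=True),
    # ordered left at the constructor default
    "false_vs_default": CDT(["a", "b"], ordered=False) == CDT(["a", "b"]),
    "true_vs_default": CDT(["a", "b"], ordered=True) == CDT(["a", "b"]),
}

for label, result in comparisons.items():
    print(label, result)
```

Note that the order of the categories is ignored only when neither side is ordered.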

Regarding hashing, without any code modifications CDT(*, None) will have the same hash as CDT(*, False). This seems to be consistent with how I plan to treat equality. Makes the logic implementing equality nicer too, since the case when both dtypes are unordered currently relies on hashes.
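A quick way to sanity-check the hashing claim, sketched under the assumption that `ordered=None` is accepted and using simple string categories (the dictionary is only there to show why equal dtypes need equal hashes):

```python
from pandas.api.types import CategoricalDtype as CDT

unspecified = CDT(["a", "b"], ordered=None)
unordered = CDT(["a", "b"], ordered=False)

# Dtypes that compare equal must also hash equal to behave in dicts and sets.
same_hash = hash(unspecified) == hash(unordered)

# Illustrative lookup: a key stored with ordered=False is found via ordered=None.
registry = {unordered: "unordered a/b"}
found = registry.get(unspecified)

print(same_hash, found)
```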

jreback added the API Design and Categorical (Categorical Data Type) labels Dec 15, 2017

jschendel mentioned this issue Dec 21, 2017: API: Allow ordered=None in CategoricalDtype #18889 (merged)

jreback added this to the 0.23.0 milestone Dec 21, 2017

jreback closed this as completed in #18889 Feb 10, 2018
