Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISC: Behavior of .astype('category') on existing categorical data #18790

Closed
jschendel opened this issue Dec 15, 2017 · 4 comments · Fixed by #18889
Closed

DISC: Behavior of .astype('category') on existing categorical data #18790

jschendel opened this issue Dec 15, 2017 · 4 comments · Fixed by #18889
Labels
API Design Categorical Categorical Data Type
Milestone

Comments

@jschendel
Copy link
Member

Background

Follow-up from this specfic chain of comments: #18710 (comment)
And these PR's in general: #18677, #18710

Issue

For the context of this discussion, I'm only referring to data that is already categorical; I don't think there was any ambiguity with converting non-categorical to categorical. This applies using .astype('category') on Categorical, CategoricalIndex, and Series.

The crux of the issue comes down to whether .astype('category') should ever change data that is already categorical. An argument that it shouldn't is that .astype('category') doesn't explicitly specify any changes, so nothing should be changed, and it's the existing behavior.

The other argument is that .astype('category') should be equivalent to .astype(CategoricalDtype()). Note that CategoricalDtype() is the same as CategoricalDtype(categories=None, ordered=False):

In [2]: CategoricalDtype()
Out[2]: CategoricalDtype(categories=None, ordered=False)

This means that if the existing categorical data is ordered, then .astype(CategoricalDtype()) would change the categorical data from having ordered=True to ordered=False, and so .astype('category') should do the same.

I don't think there are any scenarios where the categories themselves would change; the only potential thing that could change is ordered=True to ordered=False. See below for a summary of some potential options. Feel free to modify any of the pro/cons listed below, or suggest any other potential options.

Option 1: .astype('category') does not change anything

This would not require any additional code changes, as it's the current behavior.

Pros:

  • Maintains current behavior .astype('category')
  • Less likely to cause user confusion due to unforeseen changes
    • At least in my mind, but I could be convinced otherwise
    • Forces the user to be explicit when making potentially unintended changes

Cons:

  • Inconsistent with .astype(CategoricalDtype())

Option 2: .astype('category') changes ordered=True to ordered=False

This would require some additional code changes, but is relatively minor.

Pros:

  • Makes .astype('category') consistent with .astype(CategoricalDtype())
  • A bit cleaner/more maintainable in terms of code
    • No special case checking for the string 'category'

Cons:

  • Changes current behavior of .astype('category')

Option 3: Allow ordered=None in CategoricalDtype

Basically, make CategoricalDtype() return CategoricalDtype(categories=None, ordered=None). I should preface this by saying that I have not scoped out the amount of code that would need to be changed for this, nor the potential ramifications. This may not be a good idea.

Pros:

  • Maintains current behavior .astype('category')
  • Makes .astype('category') consistent with .astype(CategoricalDtype())

Cons:

  • Changes the default behavior of CategoricalDtype
  • Could potentially involve a lot of code change and unseen ramifications
@jschendel
Copy link
Member Author

@jreback
Copy link
Contributor

jreback commented Dec 15, 2017

I like 3. can you make a quick test to see if its feasible?

@jreback jreback added API Design Categorical Categorical Data Type labels Dec 15, 2017
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 15, 2017 via email

@jschendel
Copy link
Member Author

Option 3 doesn't look to be that bad. Should have a PR within the next day or so, depending on my free time. There are only a couple ambiguous points I've encountered:

  • Equality: How should comparisons with CDT(*, None) work?
  • Hashing: Should hashing CDT(*, None) produce a different hash?

Regarding equality, my current plan is to treat ordered=None as if it where ordered=False:

  • CDT(['a', 'b'], None) == CDT(['a', 'b'], False) --> True
  • CDT(['a', 'b'], None) == CDT(['b', 'a'], False) --> True
  • CDT(['a', 'b'], None) == CDT(['a', 'b'], True) --> False

This maintains existing comparison behavior when ordered is not specified:

  • CDT(['a', 'b'], False) == CDT(['a', 'b']) --> True
  • CDT(['a', 'b'], True) == CDT(['a', 'b']) --> False

Regarding hashing, without any code modifications CDT(*, None) will have the same hash as CDT(*, False). This seems to be consistent with how I plan to treat equality. Makes the logic implementing equality nicer too, since the case when both dtypes are unordered currently relies on hashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants