Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate ordered=None for CategoricalDtype #26336

Closed
TomAugspurger opened this issue May 10, 2019 · 13 comments · Fixed by #26403
Closed

Deprecate ordered=None for CategoricalDtype #26336

TomAugspurger opened this issue May 10, 2019 · 13 comments · Fixed by #26403
Labels
Categorical Categorical Data Type Deprecate Functionality to remove in pandas Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@TomAugspurger
Copy link
Contributor

xref #26327 (comment).

We currently allow CategoricalDtype.ordered to be None. It should just be a bool (default False).

In [8]: CategoricalDtype(categories=None, ordered=None).ordered
# None

categories can be None, which means "infer later on". But (IIRC) we don't do that for ordered.

@TomAugspurger TomAugspurger added Categorical Categorical Data Type Deprecate Functionality to remove in pandas Dtype Conversions Unexpected or buggy dtype conversions labels May 10, 2019
@jschendel
Copy link
Member

jschendel commented May 10, 2019

But (IIRC) we don't do that for ordered.

There is some reliance on this for astype, where we allow users to update categories by specifying them in a CategoricalDtype without needing to worry about specifying the same ordered (i.e. leaving it as None).

Basically:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+520.g2ef50aea04'

In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)

In [3]: s = pd.Series(list('abcdaba'), dtype=cdt1)

In [4]: cdt2 = pd.api.types.CategoricalDtype(categories=list('cedafb'))

In [5]: cdt2.ordered is None
Out[5]: True

In [6]: s = s.astype(cdt2)

In [7]: s.dtype.ordered
Out[7]: True

Having ordered default to False would change the above behavior. Not that big of a deal though, and probably better to force users to be explicit about this.

@jorisvandenbossche
Copy link
Member

Having ordered default to False would change the above behavior.

That would indeed be a breaking change, so should we maybe see if we need to deprecate it first? (not sure it is worth it)

Personally, seeing that example, I would say that's even an extra reason to change it, because I find it very confusing that doing a astype(CategoricalDtype(..)) does not follow the default but keeps the original data ordered attribute.

@TomAugspurger
Copy link
Contributor Author

Thanks @jschendel. I agree with

probably better to force users to be explicit about this.

How do we actually deprecate this behavior without breaking things? Changing the default in CategoricalDtype.__init__ to False is a breaking change...

@jorisvandenbossche
Copy link
Member

How do we actually deprecate this behavior without breaking things?

Change the default to some other sentinel that we can catch? (_default = object()) And then only raise the warning when there would actually be a change in behaviour (i.e. when he dtype is used to convert data where the original ordered would not match with the new default)
Will not be the most clean code though ...

@jschendel
Copy link
Member

Change the default to some other sentinel that we can catch? (_default = object())

Is this necessary? My understanding is that a different sentinel is used when we want to distinguish between when a user explicitly passes None versus the default None being passed. In this case I don't think it matters and we'd want to warn in either case. Or am I overlooking a different usecase?

@jorisvandenbossche
Copy link
Member

Ah, yes, you're right :) Of course, there still might be the case where the user actually passes None (CategoricalDtype(categories=[...], ordered=None), which is kind of pointless, but it currently allowed ... So the question is if we should warn for that case as well, or just hard break that in the future.

@jschendel
Copy link
Member

Seems easiest to warn in both cases, and probably more user friendly too. Opening a PR shortly.

@jschendel
Copy link
Member

The tests revealed one additional breaking case: overriding the dtype of an existing Categorical when constructing a Series:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+554.g3b24fb678'

In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)

In [3]: cdt2 = pd.api.types.CategoricalDtype(categories=list('cedafb'))

In [4]: cat = pd.Categorical(list('abcdaba'), dtype=cdt1)

In [5]: pd.Series(cat, dtype=cdt2).dtype
Out[5]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=True)

Interestingly, this doesn't appear to happen with other constructors:

In [6]: pd.Categorical(cat, dtype=cdt2).dtype
Out[6]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

In [7]: pd.CategoricalIndex(cat, dtype=cdt2).dtype
Out[7]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

In [8]: pd.Index(cat, dtype=cdt2).dtype
Out[8]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

In [9]: pd.array(cat, dtype=cdt2).dtype
Out[9]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

Will update the PR accordingly.

@jschendel
Copy link
Member

Looks like there's one additional complicating factor. When passing the string 'category' as the dtype, would we want to maintain the existing ordered, even if it's True? I think we would want to maintain the existing ordered, as we've made special cases for the string 'category' elsewhere in the codebase.

This gets a little messy in the Series constructor case, as described above, e.g.

In [1]: import pandas as pd

In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)

In [3]: cat = pd.Categorical(list('abcdaba'), dtype=cdt1)

In [4]: pd.Series(cat, dtype='category').dtype
Out[4]: CategoricalDtype(categories=['c', 'd', 'a', 'b'], ordered=True)

The current implementation in the Series constructor immediately converts 'category' to CategoricalDtype(None, None), meaning it relies on the ordered=None behavior, so some adjustments will need to be made to handle this:

if dtype is not None:
dtype = self._validate_dtype(dtype)

@jorisvandenbossche
Copy link
Member

Looks like there's one additional complicating factor. When passing the string 'category' as the dtype, would we want to maintain the existing ordered, even if it's True?

Ah, yes. I actually think this was the main reason to have this ordered=None, to be able to represent the meaning of the string 'category' dtype description ..

@jorisvandenbossche
Copy link
Member

Another option might be to actually keep this ordered=None behaviour, but still deprecate it as default? Meaning that if you do CategoricalDtype(['a', 'b', 'c']) this will actually be with ordered=False, but a 'category' string would still be converted to CategoricalDtype(ordered=None) ..
Not the nicest solution, but I don't really see how we would otherwise keep this behaviour of dtype='category' (apart from special casing that string in several places, which is also not nice)

Unless we actually want to deprecate that behaviour of dtype='category' of preserving the orderedness of the input as well, but I don't think we want that?

@jschendel
Copy link
Member

Another option might be to actually keep this ordered=None behaviour, but still deprecate it as default?

I'm in favor of this, as keeping a CategoricalDtype representation of 'category' seems like the cleaner solution. In my PR I've introduced special casing in a few places to maintain the behavior of 'category', and there's still some more special casing that needs to be added as I currently have some test failures related to this.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 7, 2019

What shall we do here? @jschendel to what state of the discussion is the PR updated? (is it ready to review?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Deprecate Functionality to remove in pandas Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants