Deprecate ordered=None for CategoricalDtype #26336

TomAugspurger · 2019-05-10T15:59:56Z

We currently allow CategoricalDtype.ordered to be None. It should just be a bool (default False).

In [8]: CategoricalDtype(categories=None, ordered=None).ordered
# None

categories can be None, which means "infer later on". But (IIRC) we don't do that for ordered.

The text was updated successfully, but these errors were encountered:

jschendel · 2019-05-10T22:24:40Z

But (IIRC) we don't do that for ordered.

There is some reliance on this for astype, where we allow users to update categories by specifying them in a CategoricalDtype without needing to worry about specifying the same ordered (i.e. leaving it as None).

Basically:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+520.g2ef50aea04'

In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)

In [3]: s = pd.Series(list('abcdaba'), dtype=cdt1)

In [4]: cdt2 = pd.api.types.CategoricalDtype(categories=list('cedafb'))

In [5]: cdt2.ordered is None
Out[5]: True

In [6]: s = s.astype(cdt2)

In [7]: s.dtype.ordered
Out[7]: True

Having ordered default to False would change the above behavior. Not that big of a deal though, and probably better to force users to be explicit about this.

jorisvandenbossche · 2019-05-11T12:51:48Z

Having ordered default to False would change the above behavior.

That would indeed be a breaking change, so should we maybe see if we need to deprecate it first? (not sure it is worth it)

Personally, seeing that example, I would say that's even an extra reason to change it, because I find it very confusing that doing a astype(CategoricalDtype(..)) does not follow the default but keeps the original data ordered attribute.

TomAugspurger · 2019-05-13T15:15:38Z

Thanks @jschendel. I agree with

probably better to force users to be explicit about this.

How do we actually deprecate this behavior without breaking things? Changing the default in CategoricalDtype.__init__ to False is a breaking change...

jorisvandenbossche · 2019-05-13T15:36:42Z

How do we actually deprecate this behavior without breaking things?

Change the default to some other sentinel that we can catch? (_default = object()) And then only raise the warning when there would actually be a change in behaviour (i.e. when he dtype is used to convert data where the original ordered would not match with the new default)
Will not be the most clean code though ...

jschendel · 2019-05-15T06:04:56Z

Change the default to some other sentinel that we can catch? (_default = object())

Is this necessary? My understanding is that a different sentinel is used when we want to distinguish between when a user explicitly passes None versus the default None being passed. In this case I don't think it matters and we'd want to warn in either case. Or am I overlooking a different usecase?

jorisvandenbossche · 2019-05-15T06:08:10Z

Ah, yes, you're right :) Of course, there still might be the case where the user actually passes None (CategoricalDtype(categories=[...], ordered=None), which is kind of pointless, but it currently allowed ... So the question is if we should warn for that case as well, or just hard break that in the future.

jschendel · 2019-05-15T06:14:29Z

Seems easiest to warn in both cases, and probably more user friendly too. Opening a PR shortly.

jschendel · 2019-05-21T03:38:00Z

The tests revealed one additional breaking case: overriding the dtype of an existing Categorical when constructing a Series:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+554.g3b24fb678'

In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)

In [3]: cdt2 = pd.api.types.CategoricalDtype(categories=list('cedafb'))

In [4]: cat = pd.Categorical(list('abcdaba'), dtype=cdt1)

In [5]: pd.Series(cat, dtype=cdt2).dtype
Out[5]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=True)

Interestingly, this doesn't appear to happen with other constructors:

In [6]: pd.Categorical(cat, dtype=cdt2).dtype
Out[6]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

In [7]: pd.CategoricalIndex(cat, dtype=cdt2).dtype
Out[7]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

In [8]: pd.Index(cat, dtype=cdt2).dtype
Out[8]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

In [9]: pd.array(cat, dtype=cdt2).dtype
Out[9]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

Will update the PR accordingly.

jschendel · 2019-05-21T03:59:18Z

Looks like there's one additional complicating factor. When passing the string 'category' as the dtype, would we want to maintain the existing ordered, even if it's True? I think we would want to maintain the existing ordered, as we've made special cases for the string 'category' elsewhere in the codebase.

This gets a little messy in the Series constructor case, as described above, e.g.

In [1]: import pandas as pd

In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)

In [3]: cat = pd.Categorical(list('abcdaba'), dtype=cdt1)

In [4]: pd.Series(cat, dtype='category').dtype
Out[4]: CategoricalDtype(categories=['c', 'd', 'a', 'b'], ordered=True)

The current implementation in the Series constructor immediately converts 'category' to CategoricalDtype(None, None), meaning it relies on the ordered=None behavior, so some adjustments will need to be made to handle this:

pandas/pandas/core/series.py

Lines 170 to 171 in f5cc078

    
           if dtype is not None: 
        
               dtype = self._validate_dtype(dtype)

jorisvandenbossche · 2019-05-21T06:11:30Z

Looks like there's one additional complicating factor. When passing the string 'category' as the dtype, would we want to maintain the existing ordered, even if it's True?

Ah, yes. I actually think this was the main reason to have this ordered=None, to be able to represent the meaning of the string 'category' dtype description ..

jorisvandenbossche · 2019-05-21T06:17:42Z

Another option might be to actually keep this ordered=None behaviour, but still deprecate it as default? Meaning that if you do CategoricalDtype(['a', 'b', 'c']) this will actually be with ordered=False, but a 'category' string would still be converted to CategoricalDtype(ordered=None) ..
Not the nicest solution, but I don't really see how we would otherwise keep this behaviour of dtype='category' (apart from special casing that string in several places, which is also not nice)

Unless we actually want to deprecate that behaviour of dtype='category' of preserving the orderedness of the input as well, but I don't think we want that?

jschendel · 2019-05-21T16:46:30Z

Another option might be to actually keep this ordered=None behaviour, but still deprecate it as default?

I'm in favor of this, as keeping a CategoricalDtype representation of 'category' seems like the cleaner solution. In my PR I've introduced special casing in a few places to maintain the behavior of 'category', and there's still some more special casing that needs to be added as I currently have some test failures related to this.

jorisvandenbossche · 2019-06-07T07:09:38Z

What shall we do here? @jschendel to what state of the discussion is the PR updated? (is it ready to review?)

TomAugspurger mentioned this issue May 10, 2019

Add type hints to dtypes/dtypes.py (CategoricalDtype) #26327

Merged

TomAugspurger added Categorical Categorical Data Type Deprecate Functionality to remove in pandas Dtype Conversions Unexpected or buggy dtype conversions labels May 10, 2019

jorisvandenbossche added this to the 0.25.0 milestone May 11, 2019

TomAugspurger added good first issue and removed good first issue labels May 13, 2019

jschendel mentioned this issue May 15, 2019

DEPR: Deprecate ordered=None for CategoricalDtype #26403

Merged

4 tasks

jschendel closed this as completed in #26403 Jul 3, 2019

eriknw mentioned this issue Aug 9, 2022

[BUG] categoricals with .ordered is None in cudf and dask_cudf rapidsai/cudf#11487

Closed

pschumm mentioned this issue Oct 10, 2023

Add pattern supporting use of value labels, categoricals and factors frictionlessdata/datapackage#844

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deprecate ordered=None for CategoricalDtype #26336

Deprecate ordered=None for CategoricalDtype #26336

TomAugspurger commented May 10, 2019

jschendel commented May 10, 2019 •

edited

Loading

jorisvandenbossche commented May 11, 2019

TomAugspurger commented May 13, 2019

jorisvandenbossche commented May 13, 2019

jschendel commented May 15, 2019

jorisvandenbossche commented May 15, 2019

jschendel commented May 15, 2019

jschendel commented May 21, 2019

jschendel commented May 21, 2019

jorisvandenbossche commented May 21, 2019

jorisvandenbossche commented May 21, 2019

jschendel commented May 21, 2019

jorisvandenbossche commented Jun 7, 2019 •

edited

Loading

Deprecate ordered=None for CategoricalDtype #26336

Deprecate ordered=None for CategoricalDtype #26336

Comments

TomAugspurger commented May 10, 2019

jschendel commented May 10, 2019 • edited Loading

jorisvandenbossche commented May 11, 2019

TomAugspurger commented May 13, 2019

jorisvandenbossche commented May 13, 2019

jschendel commented May 15, 2019

jorisvandenbossche commented May 15, 2019

jschendel commented May 15, 2019

jschendel commented May 21, 2019

jschendel commented May 21, 2019

jorisvandenbossche commented May 21, 2019

jorisvandenbossche commented May 21, 2019

jschendel commented May 21, 2019

jorisvandenbossche commented Jun 7, 2019 • edited Loading

jschendel commented May 10, 2019 •

edited

Loading

jorisvandenbossche commented Jun 7, 2019 •

edited

Loading