Deprecate ordered=None for CategoricalDtype #26336
There is some reliance on this for Basically:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+520.g2ef50aea04'
In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)
In [3]: s = pd.Series(list('abcdaba'), dtype=cdt1)
In [4]: cdt2 = pd.api.types.CategoricalDtype(categories=list('cedafb'))
In [5]: cdt2.ordered is None
Out[5]: True
In [6]: s = s.astype(cdt2)
In [7]: s.dtype.ordered
Out[7]: True

Having |
That would indeed be a breaking change, so should we maybe see if we need to deprecate it first? (not sure it is worth it) Personally, seeing that example, I would say that's even an extra reason to change it, because I find it very confusing that doing a |
Thanks @jschendel. I agree with
How do we actually deprecate this behavior without breaking things? Changing the default in |
Change the default to some other sentinel that we can catch? ( |
Is this necessary? My understanding is that a different sentinel is used when we want to distinguish between when a user explicitly passes |
Ah, yes, you're right :) Of course, there still might be the case where the user actually passes |
Seems easiest to warn in both cases, and probably more user friendly too. Opening a PR shortly. |
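The sentinel idea and the "warn in both cases" decision discussed above can be sketched roughly as follows. This is a toy illustration only, not the code from the PR: `_NO_DEFAULT`, `resolve_ordered`, and the warning text are hypothetical names and wording chosen for this example, and the assumption is that the warning fires only when relying on the old behavior would actually keep `ordered=True`.

```python
import warnings

# Hypothetical sentinel: lets the function tell "argument omitted" apart
# from an explicit ordered=None.
_NO_DEFAULT = object()


def resolve_ordered(existing_ordered, ordered=_NO_DEFAULT):
    """Return the `ordered` flag to use after a dtype update (toy logic)."""
    omitted = ordered is _NO_DEFAULT
    if omitted or ordered is None:
        # Both cases currently inherit the existing flag, so both warn
        # when that inheritance is what keeps ordered=True.
        if existing_ordered:
            how = "omitted" if omitted else "passed as None"
            warnings.warn(
                f"`ordered` was {how}; the existing ordered=True is kept "
                "for now, but a future version would use ordered=False.",
                FutureWarning,
                stacklevel=2,
            )
        return existing_ordered
    # An explicit True/False always wins and never warns.
    return bool(ordered)
```

Because both the omitted and the explicit-`None` paths end up in the same branch, the sentinel is only needed here to word the message; dropping it and checking `ordered is None` alone would give the same return values.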
The tests revealed one additional breaking case: overriding the dtype of an existing

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+554.g3b24fb678'
In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)
In [3]: cdt2 = pd.api.types.CategoricalDtype(categories=list('cedafb'))
In [4]: cat = pd.Categorical(list('abcdaba'), dtype=cdt1)
In [5]: pd.Series(cat, dtype=cdt2).dtype
Out[5]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=True)

Interestingly, this doesn't appear to happen with other constructors:

In [6]: pd.Categorical(cat, dtype=cdt2).dtype
Out[6]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)
In [7]: pd.CategoricalIndex(cat, dtype=cdt2).dtype
Out[7]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)
In [8]: pd.Index(cat, dtype=cdt2).dtype
Out[8]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)
In [9]: pd.array(cat, dtype=cdt2).dtype
Out[9]: CategoricalDtype(categories=['c', 'e', 'd', 'a', 'f', 'b'], ordered=False)

Will update the PR accordingly. |
Looks like there's one additional complicating factor. When passing the string This gets a little messy in the

In [1]: import pandas as pd
In [2]: cdt1 = pd.api.types.CategoricalDtype(categories=list('cdab'), ordered=True)
In [3]: cat = pd.Categorical(list('abcdaba'), dtype=cdt1)
In [4]: pd.Series(cat, dtype='category').dtype
Out[4]: CategoricalDtype(categories=['c', 'd', 'a', 'b'], ordered=True)

The current implementation in the Lines 170 to 171 in f5cc078
|
Ah, yes. I actually think this was the main reason to have ordered=None: to be able to represent the meaning of the string 'category' dtype description. |
Another option might be to actually keep this Unless we actually want to deprecate that behaviour of |
I'm in favor of this, as keeping a |
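To make the "None means inherit" reading of the string 'category' concrete, here is a minimal toy model. It is not pandas' implementation: `ToyCategoricalDtype` and `update_from` are hypothetical names, and the sketch simply assumes that a `None` field in the requested dtype means "keep whatever the existing dtype has".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyCategoricalDtype:
    """Toy stand-in for a categorical dtype (hypothetical, not pandas code)."""

    categories: Optional[Tuple[str, ...]] = None  # None: inherit / infer later
    ordered: Optional[bool] = None                # None: inherit existing flag

    def update_from(self, requested: "ToyCategoricalDtype") -> "ToyCategoricalDtype":
        # Each field of the request overrides the existing value only
        # when it was explicitly given (i.e. is not None).
        categories = (
            self.categories if requested.categories is None else requested.categories
        )
        ordered = self.ordered if requested.ordered is None else requested.ordered
        return ToyCategoricalDtype(categories, ordered)


existing = ToyCategoricalDtype(tuple("cdab"), ordered=True)

# Models dtype='category': nothing specified, so everything is inherited.
same = existing.update_from(ToyCategoricalDtype())

# Models a dtype with new categories but no explicit `ordered`.
recategorized = existing.update_from(ToyCategoricalDtype(tuple("cedafb")))

# An explicit ordered=False overrides the existing flag.
unordered = existing.update_from(ToyCategoricalDtype(tuple("cedafb"), ordered=False))
```

In this model the second case is the one under debate in the thread: the request carries new categories yet silently inherits `ordered=True`, which is exactly the behavior a plain boolean default of `False` would change.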
What shall we do here? @jschendel to what state of the discussion is the PR updated? (is it ready to review?) |
xref #26327 (comment).
We currently allow CategoricalDtype.ordered to be None. It should just be a bool (default False).
categories can be None, which means "infer later on". But (IIRC) we don't do that for ordered.