-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEPR: Deprecate ordered=None for CategoricalDtype #26403
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26403 +/- ##
==========================================
- Coverage 91.69% 91.68% -0.01%
==========================================
Files 174 174
Lines 50743 50746 +3
==========================================
- Hits 46527 46526 -1
- Misses 4216 4220 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26403 +/- ##
=========================================
Coverage ? 41.69%
=========================================
Files ? 174
Lines ? 50760
Branches ? 0
=========================================
Hits ? 21162
Misses ? 29598
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a clean change!
The one case that is not covered (which I mentioned in the issue) is when passing it manually to the constructor:
CategoricalDtype(categories=[..], ordered=None)
assuming we would disallow that in the future?
pandas/core/dtypes/dtypes.py
Outdated
msg = ("ordered=None is deprecated and will default to False " | ||
"in a future version; ordered=True must be explicitly " | ||
"passed in order to be retained") | ||
warnings.warn(msg, FutureWarning, stacklevel=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should let the stacklevel fit for the astype case (although the series and index case might need a different stacklevel), as that seems the more common case compared to directly using this method?
Or, otherwise, I would certainly try to make the context clearer in this warning message: indicate that a CategoricalDtype was constructed without specifying the ordered, etc (otherwise the message might be very confusing, as it is not raised when creating it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the stacklevel
so that it works for the Categorical.astype
and CategoricalIndex.astype
cases; all other cases look like they require a unique stacklevel
. I've also updated the message itself to be more clear.
Actually, we could still accept it but turn it into False? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have any tests which are producing this warning that are not caught?
One slight concern, In the future, I'm not sure how we could easily raise a warning there (on attribute access) without being very annoying. |
One option might to be already change it now? (to prevent more users to start relying on it) We could already return False when the user asks for it and internally None was stored? |
Yes, this is the reason CI failed on my first commit. The warning came from a case we weren't currently addressing, which I detailed in #26336 (comment)
We do this a lot internally in the codebase as well, so it seems like it'd be hard to do this is a non-invasive way that doesn't break things. Will look into this in the next couple days, likewise with |
a1db25d
to
060eb08
Compare
can you merge master |
Updated to address remaining comments from the issue, as well as align on comments from the sprint. Summary of the issue to refresh people's memory
Summary of impacted behavior
Implementation comments For example, with In [1]: from pandas.api.types import CategoricalDtype
In [2]: cdt = CategoricalDtype(['a', 'b'])
In [3]: cdt.ordered
/home/jeremy/code/pandas/pandas/core/dtypes/dtypes.py:589: FutureWarning: Constructing a
CategoricalDtype without specifying `ordered` will default to `ordered=False` in a future
version; `ordered=None` must be explicitly passed.
warnings.warn(msg, FutureWarning) Explicitly passing In [4]: CategoricalDtype(['a', 'b'], ordered=None).ordered is None
Out[4]: True
In [5]: CategoricalDtype(['a', 'b'], ordered=True).ordered
Out[5]: True
In [6]: CategoricalDtype(['a', 'b'], ordered=False).ordered
Out[6]: False |
pandas/core/dtypes/dtypes.py
Outdated
self.validate_ordered(ordered) | ||
|
||
if categories is not None: | ||
categories = self.validate_categories(categories, | ||
fastpath=fastpath) | ||
|
||
self._categories = categories | ||
self._ordered = ordered | ||
self._ordered = ordered if ordered is not sentinel else None | ||
self._ordered_from_sentinel = ordered is sentinel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little bit gross but seemed to be the cleanest way to implement things in a non-intrusive way (i.e. only warning when behavior would silently change).
Internal code can't reference .ordered
because it could trigger the warning due to under the hood operations that the user has no control over (e.g. handling 'category'
as dtype), so internal code needs to reference ._ordered
. Instead of special casing sentinel
every time ._ordered
is referenced seemed cleaner to add an additional attribute to keep track of things.
I tried to change the internals so that .ordered
didn't trigger the warning, but I wasn't successfully able to do so, and my failing attempt had already gotten more complex than the .ordered
--> ._ordered
change, so didn't seem worthwhile to pursue further.
Note that the .ordered
--> ._ordered
change only applies to CategoricalDtype
; this change does not appear to be needed to Categorical
or CategoricalIndex
, as those will always resolve to a boolean.
I'll volunteer to remove this and revert _ordered
to ordered
when we go through with changing the default value in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll volunteer to remove this and revert _ordered to ordered when we go through with changing the default value in the constructor.
Showtime!
@jschendel Thanks for the detailed and clear summary! |
Thanks for the summary in #26403 (comment). That all sounds good, haven't had a chance to look at the changes yet. Probably won't today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, some small questions
pandas/core/dtypes/dtypes.py
Outdated
msg = ("Constructing a CategoricalDtype without specifying " | ||
"`ordered` will default to `ordered=False` in a future " | ||
"version; `ordered=True` must be explicitly passed in " | ||
"order to be retained") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we explain what the use case is here? (maybe we don't really want to encourage using it ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some more detail to the message, but agree that we don't really want to encourage using ordered=None
so I didn't specifically mention it
[a, b, c, a, b, a] | ||
Categories (3, object): [c < b < a] | ||
|
||
In [5]: cdt = CategoricalDtype(categories=list('cbad')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like an old whatsnew note references behavior that will be changed, so switched over to an ipython code block to keep the previous behavior static when this gets generated in the future
@@ -331,7 +331,7 @@ def __init__(self, values, categories=None, ordered=None, dtype=None, | |||
# sanitize input | |||
if is_categorical_dtype(values): | |||
if dtype.categories is None: | |||
dtype = CategoricalDtype(values.categories, dtype.ordered) | |||
dtype = CategoricalDtype(values.categories, dtype._ordered) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we really need to user the internal one here? (and all others); I really don't want to expose that even to our code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the overview that Jeremy gave above (#26403 (comment)) and this comment for more details on why _ordered
was needed: #26403 (comment)
That explains clearly the context on why this was added, and I am fine with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's fine then, @jschendel can you create an issue for this so we don't forget at removal time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, back at work today so will create the issue later on tonight
expected_ordered = dtype.ordered | ||
|
||
result = dtype.update_dtype(new_dtype) | ||
# GH 26336 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you give some explanation on what you are testing here (the cases)
this looks fine. @jschendel can you rebase. |
actually was able to use the conflict resolver. merge on green. |
git diff upstream/master -u -- "*.py" | flake8 --diff
All of the logic related to breaking changes looks to be consolidated in
CategoricalDtype.update_dtype
, so I'm raising the warning there when a breaking change would occur and letting it flow through from there.Aside from
CategoricalDtype.update_dtype
, I think the only other place where a breaking change can manifest is inastype
as described in the issue. Will need to double check the test logs to see if anything else triggered a warning though.cc @TomAugspurger @jorisvandenbossche