Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Deprecate ordered=None for CategoricalDtype #26403

Merged
merged 14 commits into from
Jul 3, 2019

Conversation

jschendel
Copy link
Member

All of the logic related to breaking changes looks to be consolidated in CategoricalDtype.update_dtype, so I'm raising the warning there when a breaking change would occur and letting it flow through from there.

Aside from CategoricalDtype.update_dtype, I think the only other place where a breaking change can manifest is in astype as described in the issue. Will need to double check the test logs to see if anything else triggered a warning though.

cc @TomAugspurger @jorisvandenbossche

@jschendel jschendel added Dtype Conversions Unexpected or buggy dtype conversions Categorical Categorical Data Type Deprecate Functionality to remove in pandas labels May 15, 2019
@jschendel jschendel added this to the 0.25.0 milestone May 15, 2019
@codecov
Copy link

codecov bot commented May 15, 2019

Codecov Report

Merging #26403 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26403      +/-   ##
==========================================
- Coverage   91.69%   91.68%   -0.01%     
==========================================
  Files         174      174              
  Lines       50743    50746       +3     
==========================================
- Hits        46527    46526       -1     
- Misses       4216     4220       +4
Flag Coverage Δ
#multiple 90.19% <100%> (ø) ⬆️
#single 41.17% <33.33%> (-0.16%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/dtypes.py 96.69% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b24fb6...adc0bca. Read the comment docs.

@codecov
Copy link

codecov bot commented May 15, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@ddec4eb). Click here to learn what that means.
The diff coverage is 28.57%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #26403   +/-   ##
=========================================
  Coverage          ?   41.69%           
=========================================
  Files             ?      174           
  Lines             ?    50760           
  Branches          ?        0           
=========================================
  Hits              ?    21162           
  Misses            ?    29598           
  Partials          ?        0
Flag Coverage Δ
#single 41.69% <28.57%> (?)
Impacted Files Coverage Δ
pandas/io/packers.py 14.57% <0%> (ø)
pandas/core/dtypes/dtypes.py 73.62% <33.33%> (ø)
pandas/core/series.py 45.76% <50%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ddec4eb...bc16e98. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a clean change!

The one case that is not covered (which I mentioned in the issue) is when passing it manually to the constructor:

CategoricalDtype(categories=[..], ordered=None)

assuming we would disallow that in the future?

msg = ("ordered=None is deprecated and will default to False "
"in a future version; ordered=True must be explicitly "
"passed in order to be retained")
warnings.warn(msg, FutureWarning, stacklevel=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should let the stacklevel fit for the astype case (although the series and index case might need a different stacklevel), as that seems the more common case compared to directly using this method?

Or, otherwise, I would certainly try to make the context clearer in this warning message: indicate that a CategoricalDtype was constructed without specifying the ordered, etc (otherwise the message might be very confusing, as it is not raised when creating it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the stacklevel so that it works for the Categorical.astype and CategoricalIndex.astype cases; all other cases look like they require a unique stacklevel. I've also updated the message itself to be more clear.

@jorisvandenbossche
Copy link
Member

[about passing ordered=None explicitly] assuming we would disallow that in the future?

Actually, we could still accept it but turn it into False?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have any tests which are producing this warning that are not caught?

@TomAugspurger
Copy link
Contributor

One slight concern, In the future, CategoricalDtype().ordered will change without warning, right? Right now it's None, in the future it'll be False.

I'm not sure how we could easily raise a warning there (on attribute access) without being very annoying.

@jorisvandenbossche
Copy link
Member

One slight concern, In the future, CategoricalDtype().ordered will change without warning, right? Right now it's None, in the future it'll be False.

One option might to be already change it now? (to prevent more users to start relying on it) We could already return False when the user asks for it and internally None was stored?

@jschendel
Copy link
Member Author

do we have any tests which are producing this warning that are not caught?

Yes, this is the reason CI failed on my first commit. The warning came from a case we weren't currently addressing, which I detailed in #26336 (comment)

The one case that is not covered (which I mentioned in the issue) is when passing it manually to the constructor

We do this a lot internally in the codebase as well, so it seems like it'd be hard to do this is a non-invasive way that doesn't break things. Will look into this in the next couple days, likewise with CategoricalDtype().ordered.

pandas/core/generic.py Outdated Show resolved Hide resolved
@jschendel jschendel force-pushed the depr-cdt-ordered-none branch from a1db25d to 060eb08 Compare May 21, 2019 20:40
@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

can you merge master

@jschendel
Copy link
Member Author

Updated to address remaining comments from the issue, as well as align on comments from the sprint.

Summary of the issue to refresh people's memory

  • Deprecated CategoricalDtype default value ordered=None in favor of ordered=False
  • Still allow ordered=None if explicitly passed, in part to maintain behavior for the string 'category'.

Summary of impacted behavior

Implementation comments
I tried to implement the changes in the least intrusive way; a warning will only be raised in scenarios where the existing behavior would silently change.

For example, with CategoricalDtype.ordered, the warning will only appear if no value for None is specified. Note that constructing a CategoricalDtype without specifying None will not warn, as the behavior may not be silently changing at this point; only upon accessing .ordered will a warning appear.

In [1]: from pandas.api.types import CategoricalDtype

In [2]: cdt = CategoricalDtype(['a', 'b'])

In [3]: cdt.ordered
/home/jeremy/code/pandas/pandas/core/dtypes/dtypes.py:589: FutureWarning: Constructing a
CategoricalDtype without specifying `ordered` will default to `ordered=False` in a future
version; `ordered=None` must be explicitly passed.
  warnings.warn(msg, FutureWarning)

Explicitly passing None will not warn when .ordered is accessed, since this behavior will still be allowed. Likewise explicitly passing True or False will not warn.

In [4]: CategoricalDtype(['a', 'b'], ordered=None).ordered is None
Out[4]: True

In [5]: CategoricalDtype(['a', 'b'], ordered=True).ordered
Out[5]: True

In [6]: CategoricalDtype(['a', 'b'], ordered=False).ordered
Out[6]: False

self.validate_ordered(ordered)

if categories is not None:
categories = self.validate_categories(categories,
fastpath=fastpath)

self._categories = categories
self._ordered = ordered
self._ordered = ordered if ordered is not sentinel else None
self._ordered_from_sentinel = ordered is sentinel
Copy link
Member Author

@jschendel jschendel Jul 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little bit gross but seemed to be the cleanest way to implement things in a non-intrusive way (i.e. only warning when behavior would silently change).

Internal code can't reference .ordered because it could trigger the warning due to under the hood operations that the user has no control over (e.g. handling 'category' as dtype), so internal code needs to reference ._ordered. Instead of special casing sentinel every time ._ordered is referenced seemed cleaner to add an additional attribute to keep track of things.

I tried to change the internals so that .ordered didn't trigger the warning, but I wasn't successfully able to do so, and my failing attempt had already gotten more complex than the .ordered --> ._ordered change, so didn't seem worthwhile to pursue further.

Note that the .ordered --> ._ordered change only applies to CategoricalDtype; this change does not appear to be needed to Categorical or CategoricalIndex, as those will always resolve to a boolean.

I'll volunteer to remove this and revert _ordered to ordered when we go through with changing the default value in the constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll volunteer to remove this and revert _ordered to ordered when we go through with changing the default value in the constructor.

Showtime!

@jorisvandenbossche
Copy link
Member

@jschendel Thanks for the detailed and clear summary!
On board with the proposed changes and implementation details

@TomAugspurger
Copy link
Contributor

Thanks for the summary in #26403 (comment). That all sounds good, haven't had a chance to look at the changes yet. Probably won't today.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some small questions

pandas/core/dtypes/dtypes.py Outdated Show resolved Hide resolved
msg = ("Constructing a CategoricalDtype without specifying "
"`ordered` will default to `ordered=False` in a future "
"version; `ordered=True` must be explicitly passed in "
"order to be retained")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explain what the use case is here? (maybe we don't really want to encourage using it ..)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some more detail to the message, but agree that we don't really want to encourage using ordered=None so I didn't specifically mention it

pandas/tests/dtypes/test_dtypes.py Show resolved Hide resolved
[a, b, c, a, b, a]
Categories (3, object): [c < b < a]

In [5]: cdt = CategoricalDtype(categories=list('cbad'))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an old whatsnew note references behavior that will be changed, so switched over to an ipython code block to keep the previous behavior static when this gets generated in the future

@@ -331,7 +331,7 @@ def __init__(self, values, categories=None, ordered=None, dtype=None,
# sanitize input
if is_categorical_dtype(values):
if dtype.categories is None:
dtype = CategoricalDtype(values.categories, dtype.ordered)
dtype = CategoricalDtype(values.categories, dtype._ordered)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we really need to user the internal one here? (and all others); I really don't want to expose that even to our code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the overview that Jeremy gave above (#26403 (comment)) and this comment for more details on why _ordered was needed: #26403 (comment)
That explains clearly the context on why this was added, and I am fine with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that's fine then, @jschendel can you create an issue for this so we don't forget at removal time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, back at work today so will create the issue later on tonight

expected_ordered = dtype.ordered

result = dtype.update_dtype(new_dtype)
# GH 26336
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give some explanation on what you are testing here (the cases)

@jreback
Copy link
Contributor

jreback commented Jul 2, 2019

this looks fine. @jschendel can you rebase.

@jreback
Copy link
Contributor

jreback commented Jul 2, 2019

actually was able to use the conflict resolver. merge on green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Deprecate Functionality to remove in pandas Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deprecate ordered=None for CategoricalDtype
5 participants