Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Allow ordered=None in CategoricalDtype #18889

Merged
merged 4 commits into from
Feb 10, 2018

Conversation

jschendel
Copy link
Member

For equality comparisons with ordered=None, I essentially treated it as if it where ordered=False:

  • CDT(['a', 'b'], None) == CDT(['a', 'b'], False) --> True
  • CDT(['a', 'b'], None) == CDT(['b', 'a'], False) --> True
  • CDT(['a', 'b'], None) == CDT(['a', 'b'], True) --> False

This maintains existing comparison behavior when ordered is not specified:

  • CDT(['a', 'b'], False) == CDT(['a', 'b']) --> True
  • CDT(['a', 'b'], True) == CDT(['a', 'b']) --> False

I didn't make any code modifications in regards to hashing, so CDT(*, None) will have the same hash as CDT(*, False). This seems to be consistent with how equality is treated. Makes the logic implementing equality nicer too, since the case when both dtypes are unordered relies on hashes.

@jreback jreback added the Categorical Categorical Data Type label Dec 21, 2017
@@ -361,11 +359,16 @@ def _update_dtype(self, dtype):
'got {dtype!r}').format(dtype=dtype)
raise ValueError(msg)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate, I don't think we need to have private methods on CDT. e.g. validate_ordered and update_dtype (maybe more). As this is used in other contexts. so a de-privatize PR should be great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually can you change this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done for _update_dtype, _validate_ordered, and _validate_categories

@@ -198,6 +198,7 @@ Other API Changes
- Rearranged the order of keyword arguments in :func:`read_excel()` to align with :func:`read_csv()` (:issue:`16672`)
- :func:`pandas.merge` now raises a ``ValueError`` when trying to merge on incompatible data types (:issue:`9780`)
- :func:`wide_to_long` previously kept numeric-like suffixes as ``object`` dtype. Now they are cast to numeric if possible (:issue:`17627`)
- The default value of the ``ordered`` parameter for :class:`~pandas.api.types.CategoricalDtype` has changed from ``False`` to ``None``. Behavior should remain consistent for downstream objects, such as :class:`Categorical` (:issue:`18790`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase on master. can you move to 0.23 (docs were renamed), prob easiest to just check this file from master and past in new one

@codecov
Copy link

codecov bot commented Dec 22, 2017

Codecov Report

Merging #18889 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18889      +/-   ##
==========================================
+ Coverage   91.62%   91.62%   +<.01%     
==========================================
  Files         150      150              
  Lines       48798    48798              
==========================================
+ Hits        44709    44711       +2     
+ Misses       4089     4087       -2
Flag Coverage Δ
#multiple 89.99% <100%> (ø) ⬆️
#single 41.74% <90%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 94.87% <100%> (ø) ⬆️
pandas/core/indexes/category.py 97.26% <100%> (ø) ⬆️
pandas/core/dtypes/dtypes.py 96.08% <100%> (ø) ⬆️
pandas/util/testing.py 83.85% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6485a36...007340d. Read the comment docs.

@jschendel jschendel force-pushed the cdt-ordered-none branch 2 times, most recently from e312d24 to cf6107e Compare January 1, 2018 18:46
return other.ordered and self.categories.equals(other.categories)
elif other.ordered:
return False
elif self.ordered or other.ordered:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can ordered be None here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, e.g. CDT(list('abcd'), None) == CDT(list('dcba'), None), but None is fine here since it is Falsey, and this is just meant to catch cases where at least one is True

@@ -361,11 +359,16 @@ def _update_dtype(self, dtype):
'got {dtype!r}').format(dtype=dtype)
raise ValueError(msg)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually can you change this?

c1 = CategoricalDtype(list('abc'), ordered1)
c2 = CategoricalDtype(list('abc'), ordered2)
result = c1 == c2
expected = (ordered1 is ordered2) or not any([ordered1, ordered2])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

give a comment here about what is expected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment a few lines up, and simplified the logic for expected

@@ -208,6 +208,7 @@ Other API Changes
- In :func:`read_excel`, the ``comment`` argument is now exposed as a named parameter (:issue:`18735`)
- Rearranged the order of keyword arguments in :func:`read_excel()` to align with :func:`read_csv()` (:issue:`16672`)
- The options ``html.border`` and ``mode.use_inf_as_null`` were deprecated in prior versions, these will now show ``FutureWarning`` rather than a ``DeprecationWarning`` (:issue:`19003`)
- The default value of the ``ordered`` parameter for :class:`~pandas.api.types.CategoricalDtype` has changed from ``False`` to ``None``. Behavior should remain consistent for downstream objects, such as :class:`Categorical` (:issue:`18790`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does your other PR mention why we made this change (updating just categories and not changing ordered)?

Copy link
Member Author

@jschendel jschendel Feb 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so; updated the whatsnew entry to specify this

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments inline.

@jschendel
Copy link
Member Author

@jreback @TomAugspurger : Sorry, forgot about this! Made the requested changes.

@@ -438,6 +438,7 @@ Other API Changes
- Set operations (union, difference...) on :class:`IntervalIndex` with incompatible index types will now raise a ``TypeError`` rather than a ``ValueError`` (:issue:`19329`)
- :class:`DateOffset` objects render more simply, e.g. "<DateOffset: days=1>" instead of "<DateOffset: kwds={'days': 1}>" (:issue:`19403`)
- :func:`pandas.merge` provides a more informative error message when trying to merge on timezone-aware and timezone-naive columns (:issue:`15800`)
- The default value of the ``ordered`` parameter for :class:`~pandas.api.types.CategoricalDtype` has changed from ``False`` to ``None`` to allow updating of ``categories`` without impacting ``ordered``. Behavior should remain consistent for downstream objects, such as :class:`Categorical` (:issue:`18790`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a subsection

elif other.ordered:
return False
elif self.ordered or other.ordered:
# at least one ordered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expand on the comment here (kind of like you did in the tests below), to enumerate the cases for ordered == ordered (and when they match / don't match)

@jreback jreback added this to the 0.23.0 milestone Feb 10, 2018
@jreback jreback merged commit fe972fb into pandas-dev:master Feb 10, 2018
@jreback
Copy link
Contributor

jreback commented Feb 10, 2018

thanks @jschendel as always very nice patches! keep coming!

IF you are interested in working on some of the new interval indexing things (the _new test files), would be great.

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
@jschendel jschendel deleted the cdt-ordered-none branch September 24, 2018 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DISC: Behavior of .astype('category') on existing categorical data
3 participants