Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/DEPR: deprecate Categorical take default behaviour + fix Series[categorical].take #20841

Merged
merged 12 commits into from
Apr 30, 2018

Conversation

TomAugspurger
Copy link
Contributor

closes #20664

I might be unavailable for the next few hours, so I'm just putting this up here, even though it includes changes from #20814 in the first commit. Once that is merged, this can be merged on top of master and it'll just have the categorical changes.

commit ec0cecd
Author: Tom Augspurger <[email protected]>
Date:   Fri Apr 27 06:02:48 2018 -0500

    Updates

commit 6858409
Author: Tom Augspurger <[email protected]>
Date:   Fri Apr 27 05:48:59 2018 -0500

    Added note

commit eb43fa4
Author: Tom Augspurger <[email protected]>
Date:   Thu Apr 26 20:47:35 2018 -0500

    Really truly fix it hopefully.

commit 7c4f625
Merge: 9a6c7d4 6cacdde
Author: Tom Augspurger <[email protected]>
Date:   Thu Apr 26 20:40:15 2018 -0500

    Merge remote-tracking branch 'upstream/master' into ea-take

commit 9a6c7d4
Author: Tom Augspurger <[email protected]>
Date:   Thu Apr 26 20:04:17 2018 -0500

    Doc updates

commit eecd632
Author: Tom Augspurger <[email protected]>
Date:   Thu Apr 26 15:00:00 2018 -0500

    Skip categorical take tests

commit f3b91ca
Author: Tom Augspurger <[email protected]>
Date:   Thu Apr 26 13:43:26 2018 -0500

    doc fixup

commit fbc4425
Author: Tom Augspurger <[email protected]>
Date:   Thu Apr 26 13:37:45 2018 -0500

    Updates

    * indexer -> indices
    * doc user-facing vs physical
    * assert na_cmps
    * test reindex w/ non-NA fill_value

commit 741f284
Merge: 5db6624 630ef16
Author: Tom Augspurger <[email protected]>
Date:   Thu Apr 26 07:18:32 2018 -0500

    Merge remote-tracking branch 'upstream/master' into ea-take

commit 5db6624
Author: Tom Augspurger <[email protected]>
Date:   Thu Apr 26 07:17:30 2018 -0500

    Doc and move tests

commit 74b2c09
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 21:40:18 2018 -0500

    Added verisonadded

commit fc729d6
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 15:51:27 2018 -0500

    Fixed editor

commit 1a4d987
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 15:50:48 2018 -0500

    Pass an array

commit bbcbf19
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 15:07:28 2018 -0500

    Cleanup

commit d5470a0
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 15:02:26 2018 -0500

    Fixed reorder

commit 82cad8b
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 15:00:43 2018 -0500

    Stale comment

commit c449afd
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 14:48:33 2018 -0500

    Bounds checking

commit 449983b
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 12:55:31 2018 -0500

    Linting

commit 69e7fe7
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 12:40:20 2018 -0500

    Updates

    1. Reversed order of take keywords
    2. Added to extensions API
    3. Removed default implementation

commit 05d8844
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 09:59:33 2018 -0500

    Updated docs

commit 31cd304
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 09:43:45 2018 -0500

    pep8

commit 338566f
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 09:42:28 2018 -0500

    Upcasting

commit b7ae0bc
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 09:06:59 2018 -0500

    revert combine change

commit 125ca0b
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 08:37:07 2018 -0500

    Simplify

    Upcasting is still broken

commit c721915
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 07:50:54 2018 -0500

    Removed default_fill_value

commit 37915e9
Merge: 67ba9dd 60fe82c
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 07:44:15 2018 -0500

    Merge remote-tracking branch 'upstream/master' into ea-take

commit 67ba9dd
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 07:42:54 2018 -0500

    more with default fill value

commit eba137f
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 05:59:58 2018 -0500

    More internals hacking

commit 08f2479
Author: Tom Augspurger <[email protected]>
Date:   Wed Apr 25 05:59:17 2018 -0500

    Fixup JSON take

commit 0be9ec6
Author: Tom Augspurger <[email protected]>
Date:   Tue Apr 24 18:02:13 2018 -0500

    non-internals changes

commit dacd98e
Author: Tom Augspurger <[email protected]>
Date:   Tue Apr 24 14:45:36 2018 -0500

    Moves

commit fb3c234
Author: Tom Augspurger <[email protected]>
Date:   Tue Apr 24 13:59:51 2018 -0500

    [WIP]: ExtensionArray.take default implementation

    Implements a take interface that's compatible with NumPy and optionally pandas'
    NA semantics.

    ```python
    In [1]: import pandas as pd

    In [2]: from pandas.tests.extension.decimal.array import *

    In [3]: arr = DecimalArray(['1.1', '1.2', '1.3'])

    In [4]: arr.take([0, 1, -1])
    Out[4]: DecimalArray(array(['1.1', '1.2', '1.3'], dtype=object))

    In [5]: arr.take([0, 1, -1], fill_value=float('nan'))
    Out[5]: DecimalArray(array(['1.1', '1.2', Decimal('NaN')], dtype=object))
    ```

    Closes pandas-dev#20640
We're changing how Categorical.take handles negative indices to be in
line with Series and other EAs.
@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves Categorical Categorical Data Type labels Apr 27, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 27, 2018
@codecov
Copy link

codecov bot commented Apr 27, 2018

Codecov Report

Merging #20841 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20841      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49313    49324      +11     
==========================================
+ Hits        45259    45272      +13     
+ Misses       4054     4052       -2
Flag Coverage Δ
#multiple 90.17% <100%> (ø) ⬆️
#single 41.91% <75%> (+0.02%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.61% <100%> (+0.04%) ⬆️
pandas/core/series.py 94.01% <100%> (+0.01%) ⬆️
pandas/util/testing.py 84.59% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 563a6ad...12485c5. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

OK, this is clean on top of master now.

@TomAugspurger TomAugspurger changed the title [WIP]Categorical take Categorical take Apr 27, 2018
@TomAugspurger
Copy link
Contributor Author

Didn't see any warnings in the test logs. Should be good to go.

cc @jorisvandenbossche

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Some minor comments. Maybe you can add a whatsnew notice in the bug fixes as well for the Series.take one.

@@ -856,6 +856,7 @@ Other API Changes
- Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`).
- :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`)
- A user-defined-function that is passed to :func:`Series.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, :func:`DataFrame.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here <whatsnew_0230.enhancements.window_raw>`. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`)
- The previous default behavior of negative indices in ``Categorical.take`` is deprecated. In a future version it will change from meaning missing values to meaning positional indicies from the right (:issue:`20664`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indicies -> indices

I would also add "to be consistent with Series.take" at the end

""" Take the codes by the indexer, fill with the fill_value.
def take_nd(self, indexer, allow_fill=None, fill_value=None):
"""
Return the indices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> Take elements from the Categorical ? (in any case, it are not the indices that are returned :-))

assert isna(fill_value)
if fill_value is None or isna(fill_value):
# For backwards compatability, we have to override
# any na values for `fill_value`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to explain why we have the isna(fill_value). In the past we checked for fill_value being null, and then ignoring it and using -1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But np.nan is the NA value for Categorical, so shouldn't its take method be able to check that anyhow (regardless from back compat reasons) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I understand. Will fix

@@ -753,6 +753,16 @@ def test_take():
s.take([-1, 3, 4], convert=False)


def test_take_categorical():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this test (or in addition to), I think you should be able to remove the skip from test_take_series in the categorical extension tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still fails because of #20747. The unobserved categories are still present. I've updated the skip.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, that's still there :)

@TomAugspurger
Copy link
Contributor Author

Maybe you can add a whatsnew notice in the bug fixes as well for the Series.take

Was that ever in a released version? I thought it was just master where we broke that.

@jorisvandenbossche
Copy link
Member

Was that ever in a released version? I thought it was just master where we broke that.

No, it was a regression from 0.20 -> 0.21/22

@TomAugspurger
Copy link
Contributor Author

Last failure was random. Fixed (properly this time) by d1c7e38.

Merging on green.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since u didn’t change any tests are there no warnings anywhere??

@@ -5,6 +5,18 @@
import pandas.util.testing as tm


@pytest.fixture(params=[True, False])
def allow_fill(request):
"""Boolean 'allow_fill' parameter for Categorical.take"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these can prob be moved into a higher level conftest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do when that's necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls make a conftest here, otherwise these are easily lost

# https://github.com/pandas-dev/pandas/issues/20664
# TODO: remove when the default Categorical.take behavior changes
kwargs = {'allow_fill': False}
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you changed the default? it was True before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that take=False wasn't passed through here was a regression from 0.20. The default for Categorical.take is still True (None -> warning -> True).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though since allow_fill isn't a keyword for _take we can just pass it unconditionally here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nevermind. allow_fill isn't a keyword for ndarray.take. That's why the kwargs is needed.

@TomAugspurger
Copy link
Contributor Author

since u didn’t change any tests are there no warnings anywhere??

Didn't see any warnings in the test logs. Should be good to go.

We apparently didn't test take directly.

@jorisvandenbossche jorisvandenbossche changed the title Categorical take BUG/DEPR: deprecate Categorical take default behaviour + fix Series[categorical].take Apr 28, 2018
@@ -1024,6 +1025,7 @@ Categorical
- Bug in ``Categorical.__iter__`` not converting to Python types (:issue:`19909`)
- Bug in :func:`pandas.factorize` returning the unique codes for the ``uniques``. This now returns a ``Categorical`` with the same dtype as the input (:issue:`19721`)
- Bug in :func:`pandas.factorize` including an item for missing values in the ``uniques`` return value (:issue:`19721`)
- Bug in :meth:`Series.take` with categorical data interpreting ``-1`` in `indices` as missing value markers, rather than the last element of the Series (:issue:`20664`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double backticks on indices

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single backticks for parameter names.

indexer = np.asarray(indexer, dtype=np.intp)
if allow_fill is None:
if (indexer < 0).any():
warn(_take_msg, FutureWarning, stacklevel=2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just put the warning inline here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That messes with the formatting.

allow_fill = True

if isna(fill_value):
# For categorical, any NA value is considered a user-facing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this condition tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

@@ -5,6 +5,18 @@
import pandas.util.testing as tm


@pytest.fixture(params=[True, False])
def allow_fill(request):
"""Boolean 'allow_fill' parameter for Categorical.take"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls make a conftest here, otherwise these are easily lost

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor doc comment, otherwise lgtm.

@@ -856,6 +856,7 @@ Other API Changes
- Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`).
- :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`)
- A user-defined-function that is passed to :func:`Series.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, :func:`DataFrame.rolling().aggregate() <pandas.core.window.Rolling.aggregate>`, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here <whatsnew_0230.enhancements.window_raw>`. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`)
- The previous default behavior of negative indices in ``Categorical.take`` is deprecated. In a future version it will change from meaning missing values to meaning positional indices from the right. The future behavior is consistent with :meth:`Series.take` (:issue:`20664`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be in the deprecations section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@TomAugspurger TomAugspurger mentioned this pull request Apr 30, 2018
71 tasks
@jorisvandenbossche
Copy link
Member

This was passing before, so merging.

@jorisvandenbossche jorisvandenbossche merged commit 4afc756 into pandas-dev:master Apr 30, 2018
@TomAugspurger TomAugspurger deleted the categorical-take branch April 30, 2018 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Categorical.take and Series([Categorical]).take is inconsistent with other dtypes
3 participants