Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix groupby over a CategoricalIndex in axis=1 #18525

Conversation

ekisslinger
Copy link
Contributor

@ekisslinger ekisslinger commented Nov 27, 2017

@gfyoung gfyoung added Bug Categorical Categorical Data Type Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 27, 2017
@@ -2859,7 +2859,7 @@ def is_in_obj(gpr):
else:
in_axis, name = False, None

if is_categorical_dtype(gpr) and len(gpr) != len(obj):
if is_categorical_dtype(gpr) and len(gpr) != obj.shape[axis]:
raise ValueError("Categorical dtype grouper must "
"have len(grouper) == len(data)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of the change in the check, perhaps this error message should change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll change it to "Grouper and axis must be same length".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put in the actual len(grouper) and len(axis) as well

@gfyoung
Copy link
Member

gfyoung commented Nov 27, 2017

@ekisslinger : Don't forget to add a whatsnew for 0.22.0

@ekisslinger
Copy link
Contributor Author

ekisslinger commented Nov 27, 2017

Any chance this fix can be added to 0.21.1?

@codecov
Copy link

codecov bot commented Nov 27, 2017

Codecov Report

Merging #18525 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18525      +/-   ##
==========================================
+ Coverage   91.32%   91.33%   +<.01%     
==========================================
  Files         163      164       +1     
  Lines       49798    49819      +21     
==========================================
+ Hits        45479    45503      +24     
+ Misses       4319     4316       -3
Flag Coverage Δ
#multiple 89.13% <100%> (+0.02%) ⬆️
#single 40.8% <0%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.03% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️
pandas/util/_test_decorators.py 95.23% <0%> (ø)
pandas/core/indexes/datetimes.py 95.71% <0%> (+0.09%) ⬆️
pandas/plotting/_converter.py 65.25% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 262e8ff...2a11e78. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 27, 2017

Codecov Report

Merging #18525 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18525      +/-   ##
==========================================
+ Coverage   91.32%   91.33%   +<.01%     
==========================================
  Files         163      164       +1     
  Lines       49798    49819      +21     
==========================================
+ Hits        45479    45503      +24     
+ Misses       4319     4316       -3
Flag Coverage Δ
#multiple 89.13% <100%> (+0.02%) ⬆️
#single 40.8% <0%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.03% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️
pandas/util/_test_decorators.py 95.23% <0%> (ø)
pandas/core/indexes/datetimes.py 95.71% <0%> (+0.09%) ⬆️
pandas/plotting/_converter.py 65.25% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 262e8ff...2a11e78. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 27, 2017

Codecov Report

Merging #18525 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18525      +/-   ##
==========================================
- Coverage   91.35%   91.33%   -0.02%     
==========================================
  Files         164      164              
  Lines       49802    49802              
==========================================
- Hits        45496    45487       -9     
- Misses       4306     4315       +9
Flag Coverage Δ
#multiple 89.13% <100%> (ø) ⬆️
#single 40.81% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.04% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7627cca...19f6041. Read the comment docs.

@gfyoung
Copy link
Member

gfyoung commented Nov 27, 2017

Any chance this fix can be added 0.21.1?

@jreback @jorisvandenbossche : Thoughts?

@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

will look in a bit

@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

this would be ok for 0.21.1, pls add a whatsnew note

@@ -2859,7 +2859,7 @@ def is_in_obj(gpr):
else:
in_axis, name = False, None

if is_categorical_dtype(gpr) and len(gpr) != len(obj):
if is_categorical_dtype(gpr) and len(gpr) != obj.shape[axis]:
raise ValueError("Categorical dtype grouper must "
"have len(grouper) == len(data)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put in the actual len(grouper) and len(axis) as well

@@ -191,6 +191,25 @@ def test_groupby_categorical_index(self):
[0, 1, 2, 3], levels, ordered=True), name='cats')
assert_frame_equal(result, expected)

def test_groupby_categorical_columns_index(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to pandas/tests/groupby/test_grouper.py, which is where things like this are tested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can't find that file...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, test_grouping.py (move both this version and the transposed)

@jreback
Copy link
Contributor

jreback commented Nov 27, 2017

also add the transposed version of this in your test (which works now)

@ekisslinger
Copy link
Contributor Author

I think test_groupby_categorical_index() in pandas/tests/groupby/test_categorical.py tests the transposed version (i.e. group over axis 0). Here is the relevant code:

   def test_groupby_categorical_index(self):

        s = np.random.RandomState(12345)
        levels = ['foo', 'bar', 'baz', 'qux']
        codes = s.randint(0, 4, size=20)
        cats = Categorical.from_codes(codes, levels, ordered=True)
        df = DataFrame(
            np.repeat(
                np.arange(20), 4).reshape(-1, 4), columns=list('abcd'))
        df['cats'] = cats

        # with a cat index
        result = df.set_index('cats').groupby(level=0).sum()
        expected = df[list('abcd')].groupby(cats.codes).sum()
        expected.index = CategoricalIndex(
            Categorical.from_codes(
                [0, 1, 2, 3], levels, ordered=True), name='cats')
        assert_frame_equal(result, expected)

@ekisslinger ekisslinger force-pushed the fix-groupby-over-columns-categoricalindex branch from 2a11e78 to a05147f Compare November 28, 2017 17:50
@ekisslinger
Copy link
Contributor Author

I updated the bug fix branch. The changes are:

  1. Add whatsnew entry to 0.21.1
  2. Fix error message
  3. Add multi-index version to associated unit test

@@ -137,6 +137,7 @@ Categorical
- Error messages in the testing module have been improved when items have different ``CategoricalDtype`` (:issue:`18069`)
- ``CategoricalIndex`` can now correctly take a ``pd.api.types.CategoricalDtype`` as its dtype (:issue:`18116`)
- Bug in ``Categorical.unique()`` returning read-only ``codes`` array when all categories were ``NaN`` (:issue:`18051`)
- Bug when grouping over a ``CategoricalIndex`` in axis=1 (:issue:`18432`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrame.groupby(axis=1) with a CategoricalIndex

"have len(grouper) == len(data)")
if is_categorical_dtype(gpr) and len(gpr) != obj.shape[axis]:
raise ValueError(
("Length of grouper ({0}) and axis ({1}) must be same length"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use kwargs rather than positional formatters

@@ -191,6 +191,25 @@ def test_groupby_categorical_index(self):
[0, 1, 2, 3], levels, ordered=True), name='cats')
assert_frame_equal(result, expected)

def test_groupby_categorical_columns_index(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, test_grouping.py (move both this version and the transposed)

@@ -191,6 +191,45 @@ def test_groupby_categorical_index(self):
[0, 1, 2, 3], levels, ordered=True), name='cats')
assert_frame_equal(result, expected)

def test_groupby_categorical_columns_index(self):
# GH18432
s = np.random.RandomState(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessary to set a random key here

@ekisslinger ekisslinger force-pushed the fix-groupby-over-columns-categoricalindex branch from a05147f to d7640e8 Compare November 29, 2017 00:54
@ekisslinger
Copy link
Contributor Author

i made the requested changes

columns = ['A', 'B', 'A', 'B']
categories = ['B', 'A']
cat_columns = CategoricalIndex(columns,
categories=categories,
Copy link
Contributor

@jreback jreback Nov 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change this test to instead of using random data, uses integer as the data, then directly construct the result frame

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I made the change. I think it is what you had in mind and I agree it is more straightforward this way.

closes GH18432

Add multi-index columns test to test_groupby_categorical_columns_index()

Add whatsnew for GH18432 bug fix

Fix ValueError text for GH18432 bug fix

Update whatsnew text

Use kwargs instead of positional format params

Move test_groupby_categorical_columns_index() to pandas/tests/groupby/test_grouping.py

Directly construct expected dataframe in test_groupby_categorical_index_and_columns()
@ekisslinger ekisslinger force-pushed the fix-groupby-over-columns-categoricalindex branch from d7640e8 to 19f6041 Compare November 29, 2017 15:25
@ekisslinger
Copy link
Contributor Author

In case you didn't see the last message, as it is a bit hidden, I made the last change which was to use integer instead of random data and directly construct the dataframe.

@jreback jreback added this to the 0.21.1 milestone Nov 30, 2017
@jreback jreback merged commit 5da3759 into pandas-dev:master Nov 30, 2017
@jreback
Copy link
Contributor

jreback commented Nov 30, 2017

thanks @ekisslinger

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby over a CategoricalIndex in axis=1
4 participants