BUG: Multi grouper containing a categorical not dropped from index when using groupby with as_index=False #8869

Closed
aimboden opened this issue Nov 21, 2014 · 3 comments · Fixed by #20583
Labels
Bug Categorical Categorical Data Type Groupby

Comments

@aimboden

Hello,

The following example looks like a bug: the groupers are not dropped from the index of the resulting DataFrame, even with as_index=False.

The aggregation step itself also fails (every value comes back as NaN), so there may be more to it, as this example shows.

import numpy as np
import pandas as pd
d = {'foo': [10, 8, 4, 8, 4, 1, 1], 'bar': [10, 20, 30, 40, 50, 60, 70],
     'baz': ['d', 'c', 'e', 'a', 'a', 'd', 'c']}
df = pd.DataFrame(d)
# Bin 'foo' into the two intervals (0, 5] and (5, 10]
cat = pd.cut(df['foo'], np.linspace(0, 10, 3))
df['range'] = cat
groups = df.groupby(['range', 'baz'], as_index=False)
result = groups.agg('mean')
result
              range  baz  bar  foo
range   baz
(0, 5]  a       NaN  NaN  NaN  NaN
        c       NaN  NaN  NaN  NaN
        d       NaN  NaN  NaN  NaN
        e       NaN  NaN  NaN  NaN
(5, 10] a       NaN  NaN  NaN  NaN
        c       NaN  NaN  NaN  NaN
        d       NaN  NaN  NaN  NaN
        e       NaN  NaN  NaN  NaN

Compare to the expected result:

groups2 = df.groupby(['range', 'baz'], as_index=True)
expected = groups2.agg('mean').reset_index()
expected
     range baz  bar  foo
0   (0, 5]   a   50    4
1   (0, 5]   c   70    1
2   (0, 5]   d   60    1
3   (0, 5]   e   30    4
4  (5, 10]   a   40    8
5  (5, 10]   c   20    8
6  (5, 10]   d   10   10
7  (5, 10]   e  NaN  NaN
pd.__version__
Out[181]: '0.15.1'
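
The workaround used to build the expected result above can be wrapped in a small helper. This is only a sketch: `grouped_mean_flat` is a hypothetical name, and the explicit `observed=False` argument is an assumption that applies to later pandas releases (it does not exist in 0.15.1) and is passed here so that the empty `(5, 10]` / `e` combination is kept.

```python
import numpy as np
import pandas as pd


def grouped_mean_flat(frame, keys):
    """Hypothetical helper: aggregate with the keys in the index, then flatten.

    Sidesteps as_index=False by grouping with as_index=True and calling
    reset_index() afterwards, as in the expected result above.
    """
    grouped = frame.groupby(keys, as_index=True, observed=False)
    return grouped.agg("mean").reset_index()


df = pd.DataFrame({
    "foo": [10, 8, 4, 8, 4, 1, 1],
    "bar": [10, 20, 30, 40, 50, 60, 70],
    "baz": ["d", "c", "e", "a", "a", "d", "c"],
})
df["range"] = pd.cut(df["foo"], np.linspace(0, 10, 3))

flat = grouped_mean_flat(df, ["range", "baz"])
print(flat)
```

The group keys come back as ordinary columns and the index is a plain integer range, which is what as_index=False is meant to produce.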
@jreback
Contributor

jreback commented Nov 21, 2014

This is a bug.

@behzadnouri
Contributor

FYI this function is causing the problem. It disregards the value of DataFrameGroupBy.as_index (or self.as_index inside the function code).

It is also causing other issues:

In [43]: df
Out[43]:
   jim joe  jolie
0    0   a     84
1    1   b     23
2    2   c     25

In [44]: df.groupby(['jim', 'joe']).agg('mean')
Out[44]:
         jolie
jim joe
0   a       84
1   b       23
2   c       25

In [45]: df['joe'] = df['joe'].astype('category')

In [46]: df.groupby(['jim', 'joe']).agg('mean')
Out[46]:
         jolie
jim joe
0   a       84
    b      NaN
    c      NaN
1   a      NaN
    b       23
    c      NaN
2   a      NaN
    b      NaN
    c       25

Also, because the column goes into a MultiIndex at some point, the category dtype is lost regardless of the value of as_index (does it matter?!):

In [47]: df.dtypes
Out[47]:
jim         int64
joe      category
jolie       int64
dtype: object

In [48]: df.groupby(['jim', 'joe']).agg('mean').reset_index().dtypes
Out[48]:
jim        int64
joe       object
jolie    float64
dtype: object

In [49]: df.groupby(['jim', 'joe'], as_index=False).agg('mean').dtypes
Out[49]:
jim       int64
joe      object
jolie     int64
dtype: object
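
If the categorical dtype is needed downstream, one possible way to get it back is to cast the key column after aggregating. This is a sketch under assumptions, not a fix for the underlying bug: the `observed=True` argument comes from later pandas releases rather than from this thread, and recent versions may already preserve the dtype, in which case the cast is a no-op.

```python
import pandas as pd

df = pd.DataFrame({
    "jim": [0, 1, 2],
    "joe": ["a", "b", "c"],
    "jolie": [84, 23, 25],
})
df["joe"] = df["joe"].astype("category")

out = df.groupby(["jim", "joe"], observed=True).agg("mean").reset_index()

# Re-apply the original dtype so categories (and any ordering) are restored.
out["joe"] = out["joe"].astype(df["joe"].dtype)
print(out.dtypes)
```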

@aimboden
Author

aimboden commented Dec 2, 2014

@behzadnouri Thanks for looking into it.

This behaviour (keeping the Cartesian product of the group keys) is expected when grouping by a categorical column; see #8138.

In [43]: df
Out[43]:
   jim joe  jolie
0    0   a     84
1    1   b     23
2    2   c     25

In [44]: df.groupby(['jim', 'joe']).agg('mean')
Out[44]:
         jolie
jim joe
0   a       84
1   b       23
2   c       25

In [45]: df['joe'] = df['joe'].astype('category')

In [46]: df.groupby(['jim', 'joe']).agg('mean')
Out[46]:
         jolie
jim joe
0   a       84
    b      NaN
    c      NaN
1   a      NaN
    b       23
    c      NaN
2   a      NaN
    b      NaN
    c       25
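
For comparison, here is a sketch of how the two shapes can be requested explicitly. It assumes a later pandas release in which `groupby` accepts an `observed` keyword; that keyword is not part of 0.15.1 and is not mentioned in this thread.

```python
import pandas as pd

df = pd.DataFrame({
    "jim": [0, 1, 2],
    "joe": ["a", "b", "c"],
    "jolie": [84, 23, 25],
})
df["joe"] = df["joe"].astype("category")

# observed=False keeps every (jim, joe) combination, including empty ones.
full = df.groupby(["jim", "joe"], observed=False).agg("mean")

# observed=True keeps only the combinations present in the data.
seen = df.groupby(["jim", "joe"], observed=True).agg("mean")

print(len(full), len(seen))
```

The first result has one row per combination of `jim` values and `joe` categories, with NaN for the empty groups; the second has one row per combination that actually occurs.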

As for your second point (losing the categorical dtype), I don't think it makes a difference right now, but it could bite us when/if a categorical index (#7629) is implemented. I would then expect the sort=True argument to sort the MultiIndex according to the categorical ordering, which is currently lost.
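
The expectation about sort=True can be illustrated with a categorical whose category order differs from alphabetical order. This sketch assumes a later pandas release where the key keeps its categorical dtype through the groupby and where `observed` is available; the category order `c, b, a` is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "jim": [0, 0, 0],
    "joe": ["a", "b", "c"],
    "jolie": [84, 23, 25],
})
# Deliberately non-alphabetical category order (illustrative assumption).
df["joe"] = pd.Categorical(df["joe"], categories=["c", "b", "a"], ordered=True)

means = df.groupby(["jim", "joe"], sort=True, observed=True)["jolie"].mean()

# Second level of each index entry, in the order the groups were sorted.
joe_order = [key[1] for key in means.index]
print(joe_order)
```

If the categorical ordering is respected, the groups follow the declared category order rather than the lexical order of the labels.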

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 9, 2018
@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018