BUG: groupby with categorical drops empty groups when aggregating over a series #8870

Closed
2 tasks
aimboden opened this issue Nov 21, 2014 · 4 comments · Fixed by #30646
Labels
Categorical Categorical Data Type good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions
Comments

@aimboden

  • Series groupby excluding NaN groups with Categorical (DataFrame DOES include)
  • sorting via a returned Interval-like-Index (string based)

Hello,

When grouping a DataFrame over more than one column including a categorical, the empty groups are kept in the aggregation result. A test for this behaviour was introduced in #8138.

However, when performing aggregation on only one column of the DataFrame, the empty groups are dropped. This seems inconsistent to me and I guess that it's an edge case that wasn't thought of at the time.

import numpy as np
import pandas as pd

d = {'foo': [10, 8, 4, 1], 'bar': [10, 20, 30, 40],
     'baz': ['d', 'c', 'd', 'c']}
df = pd.DataFrame(d)
cat = pd.cut(df['foo'], np.linspace(0, 20, 5))
df['range'] = cat
groups = df.groupby(['range', 'baz'], as_index=True, sort=True)

# Expected result, fixed as part of #8138
fixed = groups.agg('mean')

# Inconsistent behaviour with series
inconsistent = groups['foo'].agg('mean')

# Expected result
expected = fixed['foo']
fixed
                bar  foo
range    baz
(0, 5]   c       40    1
         d       30    4
(10, 15] c      NaN  NaN
         d      NaN  NaN
(15, 20] c      NaN  NaN
         d      NaN  NaN
(5, 10]  c       20    8
         d       10   10

inconsistent
range    baz
(0, 5]   c       1
         d       4
(5, 10]  c       8
         d      10

expected
range    baz
(0, 5]   c        1
         d        4
(10, 15] c      NaN
         d      NaN
(15, 20] c      NaN
         d      NaN
(5, 10]  c        8
         d       10

Note the strange ordering of the categorical index. I would expect sort=True to sort by categorical level rather than in lexical order.

Also note that using as_index=False fails due to #8869.

@jreback
Contributor

jreback commented Nov 21, 2014

This needs an IntervalIndex, see #8707; these are just strings for now.
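
To illustrate the difference between string labels and interval-based categories, here is a minimal sketch. It assumes a later pandas release in which pd.cut returns an ordered categorical whose categories form an IntervalIndex (this was not the case when the issue was opened). The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Same binning as in the report: four bins of width 5 between 0 and 20.
cat = pd.cut(pd.Series([10, 8, 4, 1]), np.linspace(0, 20, 5))

# Assumption: in this pandas version the categories are interval objects
# held in an IntervalIndex, ordered by their bounds.
categories = cat.cat.categories
interval_order = [str(c) for c in categories]

# Treating the same labels as plain strings gives a lexical ordering,
# which places the "(10..." and "(15..." bins before the "(5..." bin.
lexical_order = sorted(interval_order)

print(interval_order)
print(lexical_order)
```

With interval categories, sorting by category follows the numeric bounds, whereas sorting the string labels reproduces the ordering seen in the report.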

@jreback jreback added API Design Bug Groupby Categorical Categorical Data Type Interval Interval data type labels Nov 21, 2014
@jreback jreback added this to the 0.16.0 milestone Nov 21, 2014
@aimboden
Author

I agree that the sorting part requires an IntervalIndex, but what about the dropping of empty groups? This was taken care of in the case of DataFrames, but still happens for Series.

@jreback
Contributor

jreback commented Nov 21, 2014

hmm, this should be consistent with DataFrame (either both include the NaN groups, as DataFrame does now, or both exclude them; there was some debate over this).
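
The include-or-exclude choice can be sketched as follows. This assumes a later pandas release that exposes the `observed` keyword on groupby (it did not exist when this comment was written); the helper name `group_counts` is hypothetical. The point is only that DataFrame and Series aggregation should return the same number of groups under either setting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [10, 8, 4, 1], 'bar': [10, 20, 30, 40],
                   'baz': ['d', 'c', 'd', 'c']})
df['range'] = pd.cut(df['foo'], np.linspace(0, 20, 5))


def group_counts(observed):
    """Return (DataFrame result length, Series result length).

    Hypothetical helper: `observed` is passed straight to groupby, which is
    an assumption about the installed pandas version.
    """
    groups = df.groupby(['range', 'baz'], observed=observed)
    frame_result = groups.agg('mean')
    series_result = groups['foo'].agg('mean')
    return len(frame_result), len(series_result)


# observed=False keeps empty category combinations (filled with NaN);
# observed=True drops them. Either way the two lengths should match.
include_empty = group_counts(observed=False)
exclude_empty = group_counts(observed=True)
print(include_empty, exclude_empty)
```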

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jorisvandenbossche jorisvandenbossche removed API Design Interval Interval data type labels Jul 29, 2015
@mroeschke
Member

This looks fixed on master. Could use a test.

In [80]: inconsistent
Out[80]:
range         baz
(0.0, 5.0]    c       1.0
              d       4.0
(5.0, 10.0]   c       8.0
              d      10.0
(10.0, 15.0]  c       NaN
              d       NaN
(15.0, 20.0]  c       NaN
              d       NaN
Name: foo, dtype: float64

In [81]: expected
Out[81]:
range         baz
(0.0, 5.0]    c       1.0
              d       4.0
(5.0, 10.0]   c       8.0
              d      10.0
(10.0, 15.0]  c       NaN
              d       NaN
(15.0, 20.0]  c       NaN
              d       NaN
Name: foo, dtype: float64

In [82]: pd.__version__
Out[82]: '0.26.0.dev0+490.g9cfb8b55b'
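
A regression test along the lines requested above might look like the sketch below. It is not the test that was eventually merged; the function name is hypothetical, and passing observed=False explicitly is an assumption made so that the example does not depend on the default of the installed pandas version.

```python
import numpy as np
import pandas as pd


def test_series_groupby_categorical_keeps_empty_groups():
    # Hypothetical test name; data taken from the original report.
    df = pd.DataFrame({'foo': [10, 8, 4, 1], 'bar': [10, 20, 30, 40],
                       'baz': ['d', 'c', 'd', 'c']})
    df['range'] = pd.cut(df['foo'], np.linspace(0, 20, 5))

    # Assumption: observed=False is given explicitly to request empty groups.
    groups = df.groupby(['range', 'baz'], as_index=True, sort=True,
                        observed=False)

    result = groups['foo'].agg('mean')
    expected = groups.agg('mean')['foo']

    # Series aggregation must match the column taken from the
    # DataFrame aggregation, including the empty (NaN) groups.
    pd.testing.assert_series_equal(result, expected)
    assert result.isna().sum() == 4
    return result


result = test_series_groupby_categorical_keeps_empty_groups()
```

Comparing against the DataFrame result, rather than a hand-built expected Series, keeps the sketch focused on the consistency that the issue is about.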

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Categorical Categorical Data Type Groupby labels Oct 6, 2019
@jbrockmendel jbrockmendel added Categorical Categorical Data Type Groupby labels Oct 16, 2019
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Jan 3, 2020
6 participants