Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupBy.count() returns the grouping column as both index and column #5610

Closed
jorisvandenbossche opened this issue Nov 28, 2013 · 17 comments · Fixed by #7000
Closed

GroupBy.count() returns the grouping column as both index and column #5610

jorisvandenbossche opened this issue Nov 28, 2013 · 17 comments · Fixed by #7000
Milestone

Comments

@jorisvandenbossche
Copy link
Member

GroupBy.count() (with the default as_index=True) return the grouping column both as index and as column, while other methods as first and sum keep it only as the index (which is most logical I think). This seems a minor inconsistency to me:

In [41]: data = pd.DataFrame({'name' : ['a', 'a', 'b', 'd'], 'counts' : [3,4,3,2]})
In [42]: data
Out[42]:
   counts name
0       3    a
1       4    a
2       3    b
3       2    d

In [43]: g = data.groupby('name')
In [45]: g.count()
Out[45]:
      counts  name
name
a          2     2
b          1     1
d          1     1

In [46]: g.first()
Out[46]:
      counts
name
a          3
b          3
d          2

In [47]: g.sum()
Out[47]:
      counts
name
a          7
b          3
d          2
@jreback
Copy link
Contributor

jreback commented Mar 22, 2014

hmm....

@hayd @TomAugspurger

these look right? are first/last just different?

In [2]: df= pd.DataFrame({'name' : ['a', 'a', 'b', 'd'], 'counts' : [3,4,3,2]})

In [3]: g = df.groupby('name')

In [4]: g.count()
Out[4]: 
      counts  name
name              
a          2     2
b          1     1
d          1     1

[3 rows x 2 columns]

In [5]: g.first()
Out[5]: 
      counts
name        
a          3
b          3
d          2

[3 rows x 1 columns]

In [6]: g.head()
Out[6]: 
   counts name
0       3    a
1       4    a
2       3    b
3       2    d

[4 rows x 2 columns]

In [7]: g.tail()
Out[7]: 
   counts name
0       3    a
1       4    a
2       3    b
3       2    d

[4 rows x 2 columns]

In [8]: g.last()
Out[8]: 
      counts
name        
a          4
b          3
d          2

[3 rows x 1 columns]

@jreback jreback added this to the 0.14.0 milestone Mar 22, 2014
@hayd
Copy link
Contributor

hayd commented Mar 22, 2014

IMO first should do the same as g.nth(0) and last as g.nth(-1), since as mentioned in the larger PR they are not aggregations (I think breaking these are on the roadmap for 0.14?). First and last are implemented as aggs atm.

Original issue is that count includes name, I also don't think it should. Will have a look at this, may be simple fix. Related to cumsum etc. including the grouped by columns (so may be a generic fix in agg).

@jreback
Copy link
Contributor

jreback commented Mar 22, 2014

hmm...though you already redefined first/last to be nth's...oh well...that sounds right to me

@hayd
Copy link
Contributor

hayd commented Mar 24, 2014

duh, count's applying when it should be agg-ing. PR shortly.

@jreback
Copy link
Contributor

jreback commented Apr 28, 2014

http://stackoverflow.com/questions/23352418/unexpected-behavior-in-pandas-mad-with-groupby/23352706#23352706

mad is the same problem (as are prob the non-cythonized calls).

@jreback
Copy link
Contributor

jreback commented Apr 28, 2014

@hayd you are doing a PR for this one? or is it already out there but missing a reference?

@hayd
Copy link
Contributor

hayd commented Apr 29, 2014

apparently I did on my machine, rebased and pushed will see if it passes... mañana

@jreback jreback added Bug and removed API Design labels Apr 29, 2014
@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

gr8!

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

These seem completely wrong (I haven't changed anything yet to exlucde the 'A' column
when testing these)

In [1]: df = DataFrame([[1, 2, 'foo'], [1, nan, 'bar',], [3, nan, 'baz']], columns=['A', 'B','C'])

In [2]: df
Out[2]: 
   A   B    C
0  1   2  foo
1  1 NaN  bar
2  3 NaN  baz

[3 rows x 3 columns]

In [3]: df.groupby('A').shift(1)
Out[3]: 
    A   B    C
0 NaN NaN  NaN
1   1   2  foo
2 NaN NaN  NaN

[3 rows x 3 columns]

In [4]: df.groupby('A').fillna(-1)
Out[4]: 
   A  B    C
0  1  2  foo
1  1 -1  bar
2  3 -1  baz

[3 rows x 3 columns]

In [5]: df.groupby('A').apply(lambda x: x.fillna(-1))
Out[5]: 
   A  B    C
0  1  2  foo
1  1 -1  bar
2  3 -1  baz

[3 rows x 3 columns]

@jorisvandenbossche
Copy link
Member Author

What is wrong with these?

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

they are not grouping (indices should be 1, 3)

@jorisvandenbossche
Copy link
Member Author

but it are non-aggregating functions?

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

hmm...you are right, then these should raise (rather than silenty 'work').

@hayd
Copy link
Contributor

hayd commented Apr 29, 2014

Is fillna in the white list?

I think shift is correct here though, it is shifting within the groups.

@jorisvandenbossche
Copy link
Member Author

Why should they raise? Eg df.groupby('A').shift(1) seems correct to me (shift within each group).
It's a bit like the transform functions?

The use of fillna is less clear, but it seems this is added explicitely in the whitelist.

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

I guess these do effectively a transform by default (which is ok)

@JayDDD
Copy link

JayDDD commented Aug 26, 2016

How could I count the number of 'a' under the column name? I only need this number. what command should I use?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants