
ENH/BUG: add count to grouper / ensure that grouper keys are not included in the returned #7000

Merged
merged 4 commits into from
Apr 29, 2014

Conversation

jreback
Contributor

@jreback jreback commented Apr 29, 2014

closes #5610

@jreback jreback added this to the 0.14.0 milestone Apr 29, 2014
@jreback
Contributor Author

jreback commented Apr 29, 2014

@hayd

I reverted the ohlc stuff since we don't have a test for it; what was the issue again?

note that the fixes here apply only to the non-cython methods (including mean/sum, which work on mixed types); head/tail/nth are taken care of by cumcount (not 100% sure)
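The cumcount relationship mentioned here can be checked directly; a minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2], 'B': range(5)})
g = df.groupby('A')
# head(n) is equivalent to keeping rows whose within-group position
# (cumcount) is below n, which is the sense in which head/tail/nth
# are "taken care of by cumcount" above.
assert g.head(2).equals(df[g.cumcount() < 2])
```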

@jreback
Contributor Author

jreback commented Apr 29, 2014

#6594 is causing issues... maybe do that later

the grouping axis
"""
self._set_selection_from_grouper()
return self._python_agg_general(lambda x: notnull(x).sum(axis=axis)).astype('int64')

a much simpler way to solve the upcasting!
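The trick in the snippet above can be seen with plain pandas calls; a minimal sketch (the Series here is made up, not the PR's test data):

```python
import numpy as np
import pandas as pd

# Summing a boolean notnull mask counts the non-missing values and
# naturally yields an integer, so no float upcast of the data is needed.
s = pd.Series([1.0, np.nan, 3.0])
count = int(s.notnull().sum())
assert count == 2
```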

@hayd
Contributor

hayd commented Apr 29, 2014

This looks like some nice cleanup. Good call moving the ohlc thing; basically it was just to remove it / make it raise. That's a separate issue.

@jreback
Contributor Author

jreback commented Apr 29, 2014

This is from 0.13.1 (but currently works the same in master)

I think this should return a MI with first level of [0, 1] (and 2nd level as given)

for as_index=False right?

In [1]: df = DataFrame([[1, 2, 'foo'], [1, nan, 'bar',], [3, nan, 'baz']], columns=['A', 'B','C'])

In [3]: df.groupby('A',as_index=False).describe()
Out[3]: 
        A   B
count   2   1
mean    1   2
std     0 NaN
min     1   2
25%     1   2
50%     1   2
75%     1   2
max     1   2
count   1   0
mean    3 NaN
std   NaN NaN
min     3 NaN
25%     3 NaN
50%     3 NaN
75%     3 NaN
max     3 NaN

[16 rows x 2 columns]

like this?

-> result = gni.describe()
(Pdb) p expected
          A   B
0 count   2   1
  mean    1   2
  std     0 NaN
  min     1   2
  25%     1   2
  50%     1   2
  75%     1   2
  max     1   2
1 count   1   0
  mean    3 NaN
  std   NaN NaN
  min     3 NaN
  25%     3 NaN
  50%     3 NaN
  75%     3 NaN
  max     3 NaN

[16 rows x 2 columns]
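The proposed shape (integer outer level, stats as the inner level) can be built by hand with public pandas calls; a sketch using the same frame, not the PR's actual implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 'foo'], [1, np.nan, 'bar'], [3, np.nan, 'baz']],
                  columns=['A', 'B', 'C'])
# Describe each group separately, then concat with an integer outer
# level (0, 1, ...) in place of the group keys, mimicking as_index=False.
pieces = [grp.describe() for _, grp in df.groupby('A')]
expected = pd.concat(pieces, keys=range(len(pieces)))
assert expected.shape == (16, 2)
```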

@jreback
Contributor Author

jreback commented Apr 29, 2014

@hayd @jorisvandenbossche

I had to 'fix' the last one (because otherwise as_index=False is pretty much useless when you end up with a multi-index, IMHO)

this is from test_groupby_as_index_apply

In [1]:         df = DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'],
   ...:                         'user_id': [1,2,1,1,3,1],
   ...:                         'time': range(6)})

In [2]: 
In [3]: df.groupby('user_id').head()
Out[3]: 
  item_id  time  user_id
0       b     0        1
1       b     1        2
2       a     2        1
3       c     3        1
4       a     4        3
5       b     5        1

[6 rows x 3 columns]

In [4]: df.groupby('user_id',as_index=False).head()
Out[4]: 
  item_id  time  user_id
0       b     0        1
1       b     1        2
2       a     2        1
3       c     3        1
4       a     4        3
5       b     5        1

[6 rows x 3 columns]

In [5]: df.groupby('user_id').apply(lambda x: x.head(2))
Out[5]: 
          item_id  time  user_id
user_id                         
1       0       b     0        1
        2       a     2        1
2       1       b     1        2
3       4       a     4        3

[4 rows x 3 columns]

In [6]: df.groupby('user_id',as_index=False).apply(lambda x: x.head(2))
Out[6]: 
    item_id  time  user_id
0 0       b     0        1
  2       a     2        1
1 1       b     1        2
2 4       a     4        3

[4 rows x 3 columns]

@jreback
Contributor Author

jreback commented Apr 29, 2014

sorry... error in the above (the apply should not be returning the user_id column)

this is a rabbit hole!

@hayd
Contributor

hayd commented Apr 29, 2014

I was about to say "gaaargh", was staring at that example... no idea what should be correct behaviour!

...I'm sticking with "gaaargh"

@hayd
Contributor

hayd commented Apr 29, 2014

@jreback head is not an aggregation, so it always ignores as_index... (update: oh dear, I see the weirdness is with apply) update2: this is a super edge case where the function returns the same column as the index name... so I think the apply head thing is ok/"correct" actually...

"gaaargh" is to the describe behaviour.
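The claim that head ignores as_index can be checked directly; a minimal sketch with the frame from the earlier example:

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'user_id': [1, 2, 1, 1, 3, 1],
                   'time': range(6)})
# head is a filter, not an aggregation: it keeps original rows (with
# their original index and all columns) and ignores as_index entirely.
a = df.groupby('user_id').head(2)
b = df.groupby('user_id', as_index=False).head(2)
assert a.equals(b)
```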

@jreback
Contributor Author

jreback commented Apr 29, 2014

So as a corollary of having apply NOT return the grouped column (if it's named)

the following is also true (in 0.13.1, this returned the A column as well)

In [7]: df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': 'a b c d'.split()})

In [8]: grouper = df['A'].apply(lambda x: x % 2)

In [9]: grouped = df.groupby(grouper)

In [10]: grouped.filter(lambda x: x['A'].sum() > 10)
Out[10]: 
   B
1  b
2  c

[2 rows x 1 columns]

@jreback
Contributor Author

jreback commented Apr 29, 2014

So this is another 'edgeish' case, and fixing is related to the above.

In [1]:         df = DataFrame({'foo1' : ['one', 'two', 'two', 'three', 'one', 'two'],
   ...:                         'foo2' : np.random.randn(6)})

In [2]: df
Out[2]: 
    foo1      foo2
0    one  1.006666
1    two  0.002063
2    two  1.507785
3  three  1.865921
4    one  0.141202
5    two -1.079792

[6 rows x 2 columns]

In [3]: df.groupby('foo1').mean()
Out[3]: 
           foo2
foo1           
one    0.573934
three  1.865921
two    0.143352

[3 rows x 1 columns]

In [4]: df.groupby('foo1').apply(lambda x: x.mean())
Out[4]: 
           foo2
foo1           
one    0.573934
three  1.865921
two    0.143352

[3 rows x 1 columns]

In [6]: df.groupby('foo1',as_index=False).apply(lambda x: x.mean())
Out[6]: 
       foo2
0  0.573934
1  1.865921
2  0.143352

[3 rows x 1 columns]

In [7]: df.groupby('foo1',as_index=False).mean()
Out[7]: 
    foo1      foo2
0    one  0.573934
1  three  1.865921
2    two  0.143352

[3 rows x 2 columns]
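For a true aggregation like mean, the as_index=False output shown above is just the indexed result with the index reset; a sketch of that equivalence (same made-up frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo1': ['one', 'two', 'two', 'three', 'one', 'two'],
                   'foo2': np.random.randn(6)})
# For aggregations, as_index=False is equivalent to aggregating with
# the group keys as the index and then calling reset_index.
a = df.groupby('foo1', as_index=False).mean()
b = df.groupby('foo1').mean().reset_index()
assert a.equals(b)
```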

@jreback
Contributor Author

jreback commented Apr 29, 2014

All that said (in my last 2 comments)....

I don't think I'm going to do this last apply consistency fix; it's simply too hard ATM.

there are still some cases around returning the group-by column

e.g. if I have Series([1,1,1,1], index=[0,1,2,3], name='foo'), then this will remove the 'foo' column from the output (if you group a frame by it).

too complicated

@jreback
Contributor Author

jreback commented Apr 29, 2014

ok...going to merge....

@hayd you are on!

jreback added a commit that referenced this pull request Apr 29, 2014
ENH/BUG: add count to grouper / ensure that grouper keys are not included in the returned
@jreback jreback merged commit d2ead2c into pandas-dev:master Apr 29, 2014
@hayd
Contributor

hayd commented Apr 29, 2014

the filter thing was a bug in 0.13.1 which we fixed IIRC

the rest looks ok?

@jreback
Contributor Author

jreback commented Apr 29, 2014

is filter supposed to return the grouped column? (it's easy to make it compute with it); but should it return it?

@hayd
Contributor

hayd commented Apr 29, 2014

yes it should return it, as filter is not an aggregation (like we updated head to be).

IIRC we tested that it respected "sub-group" (??) g[['B']].filter and g['B'].filter ? :s
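Under that rule, filter should keep every original column, including the one computed on; a quick sketch of the intended behavior, using the example frame from the earlier comment:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': list('abcd')})
grouper = df['A'].apply(lambda x: x % 2)
# filter is not an aggregation: it returns a subset of the original
# rows with all columns intact, including the one used to group.
out = df.groupby(grouper).filter(lambda x: x['A'].sum() > 10)
assert list(out.columns) == ['A', 'B']
```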

@jreback
Contributor Author

jreback commented Apr 29, 2014

ok...it passes all that

well, check out v0.14.0 once this builds

everything makes sense, however sometimes an apply can return the column; I'll open an issue

@cpcloud
Member

cpcloud commented Apr 29, 2014

i realize i'm a little late to the game, but is there any reason group_count in lib.pyx isn't being used to perform count?

@jreback
Contributor Author

jreback commented Apr 29, 2014

hah!

I think that is some old code, not being used anywhere

so do you want to:

- delete group_count from lib.pyx
- add group_count_<dtype> to generated.py (should be simple) and replace count?

@jreback
Contributor Author

jreback commented Apr 29, 2014

#7003

related is: #4095

@cpcloud
Member

cpcloud commented Apr 29, 2014

sure. just ran into a perf issue trying to count 700k groups :(

@jreback
Contributor Author

jreback commented Apr 29, 2014

hah!

only tricky thing is this needs a different template (use the take template) as it can accept ALL dtypes

and possibly some logic to get around the is_numeric stuff

so we probably need a test that goes through a bunch of dtypes and counts stuff
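For reference, the kind of vectorized per-group count the Cython routine would provide can be sketched in NumPy (the labels/values here are made-up inputs, not the generated code):

```python
import numpy as np

# Given integer group labels, count non-null values per group with
# bincount instead of looping over groups in Python.
values = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
labels = np.array([0, 0, 1, 1, 1])
counts = np.bincount(labels[~np.isnan(values)], minlength=labels.max() + 1)
assert counts.tolist() == [1, 2]
```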

@jorisvandenbossche
Member

Looking at this discussion with a lot of Aaarghs (and also seeing some of the other issues with groupby), I was thinking: should we try to write down some 'design document' where we describe the 'rules'?

E.g. the "we regard head as a filtering-like function (not an aggregation), and those always ignore as_index / return the grouped values in their original column" rule as used above.

This could maybe clarify some things for ourselves, and could be used as a reference for future PRs. (Or could it also turn into a rabbit hole ...)

-> moved to #5755

Successfully merging this pull request may close these issues.

GroupBy.count() returns the grouping column as both index and column
4 participants