
ENH/BUG: add count to grouper / ensure that grouper keys are not included in the returned #7000

Merged
merged 4 commits into from
Apr 29, 2014

Conversation

jreback
Contributor

@jreback jreback commented Apr 29, 2014

closes #5610

@jreback jreback added this to the 0.14.0 milestone Apr 29, 2014
@jreback
Contributor Author

jreback commented Apr 29, 2014

@hayd

I reverted the ohlc stuff since we don't have a test for it; what was the issue again?

note that the fixes here apply only to the non-cython methods (including mean/sum, which work on mixed types); head/tail/nth are taken care of by cumcount (not 100% sure)
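The cumcount relationship mentioned here can be checked directly; a minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2], 'B': range(5)})
g = df.groupby('A')
# head(n) is equivalent to keeping rows whose within-group position
# (cumcount) is below n, which is the sense in which head/tail/nth
# are "taken care of by cumcount" above.
assert g.head(2).equals(df[g.cumcount() < 2])
```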

@jreback
Contributor Author

jreback commented Apr 29, 2014

#6594 is causing issues... maybe do that later

the grouping axis
"""
self._set_selection_from_grouper()
return self._python_agg_general(lambda x: notnull(x).sum(axis=axis)).astype('int64')

a much simpler way to solve the upcasting!
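The trick in the snippet above can be seen with plain pandas calls; a minimal sketch (the Series here is made up, not the PR's test data):

```python
import numpy as np
import pandas as pd

# Summing a boolean notnull mask counts the non-missing values and
# naturally yields an integer, so no float upcast of the data is needed.
s = pd.Series([1.0, np.nan, 3.0])
count = int(s.notnull().sum())
assert count == 2
```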

@hayd
Contributor

hayd commented Apr 29, 2014

This looks like some nice cleanup. Good call moving the ohlc thing; basically it was just to remove it / make it raise. That's a separate issue.

@jreback
Contributor Author

jreback commented Apr 29, 2014

This is from 0.13.1 (but currently works the same in master)

I think this should return a MI with first level of [0, 1] (and 2nd level as given)

for as_index=False right?

In [1]: df = DataFrame([[1, 2, 'foo'], [1, nan, 'bar',], [3, nan, 'baz']], columns=['A', 'B','C'])

In [3]: df.groupby('A',as_index=False).describe()
Out[3]: 
        A   B
count   2   1
mean    1   2
std     0 NaN
min     1   2
25%     1   2
50%     1   2
75%     1   2
max     1   2
count   1   0
mean    3 NaN
std   NaN NaN
min     3 NaN
25%     3 NaN
50%     3 NaN
75%     3 NaN
max     3 NaN

[16 rows x 2 columns]

like this?

-> result = gni.describe()
(Pdb) p expected
          A   B
0 count   2   1
  mean    1   2
  std     0 NaN
  min     1   2
  25%     1   2
  50%     1   2
  75%     1   2
  max     1   2
1 count   1   0
  mean    3 NaN
  std   NaN NaN
  min     3 NaN
  25%     3 NaN
  50%     3 NaN
  75%     3 NaN
  max     3 NaN

[16 rows x 2 columns]
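The proposed shape (integer outer level, stats as the inner level) can be built by hand with public pandas calls; a sketch using the same frame, not the PR's actual implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 'foo'], [1, np.nan, 'bar'], [3, np.nan, 'baz']],
                  columns=['A', 'B', 'C'])
# Describe each group separately, then concat with an integer outer
# level (0, 1, ...) in place of the group keys, mimicking as_index=False.
pieces = [grp.describe() for _, grp in df.groupby('A')]
expected = pd.concat(pieces, keys=range(len(pieces)))
assert expected.shape == (16, 2)
```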

@jreback
Contributor Author

jreback commented Apr 29, 2014

@hayd @jorisvandenbossche

I had to 'fix' the last one (because otherwise as_index=False is pretty much useless when you end up with a multi-index, IMHO)

this is from test_groupby_as_index_apply

In [1]:         df = DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'],
   ...:                         'user_id': [1,2,1,1,3,1],
   ...:                         'time': range(6)})

In [2]: 
In [3]: df.groupby('user_id').head()
Out[3]: 
  item_id  time  user_id
0       b     0        1
1       b     1        2
2       a     2        1
3       c     3        1
4       a     4        3
5       b     5        1

[6 rows x 3 columns]

In [4]: df.groupby('user_id',as_index=False).head()
Out[4]: 
  item_id  time  user_id
0       b     0        1
1       b     1        2
2       a     2        1
3       c     3        1
4       a     4        3
5       b     5        1

[6 rows x 3 columns]

In [5]: df.groupby('user_id').apply(lambda x: x.head(2))
Out[5]: 
          item_id  time  user_id
user_id                         
1       0       b     0        1
        2       a     2        1
2       1       b     1        2
3       4       a     4        3

[4 rows x 3 columns]

In [6]: df.groupby('user_id',as_index=False).apply(lambda x: x.head(2))
Out[6]: 
    item_id  time  user_id
0 0       b     0        1
  2       a     2        1
1 1       b     1        2
2 4       a     4        3

[4 rows x 3 columns]

@jreback
Contributor Author

jreback commented Apr 29, 2014

sorry... error in the above (the apply should not be returning the user_id column)

this is a rabbit hole!

@hayd
Contributor

hayd commented Apr 29, 2014

I was about to say "gaaargh", was staring at that example... no idea what should be correct behaviour!

...I'm sticking with "gaaargh"

@hayd
Contributor

hayd commented Apr 29, 2014

@jreback head is not an aggregation, so it always ignores as_index... (update: oh dear, I see the weirdness is with apply) update2: this is a super edge case where the function returns the same column as the index name... so I think the apply head thing is ok/"correct" actually...

"gaaargh" is to the describe behaviour.
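The claim that head ignores as_index can be checked directly; a minimal sketch with the frame from the earlier example:

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'user_id': [1, 2, 1, 1, 3, 1],
                   'time': range(6)})
# head is a filter, not an aggregation: it keeps original rows (with
# their original index and all columns) and ignores as_index entirely.
a = df.groupby('user_id').head(2)
b = df.groupby('user_id', as_index=False).head(2)
assert a.equals(b)
```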

@jreback
Contributor Author

jreback commented Apr 29, 2014

So as a corollary of having apply NOT return the grouped column (if it's named)

the following is also true (in 0.13.1, this returned the A column as well)

In [7]: df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': 'a b c d'.split()})

In [8]: grouper = df['A'].apply(lambda x: x % 2)

In [9]: grouped = df.groupby(grouper)

In [10]: grouped.filter(lambda x: x['A'].sum() > 10)
Out[10]: 
   B
1  b
2  c

[2 rows x 1 columns]

@jreback
Contributor Author

jreback commented Apr 29, 2014

So this is another 'edgeish' case, and fixing is related to the above.

In [1]:         df = DataFrame({'foo1' : ['one', 'two', 'two', 'three', 'one', 'two'],
   ...:                         'foo2' : np.random.randn(6)})

In [2]: df
Out[2]: 
    foo1      foo2
0    one  1.006666
1    two  0.002063
2    two  1.507785
3  three  1.865921
4    one  0.141202
5    two -1.079792

[6 rows x 2 columns]

In [3]: df.groupby('foo1').mean()
Out[3]: 
           foo2
foo1           
one    0.573934
three  1.865921
two    0.143352

[3 rows x 1 columns]

In [4]: df.groupby('foo1').apply(lambda x: x.mean())
Out[4]: 
           foo2
foo1           
one    0.573934
three  1.865921
two    0.143352

[3 rows x 1 columns]

In [6]: df.groupby('foo1',as_index=False).apply(lambda x: x.mean())
Out[6]: 
       foo2
0  0.573934
1  1.865921
2  0.143352

[3 rows x 1 columns]

In [7]: df.groupby('foo1',as_index=False).mean()
Out[7]: 
    foo1      foo2
0    one  0.573934
1  three  1.865921
2    two  0.143352

[3 rows x 2 columns]
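For a true aggregation like mean, the as_index=False output shown above is just the indexed result with the index reset; a sketch of that equivalence (same made-up frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo1': ['one', 'two', 'two', 'three', 'one', 'two'],
                   'foo2': np.random.randn(6)})
# For aggregations, as_index=False is equivalent to aggregating with
# the group keys as the index and then calling reset_index.
a = df.groupby('foo1', as_index=False).mean()
b = df.groupby('foo1').mean().reset_index()
assert a.equals(b)
```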

@jreback
Contributor Author

jreback commented Apr 29, 2014

All that said (in my last 2 comments)....

I don't think I'm going to do this last apply consistency fix; it's simply too hard ATM.

there are still some cases around returning the group-by column

e.g. if I have Series([1,1,1,1], index=[0,1,2,3], name='foo'), then this will remove the 'foo' column from the output (if you group a frame by it).

too complicated

@jreback
Contributor Author

jreback commented Apr 29, 2014

ok...going to merge....

@hayd you are on!

jreback added a commit that referenced this pull request Apr 29, 2014
ENH/BUG: add count to grouper / ensure that grouper keys are not included in the returned
@jreback jreback merged commit d2ead2c into pandas-dev:master Apr 29, 2014
@hayd
Contributor

hayd commented Apr 29, 2014

the filter thing was a bug in 0.13.1 which we fixed IIRC

the rest looks ok?

@jreback
Contributor Author

jreback commented Apr 29, 2014

is filter supposed to return the grouped column? (it's easy to make it compute with it); but should it return it?

@hayd
Contributor

hayd commented Apr 29, 2014

yes it should return it, as filter is not an aggregation (like we updated head to be).

IIRC we tested that it respected "sub-group" (??) g[['B']].filter and g['B'].filter ? :s
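Under that rule, filter should keep every original column, including the one computed on; a quick sketch of the intended behavior, using the example frame from the earlier comment:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 12, 12, 1], 'B': list('abcd')})
grouper = df['A'].apply(lambda x: x % 2)
# filter is not an aggregation: it returns a subset of the original
# rows with all columns intact, including the one used to group.
out = df.groupby(grouper).filter(lambda x: x['A'].sum() > 10)
assert list(out.columns) == ['A', 'B']
```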

@jreback
Contributor Author

jreback commented Apr 29, 2014

ok...it passes all that

well, check out v0.14.0 once this builds

everything makes sense, however sometimes an apply can return the column; I'll open an issue

@cpcloud
Member

cpcloud commented Apr 29, 2014

i realize i'm a little late to the game, but is there any reason group_count in lib.pyx isn't being used to perform count?

@jreback
Contributor Author

jreback commented Apr 29, 2014

hah!

I think that is some old code, not being used anywhere

so do you want to:

- delete group_count from lib.pyx
- add group_count_<dtype> to generated.py (should be simple) and replace count?

@jreback
Contributor Author

jreback commented Apr 29, 2014

#7003

related is: #4095

@cpcloud
Member

cpcloud commented Apr 29, 2014

sure. just ran into a perf issue trying to count 700k groups :(

@jreback
Contributor Author

jreback commented Apr 29, 2014

hah!

only tricky thing is this needs a different template (use the take template) as it can accept ALL dtypes

and possibly some logic to get around the is_numeric stuff

so we probably need a test that goes through a bunch of dtypes and counts stuff
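For reference, the kind of vectorized per-group count the Cython routine would provide can be sketched in NumPy (the labels/values here are made-up inputs, not the generated code):

```python
import numpy as np

# Given integer group labels, count non-null values per group with
# bincount instead of looping over groups in Python.
values = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
labels = np.array([0, 0, 1, 1, 1])
counts = np.bincount(labels[~np.isnan(values)], minlength=labels.max() + 1)
assert counts.tolist() == [1, 2]
```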

@jorisvandenbossche
Member

Looking at this discussion with a lot of Aaarghs (and also seeing some of the other issues with groupby), I was thinking: should we try to write down some 'design document' where we describe the 'rules'?

E.g. the "we regard head as a filtering-like function (not an aggregation), and those always ignore as_index / return the grouped values in their original column" rule as used above.

This could maybe clarify some things for ourselves, and could be used as a reference for future PRs. (Or could it also turn into a rabbit hole ...)

-> moved to #5755

Successfully merging this pull request may close these issues.

GroupBy.count() returns the grouping column as both index and column
4 participants