The groupby filter output differs if we run an aggregate function before #17091

arnaudlegout · 2017-07-27T06:21:06Z

Code Sample, a copy-pastable example if possible

import pandas as pd

# I create a test DataFrame
d = pd.DataFrame({'data': range(6), 'key': list('ABCABC')})
# I groupby the column 'key'
g = d.groupby('key')
# I filter with always True (not that useful, just for the example)
print(g.filter(lambda x: True))
g.sum()
print(g.filter(lambda x: True))

Problem description

Here is the output of the above code

   data key
0     0   A
1     1   B
2     2   C
3     3   A
4     4   B
5     5   C
   data
0     0
1     1
2     2
3     3
4     4
5     5

I don't understand why the column key appears in the output of the first filter call but disappears once an aggregate function (here g.sum()) has been run on the same groupby object. If I pass as_index=False to groupby, the column is preserved in both calls (as expected).

It looks like the aggregate function somehow modifies the groupby object, whereas my understanding is that each method call returns a new object and does not modify the original groupby object.

Expected Output

   data key
0     0   A
1     1   B
2     2   C
3     3   A
4     4   B
5     5   C

   data key
0     0   A
1     1   B
2     2   C
3     3   A
4     4   B
5     5   C

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

gfyoung · 2017-07-27T06:49:38Z

Hmm...that's really odd. Surprise inplace modifications aren't nice. PR to patch this is welcome!

ghost · 2017-09-21T12:58:15Z

@gfyoung and @arnaudlegout I ran into the same problem and did some investigation. It turns out that calling g.sum() modifies the member _group_selection of g inside the function _set_group_selection, and it is never reset afterwards.
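The mechanism described above can be illustrated with a toy class. This is only a sketch of the pattern, not pandas code: `ToyGroupBy`, `_selection`, `_selection_context`, `sum_leaky`, and `sum_safe` are hypothetical names made up for the illustration. The leaky method sets the selection and leaves it set, while the safe method restores the previous state through a context manager.

```python
from contextlib import contextmanager


class ToyGroupBy:
    """Hypothetical stand-in for a groupby object; not pandas internals."""

    def __init__(self, columns, key):
        self.columns = list(columns)
        self.key = key
        self._selection = None  # None means "all columns are visible"

    def _set_selection(self):
        # Exclude the key column, as an aggregation would.
        self._selection = [c for c in self.columns if c != self.key]

    @contextmanager
    def _selection_context(self):
        # Remember the previous state and restore it on exit.
        previous = self._selection
        self._set_selection()
        try:
            yield
        finally:
            self._selection = previous

    def visible_columns(self):
        return self.columns if self._selection is None else self._selection

    def sum_leaky(self):
        # Mirrors the reported behaviour: state is changed and never reset.
        self._set_selection()
        return self.visible_columns()

    def sum_safe(self):
        # The selection only applies for the duration of the call.
        with self._selection_context():
            return self.visible_columns()


g = ToyGroupBy(["data", "key"], "key")
g.sum_safe()
print(g.visible_columns())   # key column still visible
g.sum_leaky()
print(g.visible_columns())   # key column has disappeared
```

After `sum_safe()` a later call still sees both columns; after `sum_leaky()` the key column stays hidden, which is the same symptom as the second `filter` call in the report.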

Also, test_grouper_creation_bug in file test_groupby.py is inaccurate:

        df = DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [1, 2, 3, 4, 5, 6]})
        g = df.groupby('A')
        expected = g.sum()

        g = df.groupby(pd.Grouper(key='A'))
        result = g.sum()
        assert_frame_equal(result, expected)

        result = g.apply(lambda x: x.sum())
        assert_frame_equal(result, expected)

If you do this:

        df = DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [1, 2, 3, 4, 5, 6]})
        g = df.groupby('A')
        expected = g.sum()

        g = df.groupby(pd.Grouper(key='A'))
        result = g.sum()
        assert_frame_equal(result, expected)

        df = DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [1, 2, 3, 4, 5, 6]})
        g = df.groupby('A')
        result = g.apply(lambda x: x.sum())
        assert_frame_equal(result, expected)

The test will fail. It only passes because of this bug. g.apply(lambda x: x.sum()) returns a 3x2 data frame if g is 3x2. This bug causes g to be 3x1, so result ends up with the same dimensions as expected.

Am I missing anything here? Many thanks.

jreback · 2017-09-22T13:49:37Z

so groupers can keep state, though they really shouldn't. this might be a wider problem. it would be helpful if you could look into it.

ghost · 2017-09-23T00:58:34Z

I'll be happy to!

rhshadrach · 2020-08-26T04:21:22Z

This is essentially a duplicate of #34656, which was closed by #35314. I've confirmed that this is fixed on master. However, I think tests should be added using filter, similar to the tests that were added as part of #35314.
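A regression test along these lines might look like the sketch below. It reuses the example from the original report; the test name is hypothetical and this is not necessarily the test that was eventually added. It assumes a pandas version in which the bug is fixed.

```python
import pandas as pd


def test_filter_unchanged_after_aggregation():
    # Same data as in the original report.
    df = pd.DataFrame({"data": range(6), "key": list("ABCABC")})
    g = df.groupby("key")

    before = g.filter(lambda x: True)
    g.sum()  # an aggregation must not alter the state of the groupby object
    after = g.filter(lambda x: True)

    # Both filter calls should return the same frame, key column included.
    pd.testing.assert_frame_equal(before, after)
    assert list(after.columns) == ["data", "key"]


test_filter_unchanged_after_aggregation()
```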

gfyoung added the Groupby label Jul 27, 2017

gfyoung added the Bug label Jul 27, 2017

jreback added Difficulty Intermediate labels Sep 22, 2017

jreback added this to the Next Major Release milestone Sep 22, 2017

WillAyd mentioned this issue Oct 7, 2019

BUG: Groupby selection context not being properly reset #28541

Closed

5 tasks

jbrockmendel removed Effort Medium labels Oct 21, 2019

rhshadrach added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Aug 26, 2020

mroeschke mentioned this issue May 26, 2021

TST: Old Issues #41674

Merged

8 tasks

jreback modified the milestones: Contributions Welcome, 1.3 May 26, 2021

jreback closed this as completed in #41674 May 26, 2021
