The groupby filter output differs if we run an aggregate function before #17091

Closed
arnaudlegout opened this issue Jul 27, 2017 · 5 comments · Fixed by #41674
Labels: Bug, good first issue, Groupby, Needs Tests (unit tests needed to prevent regressions)

@arnaudlegout
Contributor

Code Sample, a copy-pastable example if possible

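import pandas as pd

# Create a test DataFrame
d = pd.DataFrame({'data': range(6), 'key': list('ABCABC')})
# Group by the column 'key'
g = d.groupby('key')
# Filter with a predicate that is always True (not useful in practice, just for the example)
print(g.filter(lambda x: True))
# Run an aggregation on the same groupby object; its result is discarded
g.sum()
# Run the same filter again
print(g.filter(lambda x: True))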

Problem description

Here is the output of the above code

   data key
0     0   A
1     1   B
2     2   C
3     3   A
4     4   B
5     5   C
   data
0     0
1     1
2     2
3     3
4     4
5     5

I don't understand why the key column is in the output of the first filter call, whereas after running an aggregate function (here g.sum()) on the same groupby object, the key column disappears from the filter output. If I pass as_index=False to groupby, the column is correctly preserved (as expected).

It looks like the aggregate function somehow changes the groupby object, whereas my understanding is that each function call on a groupby object returns a new object (and does not modify the original groupby object).
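
A possible workaround on affected versions, sketched under the assumption that the leaked state lives on the groupby object itself: build a fresh groupby for each operation instead of reusing g. The helper name filter_with_fresh_groupby is made up for this illustration and is not part of pandas.

```python
import pandas as pd


def filter_with_fresh_groupby(df, key):
    """Run an aggregation and a filter on separate groupby objects.

    Hypothetical helper (not part of pandas): each call to df.groupby()
    returns a new object, so state left behind by sum() on one object
    cannot affect filter() on the other.
    """
    totals = df.groupby(key).sum()
    kept = df.groupby(key).filter(lambda group: True)
    return totals, kept


d = pd.DataFrame({'data': range(6), 'key': list('ABCABC')})
totals, kept = filter_with_fresh_groupby(d, 'key')
print(list(kept.columns))
```

This costs an extra groupby construction per call, which is usually cheap compared to the aggregation itself.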

Expected Output

   data key
0     0   A
1     1   B
2     2   C
3     3   A
4     4   B
5     5   C

   data key
0     0   A
1     1   B
2     2   C
3     3   A
4     4   B
5     5   C

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented Jul 27, 2017

Hmm...that's really odd. Surprise inplace modifications aren't nice. PR to patch this is welcome!

@gfyoung gfyoung added the Bug label Jul 27, 2017
@ghost

ghost commented Sep 21, 2017

@gfyoung and @arnaudlegout I ran into the same problem and did some investigation. It turns out that when g.sum() is called, the member _group_selection of g is modified in the function _set_group_selection, but it is never changed back.
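
To make the mechanism concrete, here is a toy model in plain Python. It is not pandas code: ToyGroupBy, _set_selection, _temporary_selection, sum_leaky and sum_clean are invented names, and the context-manager reset is only one way such a leak could be avoided, not necessarily how pandas handles it.

```python
from contextlib import contextmanager


class ToyGroupBy:
    """Toy stand-in for a groupby object; not pandas code."""

    def __init__(self, columns, key):
        self.columns = list(columns)
        self.key = key
        self._selection = None  # None means "all columns are visible"

    def _set_selection(self):
        # Mimics the reported leak: narrows the selection and never undoes it.
        self._selection = [c for c in self.columns if c != self.key]

    @contextmanager
    def _temporary_selection(self):
        # One possible remedy: restore the previous selection afterwards.
        previous = self._selection
        self._set_selection()
        try:
            yield
        finally:
            self._selection = previous

    def visible_columns(self):
        return self.columns if self._selection is None else self._selection

    def sum_leaky(self):
        # Leaves the narrowed selection behind on the object.
        self._set_selection()
        return list(self.visible_columns())

    def sum_clean(self):
        # Narrows the selection only for the duration of the call.
        with self._temporary_selection():
            return list(self.visible_columns())


leaky = ToyGroupBy(['data', 'key'], 'key')
leaky.sum_leaky()
print(leaky.visible_columns())  # the key column is no longer visible

clean = ToyGroupBy(['data', 'key'], 'key')
clean.sum_clean()
print(clean.visible_columns())  # the key column is still visible
```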

Also, test_grouper_creation_bug in file test_groupby.py is inaccurate:

        df = DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [1, 2, 3, 4, 5, 6]})
        g = df.groupby('A')
        expected = g.sum()

        g = df.groupby(pd.Grouper(key='A'))
        result = g.sum()
        assert_frame_equal(result, expected)

        result = g.apply(lambda x: x.sum())
        assert_frame_equal(result, expected)

If you do this:

        df = DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [1, 2, 3, 4, 5, 6]})
        g = df.groupby('A')
        expected = g.sum()

        g = df.groupby(pd.Grouper(key='A'))
        result = g.sum()
        assert_frame_equal(result, expected)

        df = DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [1, 2, 3, 4, 5, 6]})
        g = df.groupby('A')
        result = g.apply(lambda x: x.sum())
        assert_frame_equal(result, expected)

The test will fail; it only passes because of this bug. g.apply(lambda x: x.sum()) returns a 3x2 DataFrame if g is 3x2. This bug causes g to be 3x1, hence result turns out to have the same dimensions as expected.

Am I missing anything here? Many thanks.

@jreback
Contributor

jreback commented Sep 22, 2017

so groupers can keep state, though they really shouldn't. this might be a wider problem. it would be helpful for you to look at it.

@jreback jreback added this to the Next Major Release milestone Sep 22, 2017
@ghost

ghost commented Sep 23, 2017

I'll be happy to!

@rhshadrach
Member

rhshadrach commented Aug 26, 2020

This is essentially a duplicate of #34656, which was closed by #35314. I've confirmed that this is fixed on master. However, I think tests should be added using filter, similar to the tests that were added as part of #35314.
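
A regression test along those lines might look like the sketch below. The test name is hypothetical and the data mirrors the example from the original report; it assumes a pandas version in which the fix is already in place, and would be expected to fail on affected versions.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def test_filter_unchanged_after_aggregation():
    # Hypothetical test name; data taken from the example in this issue.
    df = pd.DataFrame({'data': range(6), 'key': list('ABCABC')})
    g = df.groupby('key')

    before = g.filter(lambda x: True)
    g.sum()  # result discarded on purpose; only a possible side effect matters
    after = g.filter(lambda x: True)

    # filter() should return the same frame whether or not an aggregation
    # ran on the same groupby object in between.
    assert_frame_equal(before, after)
    # The grouping column should still be part of the filtered output.
    assert 'key' in after.columns


test_filter_unchanged_after_aggregation()
```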

@rhshadrach rhshadrach added the good first issue and Needs Tests (unit tests needed to prevent regressions) labels Aug 26, 2020
@mroeschke mroeschke mentioned this issue May 26, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.3 May 26, 2021