ENH: Enabled skipna argument on groupby reduction ops #15675 #58844

andremcorreia · 2024-05-27T14:40:23Z

closes ENH: enable skipna on groupby reduction ops #15675
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file if fixing a bug or adding a new feature.

Added a skipna argument to the groupby reduction ops for consistency with the Series and DataFrame variants:

sum,
prod,
min,
max,
mean,
median,
var,
std,
sem

Added new relevant tests, updated api tests and whatsnew
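For context, here is a minimal sketch of the Series behaviour that the new argument is meant to mirror. Since this PR was not merged, the sketch does not pass skipna to a groupby method; it emulates the intended per-group result with SeriesGroupBy.apply. The frame, the column names and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the PR): group "a" contains a missing value.
df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "values": [1.0, np.nan, 2.0, 3.0]}
)

# Series reductions already accept skipna.
series_default = df["values"].sum()             # missing values are skipped
series_strict = df["values"].sum(skipna=False)  # any missing value gives NaN

grouped = df.groupby("key")["values"]

# Existing groupby behaviour: missing values are always skipped.
default_gb = grouped.sum()

# Emulation of the proposed skipna=False: run the Series reduction per group.
emulated = grouped.apply(lambda s: s.sum(skipna=False))
```

With skipna=False, a group that contains a missing value reduces to a missing result, while groups without missing values are unaffected.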


rhshadrach

Only took a quick look, but overall this is looking good. Can you also add tests for EAs (the nullable and pyarrow dtypes)?

pandas/_libs/groupby.pyx

pandas/core/_numba/executor.py

pandas/core/_numba/kernels/sum_.py

pandas/core/resample.py

pandas/tests/groupby/test_numba.py

pandas/tests/groupby/test_reductions.py


andremcorreia · 2024-06-11T02:12:24Z

Hi, we refactored the tests as requested and were working on adding the EAs. We found and fixed a few sneaky edge case bugs with these, but we ran into a problem with dtype=pd.ArrowDtype(pa.int64()).

The current implementation with EAs creates a typed NumPy array and no NA value can be directly used in place for integers, so we don't really see a path forward without considerable changes for this particular dtype.

We could add arrow floats easily, but we felt like only having a few arrow dtypes supported doesn't make much sense.

How should we proceed?

@rhshadrach


rhshadrach · 2024-06-12T21:21:53Z

> Since the current implementation with arrows requires creating a typed numpy array and no NA value can be directly used in place for integers, we don't really see a path forward without considerable changes for this particular dtype.

The groupby methods implemented in Cython use mask and result_mask arguments for this purpose. This indicates where NA values are in the input and output respectively, so that they don't need to be in the NumPy array. If this isn't clear I can flesh it out some more, just ask.
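To make the mask / result_mask idea concrete, here is a simplified pure-Python sketch of a grouped sum. It is not the pandas Cython kernel: the function name, its signature and the loop are assumptions made for illustration. The point is only that missing inputs and missing outputs are tracked in boolean arrays, so the integer array never needs to hold an NA value.

```python
import numpy as np


def group_sum_with_mask(values, labels, ngroups, mask, skipna=True):
    """Hypothetical helper: sum `values` per group, tracking NA via masks.

    mask[i] is True where the input value is missing.
    The returned result_mask[g] is True where the output for group g is missing.
    """
    out = np.zeros(ngroups, dtype=values.dtype)
    result_mask = np.zeros(ngroups, dtype=bool)
    for i in range(len(values)):
        group = labels[i]
        if group < 0:
            # Negative labels mark rows that belong to no group.
            continue
        if mask[i]:
            if not skipna:
                # One missing input makes the whole group result missing.
                result_mask[group] = True
            continue
        out[group] += values[i]
    return out, result_mask


values = np.array([1, 2, 3, 4], dtype=np.int64)
labels = np.array([0, 0, 1, 1])
mask = np.array([False, True, False, False])

sums_skip, na_skip = group_sum_with_mask(values, labels, 2, mask, skipna=True)
sums_strict, na_strict = group_sum_with_mask(values, labels, 2, mask, skipna=False)
```

The integer buffer stays a plain int64 array in both calls; only result_mask differs, and a caller would combine the two arrays into a nullable result.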

I can also open up a PR into your branch if you want some assistance with the EAs here.

rhshadrach

Once EAs are implemented, we'll also want a test for them in tests/extension/base/reduce.py.

pandas/core/groupby/ops.py


tiago-firmino · 2024-06-14T23:03:04Z

Hello, after some rethinking about how we were handling EAs, we believe we've got it done correctly now; we were overthinking it before. However, we noticed the relevant array types tend to use pd.NA whilst the Series variants we are using to compare for our expected values are inferring np.nan.
We would like to know what the expected value should be in the EA case: np.nan or pd.NA?

Our way of testing it was as such:


values = pd.array([-1.0, 1.2, -1.1, 1.5, np.nan, 1.0], dtype="Float64")
df = DataFrame({"key": [1, 1, 1, 2, 2, 2], "values": Series(values)})
gb = df.groupby("key")
result_cython = getattr(gb, reduction_method)(skipna=False)
expected = gb.apply(
    lambda x: getattr(x, reduction_method)(skipna=False), include_groups=False
)
tm.assert_frame_equal(result_cython, expected, check_exact=False, check_dtype=False)

And the AssertionError we get is:

[-0.9000000000000001, <NA>]
E   Length: 2, dtype: Float64
E   [right]: [-0.9000000000000001, nan]
E   At positional index 1, first diff: <NA> != nan

In relation to the tests in tests/extension/base/reduce.py, we would appreciate some help with understanding how the tests work, how they are called and where the arguments come from. We understood they are the foundation for other tests used in the tests/extension directory, but some fields left us confused.

rhshadrach · 2024-06-15T11:01:28Z

> tiago-firmino force-pushed the add_skipna_on_groupby_ops_pr branch from 558bc25 to 2856c6d

@tiago-firmino - each time you force push, reviewers must review the entire PR as history could be changing. On small PRs, this isn't much of a problem, but this is not a small PR. Can I ask that you no longer force push here?

rhshadrach · 2024-06-15T11:12:43Z

> However, we noticed the relevant array types tend to use pd.NA whilst the Series variants we are using to compare for our expected values are inferring np.nan.

Indeed, apply here looks to be producing the wrong result.
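As a small illustration of the two missing-value markers in question, the sketch below reduces a masked Float64 Series and a plain float64 Series with skipna=False. The variable names are illustrative, and the last line is only one assumed way of moving between the two representations, not something taken from this PR.

```python
import numpy as np
import pandas as pd

# Nullable (masked) dtype: missing entries are tracked by a mask and shown as <NA>.
masked = pd.Series(pd.array([1.5, pd.NA, 1.0], dtype="Float64"))

# Plain NumPy dtype: missing entries are stored as NaN.
plain = pd.Series([1.5, np.nan, 1.0], dtype="float64")

masked_total = masked.sum(skipna=False)  # the masked reduction returns pd.NA
plain_total = plain.sum(skipna=False)    # the NumPy-backed reduction returns NaN

# Converting the masked data to a NumPy float array turns <NA> into NaN.
as_float = masked.to_numpy(dtype="float64", na_value=np.nan)
```

A comparison that expects the masked marker therefore fails against a result that went through a float64 path, which matches the `<NA> != nan` difference reported above.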

> In relation to the tests in tests/extension/base/reduce.py, we would appreciate some help with understanding how the tests work, how they are called and where the arguments come from.

For this PR, the relevant tests are in tests/extension/base/reduce.py in the class BaseReduceTests. In tests/extension/base/__init__.py, the class ExtensionTests is a subclass of this, so it inherits its methods. pytest runs files such as tests/extension/test_arrow.py, which contains the class TestArrowArray(base.ExtensionTests). This class has all the methods of BaseReduceTests, so the tests are run. Within test_arrow.py are fixtures such as data, which set up data specifically for the pyarrow tests. Likewise, in test_masked.py, the fixture data sets up data specifically for the NumPy-nullable tests.

tiago-firmino · 2024-06-15T14:32:14Z

> tiago-firmino force-pushed the add_skipna_on_groupby_ops_pr branch from 558bc25 to 2856c6d
>
> @tiago-firmino - each time you force push, reviewers must review the entire PR as history could be changing. On small PRs, this isn't much of a problem, but this is not a small PR. Can I ask that you no longer force push here?

I'm really sorry, I was not aware of that and will no longer do it.


andremcorreia · 2024-07-04T17:02:10Z

Hello,
We've had a bit less free time lately, hence the delay, but we've run into some problems that we're not sure how to address.
The main issue is with the prod function using int8, int16, and int32. Locally, we can't replicate the errors that show up in the pipeline, but we were able to identify that the wrong value is in the expected output, not in our result. However, after spending a long time looking at this issue, neither of us has been able to find a solution.
Do you have any suggestions on how we could tackle this?

rhshadrach · 2024-07-14T11:40:50Z

@andremcorreia - I should be able to take a look in the next few days.

rhshadrach · 2024-07-21T12:31:43Z

@andremcorreia

> but we were able to identify that the wrong value is in the expected output, not in our result.

It looks to me like the result is overflowing in 64-bit precision, the expected in 32-bit precision. On Windows, this will be fixed by NumPy 2.0. In any case, this isn't an issue with this PR. Can you add the following to the corresponding test:

if op_name == "prod" and skipna and data.dtype.itemsize < 8 and np.intp().itemsize < 8:
    pytest.xfail(reason=f"{op_name} with itemsize {data.dtype.itemsize} overflows")
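A rough sketch of why the accumulator width matters, using plain NumPy rather than the test in question. The data (the integers 1 to 13) is chosen only because its product fits in 64 bits but not in 32 bits; the variable names are illustrative.

```python
import math

import numpy as np

# Small integers stored in a narrow dtype, as in the int8/int16/int32 cases.
values = np.arange(1, 14, dtype=np.int8)  # 1, 2, ..., 13

# The same product accumulated at two different widths.
prod_32 = np.prod(values, dtype=np.int32)  # wraps around: 13! exceeds the int32 range
prod_64 = np.prod(values, dtype=np.int64)  # fits: 13! is well below the int64 maximum

# Pointer-sized integer width in bytes; this is the quantity the suggested
# xfail condition compares against 8.
intp_bytes = np.intp().itemsize
```

When one side of a comparison accumulates in 32 bits and the other in 64 bits, the two disagree as soon as the product leaves the 32-bit range, even though neither computation raises an error.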


github-actions · 2024-09-04T00:06:47Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2024-09-09T17:50:04Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

ENH: Enabled skipna argument on groupby reduction ops (pandas-dev#15675)

e13428b

Added a skipna argurment to the groupby reduction ops: sum, prod, min, max, mean, median, var, std and sem Added relevant tests Updated whatsnew to reflect changes Co-authored-by: Tiago Firmino <[email protected]>

andremcorreia requested review from rhshadrach and WillAyd as code owners May 27, 2024 14:40

tiago-firmino and others added 2 commits May 27, 2024 18:21

FIX: fixed pipeline issues related to docs and window tests

8099d84

Co-authored-by: André Correia <[email protected]>

Fix: pre-commit

8f61fda

rhshadrach requested changes May 29, 2024


mroeschke added the Groupby label May 31, 2024

andremcorreia and others added 2 commits June 2, 2024 20:39

Reworked sugestions

2518696

Co-authored-by: Tiago Firmino <[email protected]>

Reworked documentation

5cd994c

Co-authored-by: André Correia <[email protected]>

tiago-firmino force-pushed the add_skipna_on_groupby_ops_pr branch from fa11256 to 5cd994c June 2, 2024 19:44

tiago-firmino and others added 2 commits June 2, 2024 21:01

FIX: resample redefinition

8ae0caf

Co-authored-by: André Correia <[email protected]>

FIX: Small tweaks in docs

5e3a965

Co-authored-by: Tiago Firmino <[email protected]>

andremcorreia force-pushed the add_skipna_on_groupby_ops_pr branch from 788768a to 5e3a965 June 2, 2024 22:16

mroeschke requested a review from rhshadrach June 3, 2024 18:31

rhshadrach requested changes Jun 6, 2024

pandas/core/_numba/kernels/sum_.py

pandas/tests/groupby/test_reductions.py (outdated)

andremcorreia and others added 3 commits June 9, 2024 19:10

Refactored test parameterization

c692076

Co-authored-by: Tiago Firmino <[email protected]>

Added tests for EAs

e87e030

Co-authored-by: André Correia <[email protected]>

Removed Arrow support

4f11dab

pre-commit-ci bot and others added 3 commits June 11, 2024 02:14

[pre-commit.ci] auto fixes from pre-commit.com hooks

edbb331

for more information, see https://pre-commit.ci

pre-commit fix

3d719b8

Merge branch 'add_skipna_on_groupby_ops_pr' of github.com:andremcorre…

c2ceb57

…ia/pandas into add_skipna_on_groupby_ops_pr

rhshadrach reviewed Jun 12, 2024

pandas/core/groupby/ops.py (outdated)

WIP EAs support

2856c6d

Co-authored-by: André Correia <[email protected]>

tiago-firmino force-pushed the add_skipna_on_groupby_ops_pr branch from 558bc25 to 2856c6d June 14, 2024 22:34

andremcorreia and others added 4 commits July 4, 2024 14:15

Extension Array Support Tests

c200177

Co-authored-by: Tiago Firmino <[email protected]>

Merge branch 'main' into add_skipna_on_groupby_ops_pr

66e0ee4

WIP: Fixing Tests

262ca97

Co-authored-by: Tiago Firmino <[email protected]>

WIP: 32bit fix

91bb3c3

tiago-firmino and others added 5 commits August 3, 2024 01:05

WIP: overflow

7ee07d1

Co-authored-by: André Correia <[email protected]>

Merge branch 'main' into add_skipna_on_groupby_ops_pr

bae5217

Fix tests 32bit

5a004cf

small tweaks

0ef070c

simpler test skipping approach

d02b308

github-actions bot added the Stale label Sep 4, 2024

mroeschke closed this Sep 9, 2024
