Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Enabled skipna argument on groupby reduction ops #15675 #58844

Closed

Conversation

andremcorreia
Copy link
Contributor

Added a skipna argurment to the groupby reduction ops for consistency with the Series and Dataframe variants:

  • sum,
  • prod,
  • min,
  • max,
  • mean,
  • median,
  • var,
  • std,
  • sem

Added new relevant tests, updated api tests and whatsnew

Added a skipna argurment to the groupby reduction ops:
  sum, prod, min, max, mean, median, var, std and sem
Added relevant tests
Updated whatsnew to reflect changes

Co-authored-by: Tiago Firmino <[email protected]>
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only took a quick look, but overall this is looking good. Can you also add tests for EAs (the nullable and pyarrow dtypes).

andremcorreia and others added 2 commits June 2, 2024 20:39
Co-authored-by: Tiago Firmino <[email protected]>
Co-authored-by: André Correia <[email protected]>
@tiago-firmino tiago-firmino force-pushed the add_skipna_on_groupby_ops_pr branch from fa11256 to 5cd994c Compare June 2, 2024 19:44
tiago-firmino and others added 2 commits June 2, 2024 21:01
Co-authored-by: André Correia <[email protected]>
Co-authored-by: Tiago Firmino <[email protected]>
@andremcorreia andremcorreia force-pushed the add_skipna_on_groupby_ops_pr branch from 788768a to 5e3a965 Compare June 2, 2024 22:16
@mroeschke mroeschke requested a review from rhshadrach June 3, 2024 18:31
@andremcorreia
Copy link
Contributor Author

andremcorreia commented Jun 11, 2024

Hi, we refactored the tests as requested and were working on adding the EAs. We found and fixed a few sneaky edge case bugs with these, but we ran into a problem with dtype=pd.ArrowDtype(pa.int64()).

The current implementation with EAs creates a typed numpy array and no NA value can be directly used in place for integers, we don't really see a path forward without considerable changes for this particular dtype.

We could add arrow floats easily, but we felt like only having a few arrow dtypes supported doesn't make much sense. 

How should we proceed?

@rhshadrach

@rhshadrach
Copy link
Member

rhshadrach commented Jun 12, 2024

Since the current implementation with arrows requires creating a typed numpy array and no NA value can be directly used in place for integers, we don't really see a path forward without considerable changes for this particular dtype.

The groupby methods implemented in Cython use mask and result_mask arguments for this purpose. This indicates where NA values are in the input and output respectively, so that they don't need to be in the NumPy array. If this isn't clear I can flesh it out some more, just ask.

I can also open up a PR into your branch if you want some assistance with the EAs here.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once EAs are implemented, we'll also want a test for them in tests/extension/base/reduce.py.

Co-authored-by: André Correia <[email protected]>
@tiago-firmino tiago-firmino force-pushed the add_skipna_on_groupby_ops_pr branch from 558bc25 to 2856c6d Compare June 14, 2024 22:34
@tiago-firmino
Copy link

Hello, after some rethinking about how we were handling EAs, we believe we've got it done correctly now since we were overthinking it before. However, we noticed the relevant array types tend to use pd.NA whilst the Series variants we are using to compare for our expected values are inferring np.nan.
We would like to know what to do in this situation, that is, what should be the expected value in EAs case, np.nan or pd.NA.

Our way of testing it was as such:


df = DataFrame({"key": [1, 1, 1, 2, 2, 2], "values": Series(pd.array([-1.0, 1.2, -1.1, 1.5, np.nan, 1.0], dtype="Float64"))})
gb = df.groupby("key")
result_cython = getattr(gb, reduction_method)(skipna=False)
expected = gb.apply(
        lambda x: getattr(x, reduction_method)(skipna=False), include_groups=False
)
tm.assert_frame_equal(result_cython, expected, check_exact=False, check_dtype=False)

And the AssertionError we get is:

[-0.9000000000000001, <NA>]
E   Length: 2, dtype: Float64
E   [right]: [-0.9000000000000001, nan]
E   At positional index 1, first diff: <NA> != nan

In relation to the tests in tests/extension/base/reduce.py, we would appreciate if some help with understanding how the tests work, how they are called and where the arguments come from. We understood they are the foundation for other tests used in the tests/extension directory but some fields left us confused.

@rhshadrach
Copy link
Member

tiago-firmino force-pushed the add_skipna_on_groupby_ops_pr branch from 558bc25 to 2856c6d

@tiago-firmino - each time you force push, reviewers must review the entire PR as history could be changing. On small PRs, this isn't much of a problem, but this is not a small PR. Can I ask that you no longer force push here?

@rhshadrach
Copy link
Member

However, we noticed the relevant array types tend to use pd.NA whilst the Series variants we are using to compare for our expected values are inferring np.nan.

Indeed, apply here looks to be producing the wrong result.

In relation to the tests in tests/extension/base/reduce.py, we would appreciate if some help with understanding how the tests work, how they are called and where the arguments come from.

For this PR, the relevant tests are in tests/extension/base/reduce.py in the class BaseReduceTests. In tests/extension/base/__init__.py, the class ExtensionTests is then a subclass of this, so inherits its methods. pytest runs the files e.g. tests/extension/test_arrow.py where there is the class TestArrowArray(base.ExtensionTests). This class has all the methods of BaseReduceTests, so the tests are run. Within test_arrow.py are fixtures such as data which setup data specifically for the pyarrow tests. Likewise, in test_masked.py, the fixture data sets up data specifically for the NumPy-nullable tests.

@tiago-firmino
Copy link

tiago-firmino force-pushed the add_skipna_on_groupby_ops_pr branch from 558bc25 to 2856c6d

@tiago-firmino - each time you force push, reviewers must review the entire PR as history could be changing. On small PRs, this isn't much of a problem, but this is not a small PR. Can I ask that you no longer force push here?

I'm really sorry, I was not aware of that and will no longer do it.

@andremcorreia
Copy link
Contributor Author

Hello,
We've had a bit less free time lately, hence the delay, but we've run into some problems that we're not sure how to address.
The main issue is with the prod function using int8, int16, and int32. Locally, we can't replicate the errors that show up in the pipeline, but we were able to identify that the wrong value is in the expected output, not in our result. However, after spending a long time looking at this issue, neither of us has been able to find a solution.
Do you have any suggestions on how we could tackle this?

@rhshadrach
Copy link
Member

@andremcorreia - I should be able to take a look in the next few days.

@rhshadrach
Copy link
Member

@andremcorreia

but we were able to identify that the wrong value is in the expected output, not in our result.

It looks to me like the result is overflowing in 64-bit precision, the expected in 32-bit precision. On Windows, this will be fixed by NumPy 2.0. In any case, this isn't an issue with this PR. Can you add the following to the corresponding test:

if op_name == "prod" and skipna and data.dtype.itemsize < 8 and np.intp().itemsize < 8:
    pytest.xfail(reason=f"{op_name} with itemsize {data.dtype.itemsize} overflows")

Copy link
Contributor

github-actions bot commented Sep 4, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 4, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: enable skipna on groupby reduction ops
4 participants