Support `numeric_only` for simple groupby aggregations for pandas 2.0 compatibility #9889
There's one failure left to resolve,
The test failure is resolved by 8b877b5.
Thanks @j-bennet! This will be nice to have -- looking forward to seeing it merged
…ception was raised.
dask/dataframe/_compat.py
Outdated
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="The default value of numeric_only in",
        message="Dropping of nuisance columns",
Where's this coming from?
In pandas 1.3, there's this warning on non-numeric data with some of the aggs, and it's different from what it does in 1.5. We caught the 1.5 warning before, but not this one.
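For readers unfamiliar with the pattern in the diff, here is a minimal sketch of message-based warning suppression using only the standard library. The function `noisy_agg` is a hypothetical stand-in for a pandas aggregation (it is not part of dask or pandas), and the full warning text it emits is invented; only the two message prefixes come from the diff.

```python
import warnings


def noisy_agg():
    # Hypothetical stand-in for a pandas aggregation that warns on non-numeric data.
    warnings.warn("Dropping of nuisance columns (hypothetical full text)", FutureWarning)
    return 42


def quiet_agg():
    with warnings.catch_warnings():
        # `message` is a regular expression matched against the start of the
        # warning text, so one filter per pandas-version-specific prefix is needed.
        for prefix in (
            "The default value of numeric_only in",
            "Dropping of nuisance columns",
        ):
            warnings.filterwarnings("ignore", message=prefix)
        return noisy_agg()
```

The filters only apply inside the `with` block; once it exits, the previous filter state is restored, so unrelated warnings elsewhere are unaffected.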
@jrbourbeau So we still need this one?
There was one test where this is still needed. I'm inclined to just handle it as follow-up work
maybe_raise = not (
    func.__name__ == "agg"
    and len(args) > 0
    and args[0] not in NUMERIC_ONLY_NOT_IMPLEMENTED
)
Noting that this is to catch when operations in `NUMERIC_ONLY_NOT_IMPLEMENTED` are being used inside an `agg(...)` call
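To make the condition easier to follow, here is a small standalone sketch of the same boolean logic. The contents of `NUMERIC_ONLY_NOT_IMPLEMENTED` below are an assumption made for illustration (the actual list is not shown in this thread), and `should_maybe_raise` is a hypothetical helper name, not a dask function.

```python
# Assumed contents, for illustration only; the real list lives in dask.
NUMERIC_ONLY_NOT_IMPLEMENTED = ["mean", "std", "var"]


def should_maybe_raise(func_name, args):
    """Mirror the condition from the diff.

    Returns False only when the call is agg(<op>, ...) and <op> is not in
    NUMERIC_ONLY_NOT_IMPLEMENTED; every other call returns True.
    """
    return not (
        func_name == "agg"
        and len(args) > 0
        and args[0] not in NUMERIC_ONLY_NOT_IMPLEMENTED
    )
```

Under these assumptions, `should_maybe_raise("agg", ("sum",))` is `False`, while `should_maybe_raise("agg", ("mean",))`, `should_maybe_raise("agg", ())`, and `should_maybe_raise("sum", ())` are all `True`.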
dask/dataframe/groupby.py
Outdated
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="In a future version, the Index constructor will not infer numeric dtypes",
        category=FutureWarning,
    )
Is this related to other changes in this PR or known flaky tests?
This makes a flaky test happy.
assert_eq(ddf.groupby(ddf.w).y.nunique(), df.groupby(df.w).y.nunique())
assert_eq(ddf.y.groupby(ddf.w).count(), df.y.groupby(df.w).count())
I'm curious why these are indented. Are we emitting warnings now? I would have expected us to match warnings from pandas and, since pandas didn't appear to be warning before, I'm confused why we might be
Pandas was warning before, we didn't. Now we do, have to catch it.
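As a rough illustration of what "catching" a newly propagated warning in a test can look like, here is a stdlib-only sketch that records warnings instead of suppressing them. Both `legacy_count` and `call_and_collect` are hypothetical names, and the warning text is invented; the actual tests in this PR may use a different mechanism.

```python
import warnings


def legacy_count(values):
    # Hypothetical function that now forwards a warning it used to swallow.
    warnings.warn("numeric_only default will change (hypothetical message)", FutureWarning)
    return len(values)


def call_and_collect(func, *args):
    """Run func and return (result, list of warning messages it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        result = func(*args)
    return result, [str(w.message) for w in caught]
```

A test can then assert on both the result and the recorded messages, which keeps the warning from leaking into the test run while still checking that it was raised.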
…heck warning behavior.
Co-authored-by: James Bourbeau <[email protected]>
The last failure with minimal dependencies does not seem to be related... possibly flaky? https://github.com/dask/dask/actions/runs/4079092927/jobs/7030092277
Hmm I've not seen those before. My guess is they're somehow related to the changes in this PR
Turns out there were a couple of bugs in older versions of pandas that weren't straightforward to work around. I pushed a9e27d9 which just skips those specific configurations for now. I'll also push up a PR that bumps our minimum pandas version (it's been a while since we've done that).
@@ -117,7 +117,38 @@ def test_concat_unions_categoricals():
    tm.assert_frame_equal(_concat(frames5), pd.concat(frames6))


def test_unknown_categoricals(shuffle_method):
    # TODO: Remove the filterwarnings below
When can we do this TODO?
Thanks @j-bennet!
This PR adds `dtypes` property to `GroupBy`, this will also fix some upstream dask breaking changes introduced in: dask/dask#9889
Issue was discovered in: #12768 (comment)
Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)
Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Ashwin Srinath (https://github.com/shwina)
URL: #12783
Partially implement `numeric_only` on GroupBy operations, to align the behavior with Pandas. This PR only includes changes for aggs that are using `_single_agg` internally. More complicated aggs will have to be handled separately.
Xref #9736.
Xref #9471.
pre-commit run --all-files
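For context, the pandas behavior that this PR aligns with can be sketched as follows. The frame and column names are made up for illustration. The example uses pandas directly rather than dask, since the exact set of dask aggregations covered is not listed in this thread.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "x": [1, 2, 3],
        "label": ["p", "q", "r"],  # non-numeric column
    }
)

# With numeric_only=True, the string column is excluded from the result.
numeric_sum = df.groupby("key").sum(numeric_only=True)
print(list(numeric_sum.columns))  # ['x']
```

Passing the keyword explicitly avoids relying on the default, which is what changed between pandas versions and triggered the warnings discussed above.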