Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support numeric_only for simple groupby aggregations for pandas 2.0 compatibility #9889

Merged
merged 41 commits into from
Feb 10, 2023

Conversation

j-bennet
Copy link
Contributor

@j-bennet j-bennet commented Jan 27, 2023

Partially implement numeric_only on GroupBy operations, to align the behavior with Pandas.

This PR only includes changes for aggs that are using _single_agg internally. More complicated aggs will have to be handled separately.

Xref #9736.
Xref #9471.

  • Tests added / passed
  • Passes pre-commit run --all-files

@j-bennet
Copy link
Contributor Author

There's one failure left to resolve, dask/dataframe/tests/test_groupby.py::test_groupby_numeric_only_supported fails with sum agg and numeric_only=False with pandas 1.5:

            elif isinstance(dsk, dd.DataFrame):
                assert "DataFrame" in type(result).__name__, type(result)
                assert isinstance(dsk.columns, pd.Index), type(dsk.columns)
                assert type(dsk._meta) == type(result), type(dsk._meta)
                if check_names:
>                   tm.assert_index_equal(dsk.columns, result.columns)
E                   AssertionError: Index are different
E                   
E                   Index length are different
E                   [left]:  2, Index(['B', 'C'], dtype='object')
E                   [right]: 1, Index(['B'], dtype='object')

../utils.py:494: AssertionError

@j-bennet
Copy link
Contributor Author

j-bennet commented Jan 27, 2023

The test failure is resolved by 8b877b5.

@j-bennet j-bennet changed the title [WIP] numeric_only support for simple aggs, pandas 2.0 compatibility numeric_only support for simple aggs, pandas 2.0 compatibility Jan 27, 2023
@j-bennet j-bennet requested a review from jrbourbeau January 27, 2023 21:10
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @j-bennet! This will be nice to have -- looking forward to seeing it merged

dask/dataframe/groupby.py Outdated Show resolved Hide resolved
dask/dataframe/groupby.py Outdated Show resolved Hide resolved
dask/dataframe/groupby.py Show resolved Hide resolved
dask/dataframe/groupby.py Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
@j-bennet j-bennet requested a review from jrbourbeau February 1, 2023 21:55
with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
message="The default value of numeric_only in",
message="Dropping of nuisance columns",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's this coming from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pandas 1.3, there's this warning on non-numeric data with some of the aggs, and it's different from what it does in 1.5. We caught the 1.5 warning before, but not this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrbourbeau So we still need this one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was one test where this is still needed. I'm inclined to just handle it as follow-up work

Comment on lines +319 to +323
maybe_raise = not (
func.__name__ == "agg"
and len(args) > 0
and args[0] not in NUMERIC_ONLY_NOT_IMPLEMENTED
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that this is to catch when operations in NUMERIC_ONLY_NOT_IMPLEMENTED are being used inside an agg(...) call

Comment on lines 642 to 647
with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
message="In a future version, the Index constructor will not infer numeric dtypes",
category=FutureWarning,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to other changes in this PR or known flaky tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a flaky test happy.

dask/dataframe/tests/test_arithmetics_reduction.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_categorical.py Outdated Show resolved Hide resolved
Comment on lines 139 to 140
assert_eq(ddf.groupby(ddf.w).y.nunique(), df.groupby(df.w).y.nunique())
assert_eq(ddf.y.groupby(ddf.w).count(), df.y.groupby(df.w).count())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why these are indented. Are we emitting warnings now? I would have expected us to match warnings from pandas and, since pandas didn't appear to be warning before, I'm confused why we might be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas was warning before, we didn't. Now we do, have to catch it.

dask/dataframe/tests/test_dataframe.py Outdated Show resolved Hide resolved
dask/dataframe/tests/test_groupby.py Outdated Show resolved Hide resolved
@j-bennet
Copy link
Contributor Author

j-bennet commented Feb 3, 2023

The last failure with minimal dependencies does not seem to be related... possibly flaky?

https://github.com/dask/dask/actions/runs/4079092927/jobs/7030092277

FAILED dask/dataframe/tests/test_arithmetics_reduction.py::test_reductions_frame_dtypes[True-min-None] - AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  object
[right]: float64
FAILED dask/dataframe/tests/test_arithmetics_reduction.py::test_reductions_frame_dtypes[True-max-None] - AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  object
[right]: float64

@jrbourbeau
Copy link
Member

The last failure with minimal dependencies does not seem to be related... possibly flaky?

Hmm I've not seen those before. My guess is they're somehow related to the changes in this PR

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out there were a couple of bugs in older versions of pandas that weren't straightforward to workaround. I pushed a9e27d9 which just skips those specific configurations for now. I'll also push up a PR that bumps our minimum pandas version (it's been a while since we've done that).

@@ -117,7 +117,38 @@ def test_concat_unions_categoricals():
tm.assert_frame_equal(_concat(frames5), pd.concat(frames6))


def test_unknown_categoricals(shuffle_method):
# TODO: Remove the filterwarnings below
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can we do this TODO?

@jrbourbeau jrbourbeau changed the title numeric_only support for simple aggs, pandas 2.0 compatibility Support numeric_only for simple groupby aggregations for pandas 2.0 compatibility Feb 10, 2023
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @j-bennet!

@jrbourbeau jrbourbeau merged commit 052a625 into dask:main Feb 10, 2023
@j-bennet j-bennet deleted the j-bennet/9736-groupby-numeric-only branch February 14, 2023 05:17
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Feb 16, 2023
This PR adds `dtypes` property to `GroupBy`, this will also fix some upstream dask breaking changes introduced in: dask/dask#9889

Issue was discovered in: #12768 (comment)

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)

URL: #12783
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants