Add basic multi-partition `GroupBy` support to cuDF-Polars #17503

rjzamora · 2024-12-04T00:33:33Z

Description

Adds multi-partition support for simple GroupBy aggregations, following the same design as #17441.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-12-04T00:33:37Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rjzamora · 2024-12-04T15:31:37Z

/ok to test

wence- · 2024-12-04T16:16:08Z

python/cudf_polars/cudf_polars/experimental/groupby.py

+    # Check that we are grouping on element-wise
+    # keys (is this already guaranteed?)
+    for ne in ir.keys:
+        if not isinstance(ne.value, Col):  # pragma: no cover
+            return _single_fallback(ir, children, partition_info)


What do you mean by element-wise keys? It's certainly not the case that we always group on columns. But I think it is the case that the group keys (if expressions) are trivially element-wise (e.g. a + b as a key is fine, but a.unique() or a.sort() is not).

Right. I'm being extra cautious by requiring the keys to be Col. This comment is essentially asking: "Can we drop this check altogether? I.e., will the keys always be element-wise?"

I believe so, yes

Opened pola-rs/polars#20152 as well
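To illustrate the distinction discussed above, here is a minimal sketch in plain Python (not the cudf-polars API). Lists stand in for partitions, and the helper `concat` is hypothetical. It shows why an element-wise key such as a + b can be evaluated independently on each partition, while a key such as a.sort() cannot.

```python
# Hypothetical illustration: plain Python lists stand in for partitions.
def concat(parts):
    """Concatenate a list of partitions into one list of rows."""
    return [x for part in parts for x in part]


a_parts = [[3, 1], [2, 1]]
b_parts = [[10, 20], [30, 40]]

# Element-wise key (a + b): evaluate on each partition, then concatenate.
key_pwise = concat(
    [[x + y for x, y in zip(a, b)] for a, b in zip(a_parts, b_parts)]
)
# Same key evaluated on the full (single-partition) data.
key_global = [x + y for x, y in zip(concat(a_parts), concat(b_parts))]
elementwise_ok = key_pwise == key_global

# Non-element-wise key (a.sort()): each row's value depends on all rows,
# so the partition-wise result differs from the global one.
sort_pwise = concat([sorted(a) for a in a_parts])
sort_global = sorted(concat(a_parts))
sort_ok = sort_pwise == sort_global
```

Under these assumptions, the partition-wise and global evaluations agree for a + b but not for the sorted key, which is why only element-wise keys are safe to evaluate per partition.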

wence- · 2024-12-04T17:14:50Z

python/cudf_polars/cudf_polars/experimental/groupby.py

+    agg_requests_pwise = []  # Partition-wise requests
+    agg_requests_tree = []  # Tree-node requests
+
+    for ne in ir.agg_requests:


We need to think about this (and possibly reorganise what we're doing in the single-partition case) to make this easier to handle.

For example, I think it is going to do the wrong thing for .agg(a.max() + b.min())

I think what you're trying to do here is turn a GroupBy(df, keys, aggs) into Reduce(LocalGroupBy(df, keys, agg_exprs), keys, transformed_aggs)

What does this look like? I think that once we've determined the "leaf" aggregations we're performing (e.g. col.max()), we must concat and combine to get the full leaf aggregations, and then evaluate the column expressions that produce the final result.

So suppose we have determined what the leaf aggs are, and then what the post-aggregation expressions are. For a single partition, this is effectively Select(GroupBy(df, keys, leaf_aggs), keys, post_agg_exprs), where post_agg_exprs are all guaranteed element-wise (for now).

thought: Would it be easier for you here if the GroupBy IR nodes really only held aggregation expressions that are "leaf" aggregations (with the post-processing done in a Select)?

I think it would, because then the transform becomes something like:

Select( GroupByCombine(GroupBy(df, keys, leaf_aggs), keys, post_aggs), keys, post_agg_exprs )

where GroupByCombine emits the tree-reduction tasks with the post aggregations.
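A rough sketch of the decomposition described above, for .agg(a.max() + b.min()), in plain Python rather than the actual IR classes. The function names `local_groupby`, `combine`, and `select` are hypothetical stand-ins for the partition-wise GroupBy, the tree-reduction combine step, and the final Select.

```python
import functools


def local_groupby(rows):
    """Partition-wise step: leaf aggregations max(a) and min(b) per key."""
    out = {}
    for key, a, b in rows:
        if key in out:
            amax, bmin = out[key]
            out[key] = (max(amax, a), min(bmin, b))
        else:
            out[key] = (a, b)
    return out


def combine(left, right):
    """Tree-node step: merge two partial results using the same leaf aggs."""
    merged = dict(left)
    for key, (amax, bmin) in right.items():
        if key in merged:
            merged[key] = (max(merged[key][0], amax), min(merged[key][1], bmin))
        else:
            merged[key] = (amax, bmin)
    return merged


def select(leaf):
    """Post-aggregation step: element-wise expression over the leaf results."""
    return {key: amax + bmin for key, (amax, bmin) in leaf.items()}


# Two partitions of (key, a, b) rows.
partitions = [
    [("x", 1, 5), ("y", 4, 2)],
    [("x", 7, 3), ("y", 0, 9)],
]
partials = [local_groupby(p) for p in partitions]
result = select(functools.reduce(combine, partials))
```

The point of the sketch is that the post-aggregation expression is only evaluated once, after the leaf aggregations have been fully combined, so the multi-partition result matches the single-partition one.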

> thought: Would it be easier for you here if the GroupBy IR nodes really only held aggregation expressions that are "leaf" aggregations (with the post-processing done in a Select)?

I'm pretty sure the answer is "yes" :)

Quick follow-up: I totally agree that we probably want to revise the upstream GroupBy design to make the decomposition here a bit simpler. With that said, I don't think we are doing anything "wrong" here. Rather, the code would just need to become unnecessarily messy if we wanted to do much more than "simple" mean/count/min/max aggregations.

> For example, I think it is going to do the wrong thing for .agg(a.max() + b.min())

We won't do the "wrong" thing here; we will just raise an error. E.g.:

polars.exceptions.ComputeError: NotImplementedError: GroupBy does not support multiple partitions for this expression: BinOp(<pylibcudf.types.DataType object at 0x7f06ebcc63b0>, <binary_operator.ADD: 0>, Cast(<pylibcudf.types.DataType object at 0x7f06ebcc63b0>, Agg(<pylibcudf.types.DataType object at 0x7f06ebcc6370>, 'max', False, Col(<pylibcudf.types.DataType object at 0x7f06ebcc6370>, 'x'))), Agg(<pylibcudf.types.DataType object at 0x7f06ebcc63b0>, 'max', False, Col(<pylibcudf.types.DataType object at 0x7f06ebcc63b0>, 'z')))
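For readers following along, the "simple aggregations only, otherwise raise" behavior can be sketched roughly as below. This is not the code in this PR: `decompose` and `SIMPLE` are hypothetical names, and aggregations are represented by plain strings rather than expression nodes. Each simple aggregation maps to a (partition-wise, tree-node) pair, and anything else raises NotImplementedError.

```python
# Hypothetical mapping: partition-wise aggregation -> tree-node aggregation.
SIMPLE = {
    "min": "min",
    "max": "max",
    "sum": "sum",
    "count": "sum",  # partial counts are summed when combining partitions
}


def decompose(agg_name):
    """Return (partition-wise, tree-node) aggregation pairs for agg_name."""
    if agg_name in SIMPLE:
        return [(agg_name, SIMPLE[agg_name])]
    if agg_name == "mean":
        # mean = sum / count; the division happens after the tree reduction.
        return [("sum", "sum"), ("count", "sum")]
    raise NotImplementedError(
        "GroupBy does not support multiple partitions "
        f"for this expression: {agg_name}"
    )
```

With a table like this, a compound expression such as a.max() + b.min() falls through to the error branch instead of silently producing an incorrect result.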


rjzamora · 2024-12-19T18:56:37Z

This PR is pretty much "ready". I don't think it makes sense to build more groupby logic directly on top of this; it would be much better to revise the underlying GroupBy class to make decomposition easier. However, it seems valuable to have basic groupby support available for testing/benchmarking before we have time to prioritize a larger GroupBy revision.

basic groupby-aggregation support

f0964a6

rjzamora added the feature request, 2 - In Progress, non-breaking, and cudf.pandas labels Dec 4, 2024

rjzamora self-assigned this Dec 4, 2024

Merge branch 'branch-25.02' into cudf-polars-multi-groupby

1329cf1

github-actions bot added the Python and cudf.polars labels and removed the cudf.pandas label Dec 4, 2024

Merge branch 'branch-25.02' into cudf-polars-multi-groupby

11a03f8

wence- reviewed Dec 4, 2024


rjzamora added 4 commits December 4, 2024 14:12

Merge remote-tracking branch 'upstream/branch-25.02' into cudf-polars-multi-groupby

a9fa486

remove GroupbyTree

b1224a0

simplify lower

385f03a

Merge remote-tracking branch 'upstream/branch-25.02' into cudf-polars-multi-groupby

8956215

rjzamora marked this pull request as ready for review December 6, 2024 04:22

rjzamora requested a review from a team as a code owner December 6, 2024 04:22

rjzamora requested review from vyasr and galipremsagar December 6, 2024 04:22

rjzamora added 4 commits December 19, 2024 10:14

Merge remote-tracking branch 'upstream/branch-25.02' into cudf-polars-multi-groupby

70b29b2

cleanup

3f04eca

no cover

e090de5

tweak error message

24b88f2
