[FEA] list aggregation operator #9135

Closed
revans2 opened this issue Aug 27, 2021 · 7 comments · Fixed by #9621
Labels
feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code.) · Spark (Functionality that helps Spark RAPIDS)

Comments

revans2 (Contributor) commented Aug 27, 2021

Is your feature request related to a problem? Please describe.
In Spark we have run into a few situations where we want to sum all of the values in a list, or get the min value in a list, etc.

Describe the solution you'd like
It would be great to have an API that lets us run aggregation operations over all of the values in each list. This is essentially a sort-based group-by aggregation where the data is already sorted. In fact, when digging into the group-by sort implementation I see APIs that are very close to what we would want. The main thing we would need is a way to convert the offsets in a list column into the group_labels that are passed to the aggregate functions. In some cases we know we will be working on multiple lists that all share the same offsets, so it would be nice (though not a requirement) to be able to cache the group_labels.
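
As a minimal illustration of that offsets-to-labels step (a sketch with Thrust, not libcudf's actual internal implementation): a vectorized upper_bound over the interior offsets assigns each child-column element the index of the list that owns it.

```cpp
// Sketch: derive group labels from list offsets. For offsets = [0, 3, 5],
// elements 0..2 get label 0 and elements 3..4 get label 1.
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

thrust::device_vector<int> labels_from_offsets(
    thrust::device_vector<int> const& offsets,  // length: num_lists + 1
    int num_elements)                           // length of the child column
{
  thrust::device_vector<int> labels(num_elements);
  // For element index i, count the interior offsets that are <= i; that
  // count is the index of the list containing element i.
  thrust::upper_bound(thrust::device,
                      offsets.begin() + 1, offsets.end(),
                      thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(num_elements),
                      labels.begin());
  return labels;
}
```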

Describe alternatives you've considered
We can do this today (a sketch follows the steps below), but it is not as efficient as we would like:

  1. Create a sequence from 0 to the number of rows.
  2. Put it with the List column in a table and explode the table on the list column.
  3. Do a sort-based aggregation on the exploded table, with the exploded sequence column as the group-by key.
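
A rough sketch of those three steps against libcudf's public C++ API (assuming the cudf::sequence, cudf::explode, and cudf::groupby interfaces; stream and memory-resource arguments omitted for brevity):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/filling.hpp>
#include <cudf/groupby.hpp>
#include <cudf/lists/explode.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
#include <memory>
#include <utility>
#include <vector>

std::unique_ptr<cudf::column> sum_each_list(cudf::column_view const& lists_col)
{
  // 1. A sequence 0..num_rows-1 identifying each input row.
  auto init    = cudf::numeric_scalar<cudf::size_type>(0);
  auto row_ids = cudf::sequence(lists_col.size(), init);

  // 2. Explode the list column alongside the row ids. Note: rows whose
  //    list is empty or null are dropped by explode, so they produce no
  //    output group.
  auto exploded = cudf::explode(
      cudf::table_view({row_ids->view(), lists_col}), /*explode_column_idx=*/1);

  // 3. Sort-based group-by keyed on the (already sorted) row ids.
  cudf::groupby::groupby gb(cudf::table_view({exploded->get_column(0).view()}),
                            cudf::null_policy::EXCLUDE,
                            cudf::sorted::YES);
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = exploded->get_column(1).view();
  requests[0].aggregations.push_back(
      cudf::make_sum_aggregation<cudf::groupby_aggregation>());
  auto [keys, results] = gb.aggregate(requests);
  return std::move(results[0].results[0]);  // one sum per non-empty list
}
```

The extra materialization of the exploded table is the inefficiency being complained about here.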

Additional context
There is a generic operator in Spark that also wants to do aggregations on lists. It uses higher-order functions to define how to do those aggregations, so right now the plan is to pattern-match and translate those into specific aggregations.

@revans2 added the feature request, Needs Triage, and Spark labels Aug 27, 2021
@jrhemstad added the libcudf label and removed the Needs Triage label Aug 27, 2021
jrhemstad (Contributor) commented

At first glance, my intuition is to add an aggregate_lists function that takes a lists_column_view and an aggregation and spits out the per-list aggregation. Internally it would likely reuse some of the sort groupby functionality.
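
For instance, a hypothetical declaration along those lines (illustrative only, not what was ultimately merged):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <memory>

// Hypothetical: returns one aggregated value per list row of `input`,
// internally reusing the sort-based group-by machinery.
std::unique_ptr<cudf::column> aggregate_lists(
    cudf::lists_column_view const& input,
    cudf::aggregation const& agg);
```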

revans2 (Contributor, Author) commented Aug 27, 2021

That sounds great.

jrhemstad (Contributor) commented Aug 27, 2021

Perhaps slightly more generic would be a segmented_reduce API that takes a column to aggregate and a column of offsets defining the boundaries of the segments. Not sure if that would be more broadly useful.
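
A possible shape for that more generic form (hypothetical at the time of this comment; a cudf::segmented_reduce with a similar shape was later added by the PR referenced below):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <memory>

// Hypothetical: reduce each segment of `values`, where segment i spans
// element indices [offsets[i], offsets[i + 1]). The offsets need not come
// from a list column; any monotonically increasing boundaries work.
std::unique_ptr<cudf::column> segmented_reduce(
    cudf::column_view const& values,
    cudf::device_span<cudf::size_type const> offsets,
    cudf::aggregation const& agg,
    cudf::data_type output_dtype);
```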

revans2 (Contributor, Author) commented Aug 31, 2021

Either API works for me. The latter looks more generic, because we can get the offsets from anywhere and use them directly in the API. And if only the lists-based form existed, it would not be that hard to stitch the offsets and the data column back together into a list column_view.

isVoid (Contributor) commented Dec 13, 2021

Today's reductions are null-preserving (e.g. MAX([2, null, 4]) == null). Is there any need to support null-skipping reductions in the future, e.g. NULL_MAX([2, null, 4]) == 4? @revans2
cc @bdice

bdice (Contributor) commented Dec 13, 2021

I realized after my discussion with @isVoid earlier today that we're likely describing null_policy::INCLUDE / null_policy::EXCLUDE. (Thanks to @ttnghia for the pointer.) This already exists for other reductions in include/cudf/reductions.hpp. It seems we'll need to support it (if @revans2 agrees): Spark's behavior is sum([1, null, 3]) == 4 (equivalent to null_policy::EXCLUDE, in my understanding), while Python and others will probably expect sum([1, null, 3]) == null (equivalent to null_policy::INCLUDE).
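
For illustration, here is how the two policies play out through the segmented-reduce shape discussed above (this uses the cudf::segmented_reduce signature that eventually landed; treat the exact parameter list as approximate across versions):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/reduction.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <memory>

// For values = [1, null, 3] with offsets = [0, 3] (a single segment):
//   null_policy::EXCLUDE -> [4]     (Spark-style sum: nulls are skipped)
//   null_policy::INCLUDE -> [null]  (any null makes the segment's result null)
std::unique_ptr<cudf::column> sum_segments(
    cudf::column_view const& values,
    cudf::device_span<cudf::size_type const> offsets,
    cudf::null_policy null_handling)
{
  auto agg = cudf::make_sum_aggregation<cudf::segmented_reduce_aggregation>();
  return cudf::segmented_reduce(values, offsets, *agg,
                                cudf::data_type{cudf::type_id::INT64},
                                null_handling);
}
```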

rapids-bot (bot) pushed a commit that referenced this issue Mar 10, 2022
closes #9135 
closes #9552 

This PR adds support for numeric types to `simple_op`, `sum`, `prod`, `min`, `max`, `any`, `all`. Also, this PR adds `segmented_null_mask_reduction` to compute null mask reductions on segments.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Bradley Dice (https://github.com/bdice)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9621