[FEA] list aggregation operator #9135

Closed
revans2 opened this issue Aug 27, 2021 · 7 comments · Fixed by #9621
Labels
feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code.) · Spark (Functionality that helps Spark RAPIDS)

Comments

revans2 (Contributor) commented Aug 27, 2021

Is your feature request related to a problem? Please describe.
In Spark we have run into a few situations where we want to sum all of the values in a list, or get the min value in a list, etc.

Describe the solution you'd like
It would be great to have an API that lets us run aggregation operations over all of the values in each list. This is essentially a sort-based group-by aggregation where the data is already sorted. In fact, when digging into the group-by sort implementation I see APIs that are very close to what we would want. The main thing we would need is a way to convert the offsets in a list column into the group_labels that are passed to the aggregate functions. In some cases we know we will be working on multiple lists that all share the same offsets, so it would be nice (though not a requirement) to be able to cache the group_labels.
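
As a minimal illustration of that offsets-to-labels step (a sketch with Thrust, not libcudf's actual internal implementation): a vectorized upper_bound over the interior offsets assigns each child-column element the index of the list that owns it.

```cpp
// Sketch: derive group labels from list offsets. For offsets = [0, 3, 5],
// elements 0..2 get label 0 and elements 3..4 get label 1.
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

thrust::device_vector<int> labels_from_offsets(
    thrust::device_vector<int> const& offsets,  // length: num_lists + 1
    int num_elements)                           // length of the child column
{
  thrust::device_vector<int> labels(num_elements);
  // For element index i, count the interior offsets that are <= i; that
  // count is the index of the list containing element i.
  thrust::upper_bound(thrust::device,
                      offsets.begin() + 1, offsets.end(),
                      thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(num_elements),
                      labels.begin());
  return labels;
}
```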

Describe alternatives you've considered
We can do this today (a sketch follows the steps below), but it is not as efficient as we would like:

  1. Create a sequence from 0 to the number of rows.
  2. Put it with the List column in a table and explode the table on the list column.
  3. Do a sort-based aggregation on the exploded table, with the exploded sequence column as the group-by key.
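
A rough sketch of those three steps against libcudf's public C++ API (assuming the cudf::sequence, cudf::explode, and cudf::groupby interfaces; stream and memory-resource arguments omitted for brevity):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/filling.hpp>
#include <cudf/groupby.hpp>
#include <cudf/lists/explode.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
#include <memory>
#include <utility>
#include <vector>

std::unique_ptr<cudf::column> sum_each_list(cudf::column_view const& lists_col)
{
  // 1. A sequence 0..num_rows-1 identifying each input row.
  auto init    = cudf::numeric_scalar<cudf::size_type>(0);
  auto row_ids = cudf::sequence(lists_col.size(), init);

  // 2. Explode the list column alongside the row ids. Note: rows whose
  //    list is empty or null are dropped by explode, so they produce no
  //    output group.
  auto exploded = cudf::explode(
      cudf::table_view({row_ids->view(), lists_col}), /*explode_column_idx=*/1);

  // 3. Sort-based group-by keyed on the (already sorted) row ids.
  cudf::groupby::groupby gb(cudf::table_view({exploded->get_column(0).view()}),
                            cudf::null_policy::EXCLUDE,
                            cudf::sorted::YES);
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = exploded->get_column(1).view();
  requests[0].aggregations.push_back(
      cudf::make_sum_aggregation<cudf::groupby_aggregation>());
  auto [keys, results] = gb.aggregate(requests);
  return std::move(results[0].results[0]);  // one sum per non-empty list
}
```

The extra materialization of the exploded table is the inefficiency being complained about here.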

Additional context
There is a generic operator in Spark that also wants to do aggregations on lists. It uses higher-order functions to define how to do those aggregations, so right now the plan is to pattern-match and translate those into specific aggregations.

@revans2 added the feature request, Needs Triage, and Spark labels Aug 27, 2021
@jrhemstad added the libcudf label and removed the Needs Triage label Aug 27, 2021
jrhemstad (Contributor) commented

At first glance, my intuition is to add an aggregate_lists function that takes a lists_column_view and an aggregation and spits out the per-list aggregation. Internally it would likely reuse some of the sort groupby functionality.
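
For instance, a hypothetical declaration along those lines (illustrative only, not what was ultimately merged):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <memory>

// Hypothetical: returns one aggregated value per list row of `input`,
// internally reusing the sort-based group-by machinery.
std::unique_ptr<cudf::column> aggregate_lists(
    cudf::lists_column_view const& input,
    cudf::aggregation const& agg);
```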

revans2 (Contributor, Author) commented Aug 27, 2021

That sounds great.

jrhemstad (Contributor) commented Aug 27, 2021

Perhaps slightly more generic would be a segmented_reduce API that takes a column to aggregate and a column of offsets defining the boundaries of the segments. Not sure if that would be more broadly useful.
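
A possible shape for that more generic form (hypothetical at the time of this comment; a cudf::segmented_reduce with a similar shape was later added by the PR referenced below):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <memory>

// Hypothetical: reduce each segment of `values`, where segment i spans
// element indices [offsets[i], offsets[i + 1]). The offsets need not come
// from a list column; any monotonically increasing boundaries work.
std::unique_ptr<cudf::column> segmented_reduce(
    cudf::column_view const& values,
    cudf::device_span<cudf::size_type const> offsets,
    cudf::aggregation const& agg,
    cudf::data_type output_dtype);
```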

revans2 (Contributor, Author) commented Aug 31, 2021

Either API works for me. The latter looks more generic, because we can get the offsets from anywhere and use them directly in the API. And if only the lists-based form existed, it would not be that hard to stitch the offsets and the data column back together into a list column_view.

isVoid (Contributor) commented Dec 13, 2021

Today's reductions are null-preserving (e.g. MAX([2, null, 4]) == null). Is there any need to support null-skipping reductions in the future, e.g. NULL_MAX([2, null, 4]) == 4? @revans2
cc @bdice

bdice (Contributor) commented Dec 13, 2021

I realized after my discussion with @isVoid earlier today that we're likely describing null_policy::INCLUDE / null_policy::EXCLUDE. (Thanks to @ttnghia for the pointer.) This already exists for other reductions in include/cudf/reductions.hpp. It seems we'll need to support it (if @revans2 agrees): Spark's behavior is sum([1, null, 3]) == 4 (equivalent to null_policy::EXCLUDE, in my understanding), while Python and others will probably expect sum([1, null, 3]) == null (equivalent to null_policy::INCLUDE).
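
For illustration, here is how the two policies play out through the segmented-reduce shape discussed above (this uses the cudf::segmented_reduce signature that eventually landed; treat the exact parameter list as approximate across versions):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/reduction.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>
#include <memory>

// For values = [1, null, 3] with offsets = [0, 3] (a single segment):
//   null_policy::EXCLUDE -> [4]     (Spark-style sum: nulls are skipped)
//   null_policy::INCLUDE -> [null]  (any null makes the segment's result null)
std::unique_ptr<cudf::column> sum_segments(
    cudf::column_view const& values,
    cudf::device_span<cudf::size_type const> offsets,
    cudf::null_policy null_handling)
{
  auto agg = cudf::make_sum_aggregation<cudf::segmented_reduce_aggregation>();
  return cudf::segmented_reduce(values, offsets, *agg,
                                cudf::data_type{cudf::type_id::INT64},
                                null_handling);
}
```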

rapids-bot (bot) pushed a commit that referenced this issue Mar 10, 2022
closes #9135 
closes #9552 

This PR adds support for numeric types to `simple_op`, `sum`, `prod`, `min`, `max`, `any`, `all`. Also, this PR adds `segmented_null_mask_reduction` to compute null mask reductions on segments.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Bradley Dice (https://github.com/bdice)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9621