[FEA] list aggregation operator #9135
Comments
At first glance, my intuition is to add an …
That sounds great.
Perhaps slightly more generic would be a …
Either API works for me. The latter looks more generic because we can get the offsets from anywhere and use them directly in the API. But it is not that hard to stitch the offsets back into a column_view along with the data column.
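For reference, that "stitching" might look roughly like the sketch below, assuming `cudf::make_lists_column` from `cudf/column/column_factories.hpp` and no top-level nulls; `stitch_lists` is a hypothetical helper, not a proposed API:

```cpp
// Illustrative sketch: rebuild an owning LIST column from an INT32 offsets
// column and a flat values column so it can be passed to list-aware APIs.
#include <cudf/column/column.hpp>
#include <cudf/column/column_factories.hpp>

#include <rmm/device_buffer.hpp>

#include <memory>
#include <utility>

std::unique_ptr<cudf::column> stitch_lists(std::unique_ptr<cudf::column> offsets,
                                           std::unique_ptr<cudf::column> values)
{
  // Number of list rows is one less than the number of offsets.
  auto const num_rows = offsets->size() - 1;

  // No top-level nulls in this sketch: empty null mask, null count of zero.
  return cudf::make_lists_column(num_rows,
                                 std::move(offsets),
                                 std::move(values),
                                 0,
                                 rmm::device_buffer{});
}
```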
This issue has been labeled …
I realized after my discussion with @isVoid earlier today that we're likely describing …
closes #9135, closes #9552

This PR adds support for numeric types to the `simple_op` aggregations `sum`, `prod`, `min`, `max`, `any`, and `all`. It also adds `segmented_null_mask_reduction` to compute null mask reductions on segments.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Robert (Bobby) Evans (https://github.com/revans2)
- Bradley Dice (https://github.com/bdice)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #9621
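With that merged, a per-list sum can be expressed as a segmented reduction. A rough usage sketch follows, assuming the public `cudf::segmented_reduce` API as it appears in recent libcudf (the exact signature may have changed since the version this PR targeted) and an un-sliced `LIST<INT32>` input:

```cpp
// Illustrative sketch only: reduce every list in a LIST<INT32> column to its sum.
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/reduction.hpp>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

#include <cstddef>
#include <memory>

std::unique_ptr<cudf::column> sum_each_list(cudf::column_view const& lists)
{
  // Assumes `lists` is an un-sliced LIST column; its offsets child delimits the segments.
  cudf::lists_column_view const lcv{lists};
  auto const offsets = lcv.offsets();
  cudf::device_span<cudf::size_type const> const offset_span{
    offsets.data<cudf::size_type>(), static_cast<std::size_t>(offsets.size())};

  // Sum each segment, skipping nulls; INT32 output may overflow for large lists.
  auto const agg = cudf::make_sum_aggregation<cudf::segmented_reduce_aggregation>();
  return cudf::segmented_reduce(lcv.child(),
                                offset_span,
                                *agg,
                                cudf::data_type{cudf::type_id::INT32},
                                cudf::null_policy::EXCLUDE);
}
```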
Is your feature request related to a problem? Please describe.
In Spark we have run into a few situations where we want to sum all of the values in a list, or get the min value in a list, etc.
Describe the solution you'd like
It would be great to have an API that lets us run aggregation operations over all of the values in each list. This is essentially a sort-based group-by aggregation where the data is already sorted. When digging into the groupby sort implementation I see APIs that are very close to what we would want. The main thing we would need is a way to convert the offsets in a list column into the group_labels that are passed to the aggregate functions. In some cases we know that we will be working on multiple lists that all share the same offsets, so it would also be nice (though not required) to be able to cache the group_labels.
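To make the offsets-to-labels step concrete, here is a minimal Thrust sketch; it is not an existing libcudf API, `offsets_to_labels` is a hypothetical name, and it assumes the offsets (length `num_lists + 1`) already live on the device:

```cpp
// Convert list offsets into per-element group labels.
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

#include <cstdint>

// offsets = [0, 3, 3, 7] describes lists of sizes 3, 0, 4; the result is
// labels = [0, 0, 0, 2, 2, 2, 2]: element i belongs to list labels[i].
thrust::device_vector<std::int32_t> offsets_to_labels(
  thrust::device_vector<std::int32_t> const& offsets)
{
  std::int32_t const num_elements = offsets.back();  // total number of list elements
  thrust::device_vector<std::int32_t> labels(num_elements);

  // For element index i, the label is the number of offsets (excluding the
  // leading 0) that are <= i, i.e. an upper_bound over offsets[1..end).
  thrust::upper_bound(thrust::device,
                      offsets.begin() + 1,
                      offsets.end(),
                      thrust::counting_iterator<std::int32_t>(0),
                      thrust::counting_iterator<std::int32_t>(num_elements),
                      labels.begin());
  return labels;
}
```

These labels could then be fed to the existing sort-based groupby aggregation paths, and cached when several list columns share the same offsets.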
Describe alternatives you've considered
We can do this today, but it is not as efficient as we would like.
Additional context
There is a generic operator in Spark that also performs aggregations on lists. It uses higher-order functions to define how those aggregations are done, so for now the plan is to pattern match those expressions and translate them into specific aggregations.