[FEA] support collect aggregations in reduction #7807

Closed
Tracked by #2062
sperlingxx opened this issue Apr 1, 2021 · 4 comments · Fixed by #10353
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

@sperlingxx
Contributor

sperlingxx commented Apr 1, 2021

Is your feature request related to a problem? Please describe.
Currently, cuDF supports collect aggregations in rolling-window and groupby contexts (though the support is not complete). However, collect aggregations in a reduction context are still missing.

Additional context
I believe we cannot support this feature until #5887 is resolved, since a scalar of ListType is essential for a collect reduction.

@sperlingxx sperlingxx added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Apr 1, 2021
@sperlingxx sperlingxx changed the title [FEA] Support collect aggregations in reduction [FEA] support collect aggregations in reduction Apr 1, 2021
@jrhemstad
Contributor

jrhemstad commented Apr 1, 2021

A collect reduction on a whole column doesn't make much sense to me. Turn an entire column into a list column with a single list?

@sperlingxx
Contributor Author

sperlingxx commented Apr 2, 2021

@jrhemstad Yes, it basically turns an entire column into a list column with a single list. We need this feature because we want to provide GPU support for the Spark built-in functions collect_list and collect_set when they are used as reductions. Here is an example:

> SELECT collect_list(col) FROM VALUES (1), (2), (1) AS tab(col);
 [1,2,1]
> SELECT collect_set(col) FROM VALUES (1), (2), (1) AS tab(col);
 [1,2]

So we need a method that produces a scalar (of ListType) from the corresponding input column, which matches the semantics of a reduction. Or is there an alternative approach? Perhaps we could achieve the goal more directly with a method like make_list_scalar_from_column. We also need to handle null values according to the null policy.
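To illustrate the semantics being requested, here is a minimal pure-Python sketch of a collect reduction. It is not the cuDF API: the function names mirror the Spark functions above, the `include_nulls` flag is a hypothetical stand-in for a null policy, and `None` stands in for a null row. Keeping first-seen order in `collect_set` is an assumption made for readability; Spark does not guarantee an order for sets.

```python
from typing import Hashable, Iterable, List, Optional


def collect_list(column: Iterable[Optional[Hashable]],
                 include_nulls: bool = False) -> List[Optional[Hashable]]:
    """Reduce a whole column to one list, optionally dropping nulls (None)."""
    return [v for v in column if include_nulls or v is not None]


def collect_set(column: Iterable[Optional[Hashable]],
                include_nulls: bool = False) -> List[Optional[Hashable]]:
    """Reduce a whole column to one list of distinct values.

    Assumption: first-seen order is kept, and nulls compare equal to each
    other when they are included.
    """
    seen = set()
    result = []
    for v in collect_list(column, include_nulls):
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result


# Mirrors the SQL example: the column holds the values 1, 2, 1.
print(collect_list([1, 2, 1]))  # [1, 2, 1]
print(collect_set([1, 2, 1]))   # [1, 2]
```

In both cases the result is a single list for the whole column, which is why a list-typed scalar is needed on the cuDF side.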

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Apr 6, 2021
@github-actions

github-actions bot commented May 6, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions

github-actions bot commented Feb 7, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@sperlingxx sperlingxx self-assigned this Feb 14, 2022
rapids-bot bot pushed a commit that referenced this issue Mar 10, 2022
Closes #7807 

The current PR supports the collect aggregation family in the reduction context, which includes collect_list, collect_set, merge_lists, and merge_sets.
The implementations are inspired by the corresponding collect aggregations in the groupby context.

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #10353
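For readers unfamiliar with the merge variants named in the commit message, the following pure-Python sketch shows what merge_lists and merge_sets would mean as reductions over a column of lists, for example when combining partial collect results from separate batches. It is an illustration only, not the libcudf implementation: the function names are reused for readability, `None` stands in for a null row, and skipping null rows and keeping first-seen order are assumptions.

```python
from typing import Hashable, Iterable, List, Optional


def merge_lists(list_column: Iterable[Optional[List[Hashable]]]) -> List[Hashable]:
    """Concatenate every list in the column into a single list.

    Assumption: null rows (None) contribute nothing to the result.
    """
    merged = []
    for row in list_column:
        if row is not None:
            merged.extend(row)
    return merged


def merge_sets(list_column: Iterable[Optional[List[Hashable]]]) -> List[Hashable]:
    """Concatenate every list in the column, then keep distinct values.

    Assumption: first-seen order is kept; dict.fromkeys preserves insertion order.
    """
    return list(dict.fromkeys(merge_lists(list_column)))


# Two partial results plus a null row, merged into one final list.
partials = [[1, 2], [1], None, [3]]
print(merge_lists(partials))  # [1, 2, 1, 3]
print(merge_sets(partials))   # [1, 2, 3]
```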