
[FEA] group by aggregations that include a list of strings as the grouping type #10181

Closed

revans2 opened this issue Feb 1, 2022 · 7 comments · Fixed by #11792
Assignees
Labels
0 - Backlog (In queue waiting for assignment), feature request (New feature or request), Spark (Functionality that helps Spark RAPIDS)

Comments

@revans2
Contributor

revans2 commented Feb 1, 2022

Is your feature request related to a problem? Please describe.
We have a customer that would like to perform group by aggregations in Spark on a column that contains a list of strings.

NVIDIA/spark-rapids#4656

Describe the solution you'd like
To be specific we want to be able to create a cudf::group_by where one of the columns in keys is a column of type LIST that holds a child data column of type STRING.

The only operation we need to perform on it is aggregate.
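
For illustration, a minimal sketch of the call we have in mind, assuming `cudf::groupby::groupby` accepts a LIST<STRING> key column (the helper name `count_by_list_of_strings_key` and the choice of a COUNT aggregation are hypothetical, just to make the request concrete):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>
#include <utility>
#include <vector>

std::pair<std::unique_ptr<cudf::table>, std::vector<cudf::groupby::aggregation_result>>
count_by_list_of_strings_key(cudf::column_view const& list_of_strings_key,
                             cudf::column_view const& values)
{
  // The keys table holds a single column of type LIST whose child column is STRING.
  cudf::table_view keys({list_of_strings_key});
  cudf::groupby::groupby gb(keys, cudf::null_policy::INCLUDE);

  // One aggregation request: COUNT over `values`, grouped by the LIST<STRING> key.
  std::vector<cudf::groupby::aggregation_request> requests(1);
  requests[0].values = values;
  requests[0].aggregations.push_back(
      cudf::make_count_aggregation<cudf::groupby_aggregation>());

  // Returns the unique keys and the per-group aggregation results.
  return gb.aggregate(requests);
}
```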

For the grouping, Spark defines equality recursively for its types. STRING equality is the same as for binary_op::NULL_EQUALS, or as when doing a group by on another STRING column. A LIST is equal to another LIST if and only if both are NULL, or both have the same number of elements and all of the elements compare as equal to each other.
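
For reference, the LIST rule above written out in plain C++ (the `lists_equal` helper is purely illustrative, not a libcudf API):

```cpp
#include <optional>
#include <string>
#include <vector>

using element  = std::optional<std::string>;            // a STRING element, possibly null
using list_row = std::optional<std::vector<element>>;   // a LIST row, possibly null

bool lists_equal(list_row const& a, list_row const& b)
{
  // Both NULL -> equal; exactly one NULL -> not equal.
  if (!a.has_value() || !b.has_value()) { return !a.has_value() && !b.has_value(); }
  // Both lists must have the same number of elements.
  if (a->size() != b->size()) { return false; }
  // Element-wise comparison; std::optional's operator== already treats
  // null == null as true, matching the NULL_EQUALS semantics above.
  for (std::size_t i = 0; i < a->size(); ++i) {
    if ((*a)[i] != (*b)[i]) { return false; }
  }
  return true;
}
```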

We would also like to be able to sort a table where one of the sort keys is the same data type, a LIST of STRINGs. I have filed #10184 for this, as it is needed in cases where the intermediate aggregation results have grown too large.

Describe alternatives you've considered
We could try to play games with delimiters and rewrite the list of strings into a single column of longer strings, but then we run into issues with nulls and all kinds of other things. It really gets hard to do correctly.

@revans2 added the feature request (New feature or request), Needs Triage (Need team to review and classify), and Spark (Functionality that helps Spark RAPIDS) labels on Feb 1, 2022
@jrhemstad
Contributor

> We would also like to be able to sort a table where one of the sort keys is the same data type, a LIST of STRINGs. This is not a hard requirement because we fall back to sorting the data if the intermediate results are so large that we cannot fit them all in GPU memory at once. If you want me to split this off into a separate request I am happy to do it.

Please do. Non-equality comparison is a different beast.

@devavret
Contributor

devavret commented Feb 1, 2022

For groupby aggregate, we need a list hash and a list equality operator, neither of which is supported right now. But I think it's easier to do both of these than to implement a list lexicographical comparator, which would be needed for a sort-based groupby.

With hash and equality alone, we can do a hash groupby where the output order of grouped keys is non-deterministic. And you can't even sort after the groupby aggregate; the result will just have to be in arbitrary order.

I'm working on list equality right now and will probably use a similar technique to do a hash.
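
Roughly, something along these lines for hashing one LIST<STRING> row so that rows which are equal under the null-equal rule also hash identically (the `hash_list_row` helper and its constants are only illustrative, not the planned libcudf implementation):

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

std::size_t hash_list_row(
    std::optional<std::vector<std::optional<std::string>>> const& row)
{
  if (!row.has_value()) { return 0xdeadbeef; }  // fixed hash for a NULL list
  std::size_t seed = row->size();               // the length participates in the hash
  for (auto const& elem : *row) {
    // Use a fixed sentinel for null elements so null == null rows hash the same.
    std::size_t const h = elem.has_value() ? std::hash<std::string>{}(*elem) : 0x9e3779b9;
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);  // boost-style hash_combine
  }
  return seed;
}
```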

@revans2
Contributor Author

revans2 commented Feb 1, 2022

I filed #10184 for sorting lists of strings. I will delete it from the description here.

@devavret
Contributor

devavret commented Feb 1, 2022

Just realized, this looks like a special case of #8039.

@revans2
Contributor Author

revans2 commented Feb 2, 2022

Yes, this very much is a special case of #8039. If you want to mark this as a duplicate of it, I am fine with that. I just wanted to be sure that the Spark requirements were properly captured.

@github-actions

github-actions bot commented Mar 4, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor Author

revans2 commented Mar 7, 2022

This is still wanted. @devavret should I close this as a dupe of #8039?

@jrhemstad added the 0 - Backlog (In queue waiting for assignment) label and removed the Needs Triage (Need team to review and classify) label on Mar 8, 2022
@PointKernel self-assigned this on May 9, 2022
rapids-bot pushed a commit that referenced this issue on May 25, 2022
Related to #8039 and #10181 

Contributes to #10186 

This PR updates `groupby::hash` to use new row operators. It gets rid of the current "flattened nested column" logic and allows `groupby::hash` to handle `LIST` and `STRUCT` keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

This is a breaking PR since the updated `groupby::hash` will treat inner nulls as equal when top-level nulls are excluded, while the current behavior treats inner nulls as **unequal**.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Nghia Truong (https://github.com/ttnghia)
  - Devavret Makkar (https://github.com/devavret)

URL: #10770