[FEA] group by aggregations that include a list of strings as the grouping type #10181

revans2 · 2022-02-01T15:39:16Z

Is your feature request related to a problem? Please describe.
We have a customer that would like to perform group by aggregations in Spark on a column that contains a list of strings.

NVIDIA/spark-rapids#4656

Describe the solution you'd like
To be specific, we want to be able to create a `cudf::groupby::groupby` where one of the columns in `keys` is a column of type LIST that holds a child data column of type STRING.

The only operation we need to perform on it is aggregate.

For the grouping, Spark defines equality recursively for its types. So STRING equality is the same as it is for `binary_op::NULL_EQUALS` or when doing a group by on another STRING column. A LIST is equal to another LIST if and only if both are NULL, or both are non-NULL, have the same number of elements, and all corresponding elements compare as equal.
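A minimal sketch of the recursive equality described above, written in plain Python rather than against libcudf. The names `string_equal` and `list_equal` are hypothetical and only illustrate the semantics; `None` stands in for a NULL value.

```python
from typing import Optional, Sequence

def string_equal(a: Optional[str], b: Optional[str]) -> bool:
    """NULL_EQUALS-style comparison: two NULLs are equal, NULL never equals a value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

def list_equal(a: Optional[Sequence[Optional[str]]],
               b: Optional[Sequence[Optional[str]]]) -> bool:
    """Two lists are equal iff both are NULL, or both are non-NULL with the
    same length and pairwise-equal elements."""
    if a is None or b is None:
        return a is None and b is None
    if len(a) != len(b):
        return False
    return all(string_equal(x, y) for x, y in zip(a, b))
```

Under these assumptions a NULL list and an empty list are different keys, and `["a", None]` matches `["a", None]` because the element comparison treats NULLs as equal.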

We would also like to be able to sort a table where one of the sort keys is the same data type, a LIST of STRINGs. I have filed #10184 for this as it is used in cases when the intermediate aggregation results have grown too large.

Describe alternatives you've considered
We could try to play games with delimiters and rewrite the list of strings into just a column of longer strings, but then we have issues with nulls, and all kinds of other things. It really gets to be hard to do.
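To illustrate why the delimiter approach is fragile, here is a small sketch of a naive encoding. The function `naive_encode` and the choice of `","` as delimiter are assumptions made for this example, not something taken from Spark or cuDF.

```python
from typing import Optional, Sequence

def naive_encode(values: Optional[Sequence[Optional[str]]],
                 delimiter: str = ",") -> str:
    """Flatten a list of strings into one string (hypothetical, lossy encoding)."""
    if values is None:
        return ""  # a NULL list has no natural representation
    # NULL elements also have no natural representation; map them to "".
    return delimiter.join("" if v is None else v for v in values)

# Distinct keys that collide after encoding:
collision_on_delimiter = naive_encode(["a,b"]) == naive_encode(["a", "b"])
collision_on_nulls = (
    naive_encode(None) == naive_encode([]) == naive_encode([""]) == naive_encode([None])
)
```

Both flags evaluate to `True`: a string containing the delimiter is indistinguishable from two separate strings, and a NULL list, an empty list, a list with one empty string, and a list with one NULL all encode to the same value. Escaping and sentinel markers can patch individual cases, but each one adds more special handling.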


jrhemstad · 2022-02-01T16:20:59Z

> We would also like to be able to sort a table where one of the sort keys is the same data type, a LIST of STRINGs. This is not a hard requirement because we fall back to sorting the data if the intermediate results are so large that we cannot fit them all in GPU memory at once. If you want me to split this off into a separate request I am happy to do it.

Please do. Non-equality comparison is a different beast.

devavret · 2022-02-01T16:35:31Z

For groupby aggregate, we need a list hash and a list equality, neither of which is supported right now. But I think it's easier to do both of these rather than implement a list lexicographical comparator, which would be needed for sort based groupby.

With the former, we can do a hash groupby where the output order of grouped keys is non-deterministic. And you can't even sort after the groupby aggregate. It'll just have to be in arbitrary order.

I'm working on list equality right now and will probably use a similar technique to do a hash.
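As a rough illustration of why a hash plus an equality is enough for a hash-based groupby, here is a sketch in plain Python. It is not how libcudf implements it: the helper names `to_key` and `hash_groupby_sum` are hypothetical, and Python's built-in tuple `__hash__` and `__eq__` stand in for the list hash and list equality discussed above.

```python
from typing import Dict, Hashable, Optional, Sequence

def to_key(values: Optional[Sequence[Optional[str]]]) -> Hashable:
    """Turn a list key into something hashable; None stays None (NULL list)."""
    return None if values is None else tuple(values)

def hash_groupby_sum(keys: Sequence[Optional[Sequence[Optional[str]]]],
                     values: Sequence[int]) -> Dict[Hashable, int]:
    """Sum `values` per distinct list key using only hashing and equality."""
    groups: Dict[Hashable, int] = {}
    for key, value in zip(keys, values):
        hashable_key = to_key(key)
        groups[hashable_key] = groups.get(hashable_key, 0) + value
    return groups

result = hash_groupby_sum(
    [["a", "b"], None, ["a", "b"], [], None, ["a", None]],
    [1, 2, 3, 4, 5, 6],
)
```

No ordering between keys is ever needed here, which is the point: a lexicographic comparator would only be required for a sort-based groupby. A Python dict happens to preserve insertion order, whereas a GPU hash groupby gives no such guarantee, so callers should not rely on the order of the resulting groups.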

revans2 · 2022-02-01T17:56:31Z

I filed #10184 for sorting lists of strings. I will delete it from the description here.

devavret · 2022-02-01T19:17:05Z

Just realized, this looks like a special case of #8039.

revans2 · 2022-02-02T15:29:23Z

Yes, this very much is a special case of #8039. If you want to mark them as duplicates, I am fine with it. I just wanted to be sure that the Spark requirements were properly captured.

github-actions · 2022-03-04T16:08:27Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

revans2 · 2022-03-07T21:34:32Z

This is still wanted. @devavret, should I close this as a dupe of #8039?

Related to #8039 and #10181
Contributes to #10186

This PR updates `groupby::hash` to use new row operators. It gets rid of the current "flattened nested column" logic and allows `groupby::hash` to handle `LIST` and `STRUCT` keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

It becomes a breaking PR since the updated `groupby::hash` will treat inner nulls as equal when top-level nulls are excluded while the current behavior treats inner nulls as **unequal**.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Nghia Truong (https://github.com/ttnghia)
- Devavret Makkar (https://github.com/devavret)

URL: #10770

revans2 added the feature request, Needs Triage, and Spark labels Feb 1, 2022

revans2 mentioned this issue Feb 1, 2022

[FEA] Support Group-By on Array[String] NVIDIA/spark-rapids#4656

Closed

revans2 mentioned this issue Feb 1, 2022

[FEA] Support sorting on lists of Strings #10184

Closed

devavret mentioned this issue Feb 1, 2022

[FEA] Story - Supporting row operators on nested types #10186

Closed

github-actions bot added the inactive-30d label Mar 4, 2022

github-actions bot removed the inactive-30d label Mar 7, 2022

jrhemstad added the 0 - Backlog label and removed the Needs Triage label Mar 8, 2022

PointKernel self-assigned this May 9, 2022

PointKernel mentioned this issue May 10, 2022

Update groupby::hash to use new row operators for keys #10770

Merged

PointKernel mentioned this issue Sep 27, 2022

Support nested types as groupby keys in libcudf #11792

Merged


rapids-bot bot closed this as completed in #11792 Nov 16, 2022
