[FEA] group by aggregations that include a list of strings as the grouping type #10181
Comments
Please do. Non-equality comparison is a different beast.
For groupby aggregate, we need a list hash and a list equality, neither of which is supported right now. But I think it's easier to do both of these than to implement a list lexicographical comparator, which would be needed for a sort-based groupby. With the former, we can do a hash groupby where the output order of the grouped keys is non-deterministic. And you can't even sort after the groupby aggregate, since sorting needs that same comparator. The result will just have to be in arbitrary order. I'm working on list equality right now and will probably use a similar technique to do a hash.
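To make the comment above concrete, here is a minimal CPU-side sketch of a hash groupby that needs only a list hash and a list equality, with no ordering comparator. It is plain Python, not libcudf code; `list_hash`, `list_equal`, and `hash_groupby_sum` are hypothetical names chosen for illustration, and the hash mixing constants are arbitrary.

```python
from typing import Dict, List, Optional, Tuple

# A key is either null (None) or a list whose elements are strings or null.
ListKey = Optional[List[Optional[str]]]


def list_hash(key: ListKey) -> int:
    """Hash a list-of-strings key; null rows and null elements get fixed hashes."""
    if key is None:
        return 0
    h = 17
    for elem in key:
        elem_hash = 1 if elem is None else hash(elem)
        h = (h * 31 + elem_hash) & 0xFFFFFFFFFFFFFFFF
    return h


def list_equal(a: ListKey, b: ListKey) -> bool:
    """Equality only: null equals null, otherwise same length and equal elements."""
    if a is None or b is None:
        return a is None and b is None
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def hash_groupby_sum(keys: List[ListKey], values: List[int]) -> List[Tuple[ListKey, int]]:
    """Sum values per distinct key using hash buckets plus an equality probe."""
    buckets: Dict[int, List[list]] = {}
    for key, value in zip(keys, values):
        slot = buckets.setdefault(list_hash(key), [])
        for entry in slot:
            if list_equal(entry[0], key):
                entry[1] += value
                break
        else:
            # No equal key in this bucket yet: start a new group.
            slot.append([key, value])
    # Callers should treat the order of the returned groups as arbitrary.
    return [(entry[0], entry[1]) for slot in buckets.values() for entry in slot]
```

Nothing in the sketch ever asks whether one list is "less than" another, which is why a hash groupby can be built before a lexicographical list comparator exists.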
I filed #10184 for sorting lists of strings. I will delete it from the description here.
Just realized, this looks like a special case of #8039.
Yes, this very much is a special case of #8039. If you want to mark them as duplicates, I am fine with it. I just wanted to be sure that the Spark requirements were properly captured.
Related to #8039 and #10181
Contributes to #10186
This PR updates `groupby::hash` to use new row operators. It gets rid of the current "flattened nested column" logic and allows `groupby::hash` to handle `LIST` and `STRUCT` keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.
It becomes a breaking PR since the updated `groupby::hash` will treat inner nulls as equal when top-level nulls are excluded while the current behavior treats inner nulls as **unequal**.
Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Nghia Truong (https://github.com/ttnghia)
- Devavret Makkar (https://github.com/devavret)

URL: #10770
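The breaking change described in the PR text can be illustrated with a small sketch. This is plain Python modelling the semantics only, under the assumption that "inner nulls unequal" means a row with a null field never matches any other row; `rows_equal` and `group_keys` are hypothetical helpers, not libcudf functions.

```python
from typing import List, Optional, Tuple

# A struct-like key row: null (None) at the top level, or a tuple with nullable fields.
Row = Optional[Tuple[Optional[str], ...]]


def rows_equal(a: tuple, b: tuple, inner_nulls_equal: bool) -> bool:
    """Compare two non-null rows field by field under the chosen inner-null policy."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x is None or y is None:
            # Two nulls match only when the policy says inner nulls are equal.
            if not (inner_nulls_equal and x is None and y is None):
                return False
        elif x != y:
            return False
    return True


def group_keys(keys: List[Row], inner_nulls_equal: bool) -> List[list]:
    """Return [representative_row, count] groups, excluding top-level null rows."""
    groups: List[list] = []
    for key in keys:
        if key is None:
            continue  # top-level nulls are excluded from grouping
        for group in groups:
            if rows_equal(group[0], key, inner_nulls_equal):
                group[1] += 1
                break
        else:
            groups.append([key, 1])
    return groups
```

With keys `[("a", None), ("a", None), None, ("a", "b")]`, the sketch yields two groups when inner nulls compare equal and three groups when they compare unequal, which is the kind of result change that makes the PR breaking.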
Is your feature request related to a problem? Please describe.
We have a customer that would like to perform group by aggregations in Spark on a column that contains a list of strings.
NVIDIA/spark-rapids#4656
Describe the solution you'd like
To be specific we want to be able to create a `cudf::group_by` where one of the columns in `keys` is a column of type `LIST` that holds a child data column of type `STRING`. The only operation we need to perform on it is `aggregate`.
For the grouping Spark defines equality recursively for its types. So `STRING` equality is the same as it is for `binary_op::NULL_EQUALS` or when doing a group by on another `STRING` column. A `LIST` is equal to another `LIST` if and only if both are `NULL` or both have the same number of elements and all of the elements compare as equal to each other.
We would also like to be able to sort a table where one of the sort keys is the same data type, a `LIST` of `STRING`s. I have filed #10184 for this as it is used in cases when the intermediate aggregation results have grown too large.
Describe alternatives you've considered
We could try to play games with delimiters and rewrite the list of strings into a single column of longer strings, but then we run into issues with nulls and all kinds of other things. It gets to be really hard to do.
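A short sketch shows why the delimiter workaround is fragile compared with the recursive equality requested in this issue. It is plain Python for illustration only; `spark_equal` and `join_with_delimiter` are hypothetical names, and the comma delimiter and the empty-string encoding of nulls are assumptions made for the example.

```python
from typing import List, Optional

ListKey = Optional[List[Optional[str]]]


def spark_equal(a, b) -> bool:
    """Recursive null-aware equality: nulls match nulls, lists match element-wise."""
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(spark_equal(x, y) for x, y in zip(a, b))
    return a == b


def join_with_delimiter(key: ListKey, delimiter: str = ",") -> str:
    """Naive workaround: flatten a list of strings into one delimited string."""
    if key is None:
        return ""
    return delimiter.join("" if s is None else s for s in key)


# Distinct keys under recursive equality collapse to the same flattened string.
collisions = [
    (["a,b"], ["a", "b"]),  # the delimiter appears inside an element
    (None, []),             # null list vs empty list
    ([], [""]),             # empty list vs list holding one empty string
    ([None], [""]),         # null element vs empty string
]
for left, right in collisions:
    assert not spark_equal(left, right)
    assert join_with_delimiter(left) == join_with_delimiter(right)
```

Escaping the delimiter and reserving sentinels for nulls and empty lists could remove these collisions, but each fix adds another encoding rule, which is the difficulty the paragraph above describes.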