
[FEA] min and max aggregation and reduction for single level struct types #7995

Closed
revans2 opened this issue Apr 19, 2021 · 4 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Apr 19, 2021

Is your feature request related to a problem? Please describe.
We would like to be able to support the min and max SQL operators for single-level structs (no nesting) in Spark. By this I mean that a struct of strings must be supported, but a struct of lists or a struct of other structs need not be.

Describe the solution you'd like
It would be great if we could get the existing min and max aggregations to work on non-nested structs for group by aggregations and reductions. Bonus points if we can make it work for window operations too, but that is not a requirement.

Describe alternatives you've considered
Because of how the group-by aggregations work, there really is no way for us to pick apart the struct and do the aggregation using the existing APIs.

Additional context
This should follow the rules that min/max already support, and also match the sorting behavior for structs #7422.

Min should return the minimum non-null value in the range. If all of the values in the range are null, then null is returned.
The minimum value is determined by ordering the structs in ascending order, comparing the child columns left to right, with a null child value ordering before (less than) any non-null value, as in the Spark example below.

Max should return the maximum non-null value in the range. If all of the values in the range are null, then null is returned.
The maximum value is determined by the same ascending ordering: the max is the last struct in that order.

scala> df.orderBy(col("b"), col("a")).show
+------+---+
|     a|  b|
+------+---+
|  null|  1|
|   [,]|  1|
| [, 1]|  1|
|  [1,]|  1|
|[1, 1]|  1|
|  null|  2|
+------+---+


scala> df.groupBy("b").agg(max(col("a")), min(col("a"))).show
+---+------+------+
|  b|max(a)|min(a)|
+---+------+------+
|  1|[1, 1]|   [,]|
|  2|  null|  null|
+---+------+------+

In this case we have a struct column a with two string entries in it. If a field is printed as an empty string it is really null (sorry, that is how Spark prints it). So the minimum value for group 1 is [,], the struct with both children null: the null struct itself is filtered out, and null child values are least (first) in ascending order. For group 2, where all of the structs are null, null is returned.
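The semantics above can be sketched in plain Python (illustrative only; these names are not the libcudf API). A struct row is a tuple of child values, with None standing for null; null structs are filtered out, and a null child orders before any non-null value:

```python
# Plain-Python sketch of the requested struct min/max semantics
# (not the actual libcudf implementation).

def struct_key(row):
    # A null child sorts before any non-null value in ascending order,
    # so map None to (0,) and a real value v to (1, v).
    return tuple((0,) if v is None else (1, v) for v in row)

def struct_min(rows):
    non_null = [r for r in rows if r is not None]  # null structs are filtered out
    return min(non_null, key=struct_key) if non_null else None

def struct_max(rows):
    non_null = [r for r in rows if r is not None]
    return max(non_null, key=struct_key) if non_null else None

group1 = [None, (None, None), (None, "1"), ("1", None), ("1", "1")]
print(struct_max(group1))        # ('1', '1')   -- the [1, 1] above
print(struct_min(group1))        # (None, None) -- the [,] above
print(struct_min([None, None]))  # None -- an all-null group returns null
```

This reproduces the group-by output shown: max(a) = [1, 1] and min(a) = [,] for group 1, and null for the all-null group 2.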

I would love to hear from the python side if they have different requirements.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Apr 19, 2021
@kkraus14
Collaborator

Python generally doesn't have strict requirements for struct columns, since Pandas stores a column of Python dictionaries in this situation and defers to Python dictionary semantics, which don't allow comparisons. Generally, I would just push for our null handling semantics to be consistent between parent and child columns.
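For reference, Python dictionaries support equality but define no ordering, so there is no native min/max semantics on the pandas side to defer to:

```python
# Python dicts support == but not <, so min/max over a column of dicts
# has no built-in meaning that cudf would need to match.
try:
    {"a": 1} < {"a": 2}
except TypeError as err:
    print(err)  # '<' not supported between instances of 'dict' and 'dict'
```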

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Apr 20, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor Author

revans2 commented May 20, 2021

This is still wanted.

@jrhemstad jrhemstad added the 0 - Backlog In queue waiting for assignment label May 20, 2021
@ttnghia ttnghia self-assigned this Oct 18, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 10, 2021
… aggregate() and scan() (#9545)

This PR adds struct support for `min`, `max`, `argmin` and `argmax` in groupby aggregation, and `min`, `max` in groupby scan. Although these operations require implementation for both hash-based and sort-based approaches, this PR only adds struct support into the sort-based approach due to the complexity of the hash-based implementation. Struct support for the hash-based approach will be future work.
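The sort-based approach can be illustrated very loosely in plain Python (not the actual libcudf/CUDA implementation; all names here are illustrative): sort the rows by (group key, struct ordering) after dropping null structs, then each group's min is its first row and its max is its last row.

```python
# Loose plain-Python illustration of a sort-based groupby min/max on structs
# (not the libcudf implementation). None stands for null.
from itertools import groupby

def struct_key(row):
    # Null children order before any non-null value in ascending order.
    return tuple((0,) if v is None else (1, v) for v in row)

def groupby_min_max(keys, structs):
    # Drop null structs, then sort by (group key, struct ordering).
    rows = sorted(
        ((k, s) for k, s in zip(keys, structs) if s is not None),
        key=lambda ks: (ks[0], struct_key(ks[1])),
    )
    out = {k: None for k in keys}  # all-null groups stay null
    for k, grp in groupby(rows, key=lambda ks: ks[0]):
        grp = list(grp)
        out[k] = (grp[0][1], grp[-1][1])  # (min, max) per group
    return out

keys = [1, 1, 1, 1, 1, 2]
structs = [None, (None, None), (None, "1"), ("1", None), ("1", "1"), None]
print(groupby_min_max(keys, structs))
# {1: ((None, None), ('1', '1')), 2: None}
```

A hash-based version would need a hasher and equality comparator over nested rows, which is the extra complexity the PR defers to future work.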

Partially addresses #8974 and #7995.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9545
@ttnghia
Contributor

ttnghia commented Jan 18, 2022

This should be closed as it has been addressed.
The remaining struct work is tracked in #8974.

@ttnghia ttnghia closed this as completed Jan 18, 2022