
[FEA] min and max aggregation and reduction for single level struct types #7995

Closed
revans2 opened this issue Apr 19, 2021 · 4 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Apr 19, 2021

Is your feature request related to a problem? Please describe.
We would like to be able to support the min and max SQL operators for single-level structs (no nesting) in Spark. By this I mean that a struct of strings must be supported, but a struct of lists or a struct of other structs need not be.

Describe the solution you'd like
It would be great if we could get the existing min and max aggregations to work on non-nested structs for group by aggregations and reductions. Bonus points if we can make it work for window operations too, but that is not a requirement.

Describe alternatives you've considered
Because of how the group-by aggregations work, there really is no way for us to pick apart the struct and do the aggregation using the existing APIs.

Additional context
This should follow the rules that min/max already support, and also match the sorting behavior for structs #7422.

Min should return the minimum non-null value in the range. If all of the values in the range are null, then null is returned.
The minimum value is determined by ordering the structs in ascending order, comparing the child columns left to right, with a null child value ordering before (less than) any non-null value, as in the Spark example below.

Max should return the maximum non-null value in the range. If all of the values in the range are null, then null is returned.
The maximum value is determined by the same ascending ordering: the max is the last struct in that order.

scala> df.orderBy(col("b"), col("a")).show
+------+---+
|     a|  b|
+------+---+
|  null|  1|
|   [,]|  1|
| [, 1]|  1|
|  [1,]|  1|
|[1, 1]|  1|
|  null|  2|
+------+---+


scala> df.groupBy("b").agg(max(col("a")), min(col("a"))).show
+---+------+------+
|  b|max(a)|min(a)|
+---+------+------+
|  1|[1, 1]|   [,]|
|  2|  null|  null|
+---+------+------+

In this case we have a struct column a with two string entries in it. If a field is printed as an empty string it is really null (sorry, that is how Spark prints it). So the minimum value for group 1 is [,], the struct with both children null: the null struct itself is filtered out, and null child values are least (first) in ascending order. For group 2, where all of the structs are null, null is returned.
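The semantics above can be sketched in plain Python (illustrative only; these names are not the libcudf API). A struct row is a tuple of child values, with None standing for null; null structs are filtered out, and a null child orders before any non-null value:

```python
# Plain-Python sketch of the requested struct min/max semantics
# (not the actual libcudf implementation).

def struct_key(row):
    # A null child sorts before any non-null value in ascending order,
    # so map None to (0,) and a real value v to (1, v).
    return tuple((0,) if v is None else (1, v) for v in row)

def struct_min(rows):
    non_null = [r for r in rows if r is not None]  # null structs are filtered out
    return min(non_null, key=struct_key) if non_null else None

def struct_max(rows):
    non_null = [r for r in rows if r is not None]
    return max(non_null, key=struct_key) if non_null else None

group1 = [None, (None, None), (None, "1"), ("1", None), ("1", "1")]
print(struct_max(group1))        # ('1', '1')   -- the [1, 1] above
print(struct_min(group1))        # (None, None) -- the [,] above
print(struct_min([None, None]))  # None -- an all-null group returns null
```

This reproduces the group-by output shown: max(a) = [1, 1] and min(a) = [,] for group 1, and null for the all-null group 2.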

I would love to hear from the python side if they have different requirements.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Apr 19, 2021
@kkraus14
Collaborator

Python generally doesn't have strict requirements for struct columns, since Pandas stores a column of Python dictionaries in this situation and defers to Python dictionary semantics, which don't allow comparisons. Generally, I would just push for our null handling semantics to be consistent between parent and child columns.
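For reference, Python dictionaries support equality but define no ordering, so there is no native min/max semantics on the pandas side to defer to:

```python
# Python dicts support == but not <, so min/max over a column of dicts
# has no built-in meaning that cudf would need to match.
try:
    {"a": 1} < {"a": 2}
except TypeError as err:
    print(err)  # '<' not supported between instances of 'dict' and 'dict'
```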

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Apr 20, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@revans2
Contributor Author

revans2 commented May 20, 2021

This is still wanted.

@jrhemstad jrhemstad added the 0 - Backlog In queue waiting for assignment label May 20, 2021
@ttnghia ttnghia self-assigned this Oct 18, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 10, 2021
… aggregate() and scan() (#9545)

This PR adds struct support for `min`, `max`, `argmin` and `argmax` in groupby aggregation, and `min`, `max` in groupby scan. Although these operations require implementation for both hash-based and sort-based approaches, this PR only adds struct support into the sort-based approach due to the complexity of the hash-based implementation. Struct support for the hash-based approach will be future work.
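The sort-based approach can be illustrated very loosely in plain Python (not the actual libcudf/CUDA implementation; all names here are illustrative): sort the rows by (group key, struct ordering) after dropping null structs, then each group's min is its first row and its max is its last row.

```python
# Loose plain-Python illustration of a sort-based groupby min/max on structs
# (not the libcudf implementation). None stands for null.
from itertools import groupby

def struct_key(row):
    # Null children order before any non-null value in ascending order.
    return tuple((0,) if v is None else (1, v) for v in row)

def groupby_min_max(keys, structs):
    # Drop null structs, then sort by (group key, struct ordering).
    rows = sorted(
        ((k, s) for k, s in zip(keys, structs) if s is not None),
        key=lambda ks: (ks[0], struct_key(ks[1])),
    )
    out = {k: None for k in keys}  # all-null groups stay null
    for k, grp in groupby(rows, key=lambda ks: ks[0]):
        grp = list(grp)
        out[k] = (grp[0][1], grp[-1][1])  # (min, max) per group
    return out

keys = [1, 1, 1, 1, 1, 2]
structs = [None, (None, None), (None, "1"), ("1", None), ("1", "1"), None]
print(groupby_min_max(keys, structs))
# {1: ((None, None), ('1', '1')), 2: None}
```

A hash-based version would need a hasher and equality comparator over nested rows, which is the extra complexity the PR defers to future work.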

Partially addresses #8974 and #7995.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9545
@ttnghia
Contributor

ttnghia commented Jan 18, 2022

This should be closed as it has been addressed.
The remaining struct work is tracked in #8974.

@ttnghia ttnghia closed this as completed Jan 18, 2022