[FEA] min and max aggregation and reduction for single level struct types #7995
Comments
Python generally doesn't have strict requirements for struct columns, since Pandas stores a column of Python dictionaries in this situation and defers to Python dictionary semantics, which don't allow comparisons. Generally, I would just push for our null handling semantics to be consistent between parent and child columns.
This issue has been labeled
This is still wanted.
… aggregate() and scan() (#9545)

This PR adds struct support for `min`, `max`, `argmin` and `argmax` in groupby aggregation, and `min`, `max` in groupby scan. Although these operations require implementation for both hash-based and sort-based approaches, this PR only adds struct support to the sort-based approach, due to the complexity of the hash-based implementation. Struct support for the hash-based approach will be future work.

Partially addresses #8974 and #7995.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #9545
This should be closed as it has been addressed.
Is your feature request related to a problem? Please describe.
We would like to be able to support the `min` and `max` SQL operators for single level structs (no nesting) in Spark. By this I mean that a struct of strings is required to be supported, but a struct of lists or a struct of other structs is not.

Describe the solution you'd like
It would be great if we could get the existing `min` and `max` aggregations to work on non-nested structs for group by aggregations and reductions. Bonus points if we can make it work for window operations too, but that is not a requirement.

Describe alternatives you've considered
Because of how the group by aggregations work there really is no way for us to pick apart the struct and do the aggregation using existing APIs.
Additional context
This should follow the rules that min/max already support, and also match the sorting for structs (#7422):

Min should return the minimum non-null value in the range. If all of the values in the range are null, then null is returned.
The minimum value should be determined by the ordering of the struct, sorted in ascending order with nulls last for each of the child columns.
Max should return the maximum non-null value in the range. If all of the values in the range are null, then null is returned.
The maximum value should be determined by the same ordering of the struct, in ascending order with nulls last for each of the child columns.
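The rules above can be sketched in plain Python. This is not the cudf implementation, just an illustrative model: a struct column is represented as a list of tuples, with `None` for a null struct and `None` inside a tuple for a null child. The key function makes null children sort after any real value, giving the "ascending with nulls last per child column" ordering described above.

```python
def _null_last_key(struct):
    # For each child value, (is_null, value) sorts None after any real value,
    # so child columns compare lexicographically, ascending, nulls last.
    return tuple((v is None, v) for v in struct)

def struct_min(column):
    # Null structs themselves are skipped entirely.
    non_null = [s for s in column if s is not None]
    if not non_null:
        return None  # all values in the range are null -> null result
    return min(non_null, key=_null_last_key)

def struct_max(column):
    non_null = [s for s in column if s is not None]
    if not non_null:
        return None
    return max(non_null, key=_null_last_key)
```

For example, `struct_min([("b", "x"), None, ("a", None), ("a", "z")])` returns `("a", "z")`: the null struct is ignored, and `("a", "z")` beats `("a", None)` because a null child sorts last.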
In this case we have a struct column `a` with two string entries in it. If a string is printed as an empty string it is really null (sorry, that is how Spark prints it). So the minimum value for group 1 is `[,]`, i.e. both null values. This is because the null struct itself is filtered out, and null child columns sort last (for min) in ascending order. Whereas for group 2, where all of the structs are null, a null is returned.

I would love to hear from the Python side if they have different requirements.
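The original example table is not reproduced here, so the sketch below assumes hypothetical input matching the description: group 1 holds a null struct plus a struct whose children are both null, and group 2 holds only null structs. It models the described group-by min semantics in plain Python (not the cudf API): null structs are dropped per group, and null children sort last in ascending order.

```python
from collections import defaultdict

def _null_last_key(struct):
    # Child columns compare ascending, with None sorting after any real value.
    return tuple((v is None, v) for v in struct)

def groupby_struct_min(keys, structs):
    groups = defaultdict(list)
    for k, s in zip(keys, structs):
        groups[k].append(s)
    out = {}
    for k, rows in groups.items():
        non_null = [s for s in rows if s is not None]  # drop null structs
        out[k] = min(non_null, key=_null_last_key) if non_null else None
    return out

# Hypothetical data: group 1 -> min is the all-null-children struct (None, None),
# since the null struct is filtered out; group 2 -> all structs null, so None.
keys = [1, 1, 2, 2]
a = [None, (None, None), None, None]
```

Here `groupby_struct_min(keys, a)` yields `(None, None)` for group 1 (the `[,]` result above) and `None` for group 2.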