Support Hash-based group by aggregations for min/max with nesting #17241
Labels: feature request (New feature or request), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)
This is a follow-on issue to #8974.
In Spark we currently support min and max aggregations for nested structs and lists, but only via sort-based group-by, and sorting is very expensive. We would really like a way to speed up these operations, and we think a hash-based aggregate would be a big win. As an example, the following two Spark queries compute exactly the same result; the only difference is that the long value is stored as a top-level column in one and inside a struct in the other. The performance difference is 9x.
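The original queries were not included above. As a purely hypothetical illustration (not the author's actual SQL), the point is that a group-by max over a raw long and a group-by max over a single-field struct wrapping that long are semantically equivalent, because structs compare field-by-field. A minimal plain-Python simulation of that equivalence:

```python
# Hypothetical illustration of the two equivalent aggregations.
# In Spark SQL they would look roughly like:
#   SELECT key, max(v) FROM t  GROUP BY key   -- v is a top-level long
#   SELECT key, max(s) FROM t2 GROUP BY key   -- s is struct(v)
rows = [("a", 3), ("a", 7), ("b", 5), ("b", 2)]

# Top-level long: group-by max over the raw value.
flat_max = {}
for key, v in rows:
    flat_max[key] = max(flat_max.get(key, v), v)

# Struct-wrapped long: model the one-field struct as a tuple; tuples
# compare field-by-field, so max over them yields the same answer.
struct_rows = [(k, (v,)) for k, v in rows]
struct_max = {}
for key, s in struct_rows:
    struct_max[key] = max(struct_max.get(key, s), s)

assert {k: (v,) for k, v in flat_max.items()} == struct_max
```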
That said, the goal is performance, not implementing this for its own sake. I also realize that this is a very contrived case, and a struct with a single value in it is not the goal. The goal is to speed up simple min/max aggregations for nested types in general. I also don't expect the performance of a nested min/max to match that of an aggregation that can use atomic operations.
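The requested hash aggregate can be sketched in plain Python (a minimal sketch of the technique under discussion, not cuDF's implementation): a single pass over the rows updates running min/max values in a hash table keyed by the group key, comparing nested values lexicographically, with no sort anywhere.

```python
def hash_groupby_minmax(rows):
    """Single-pass hash-based group-by min/max.

    rows: iterable of (key, value) pairs, where value may be nested
    (tuples compare field-by-field, like structs). No sort is performed;
    each row updates the running min/max for its group in O(1) expected time.
    """
    acc = {}  # key -> [running_min, running_max]
    for key, val in rows:
        mm = acc.get(key)
        if mm is None:
            acc[key] = [val, val]
        else:
            if val < mm[0]:
                mm[0] = val
            if val > mm[1]:
                mm[1] = val
    return acc

rows = [
    ("a", (3, "x")),  # struct<long, string>-like values
    ("a", (3, "y")),
    ("b", (1, "z")),
    ("a", (2, "w")),
]
result = hash_groupby_minmax(rows)
# result["a"] == [(2, "w"), (3, "y")]
```

Note that nested comparisons cannot be done with a single atomic compare-and-swap the way a plain long's can, which is why the last sentence above tempers the performance expectation even for a hash-based path.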