Support Hash-based group by aggregations for min/max with nesting #17241

Open · revans2 opened this issue Nov 4, 2024 · 0 comments

Labels: feature request (New feature or request), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)

revans2 (Contributor) commented Nov 4, 2024

This is a follow-on issue to #8974.

In Spark we currently support min and max aggregations for nested structs and lists. From a performance standpoint, sorting is very expensive, and it would be a big win if we could speed up these types of operations; we think a hash aggregate is the way to do it. As an example, the two Spark queries below are logically identical. The only difference is that a long value is stored as a top-level column vs. inside a struct. The performance difference is 9x.

scala> import org.apache.spark.sql.functions.{col, max}

scala> spark.time(spark.range(10000000000L).selectExpr("id % 2 as k1", "CAST(id % 3 as STRING) as k2", "id as v").groupBy("k1", "k2").agg(max(col("v"))).orderBy("k1", "k2").show())
+---+---+----------+                                                            
| k1| k2|    max(v)|
+---+---+----------+
|  0|  0|9999999996|
|  0|  1|9999999994|
|  0|  2|9999999998|
|  1|  0|9999999999|
|  1|  1|9999999997|
|  1|  2|9999999995|
+---+---+----------+

Time taken: 4252 ms

scala> spark.time(spark.range(10000000000L).selectExpr("id % 2 as k1", "CAST(id % 3 as STRING) as k2", "struct(id) as v").groupBy("k1", "k2").agg(max(col("v"))).orderBy("k1", "k2").show())
+---+---+------------+                                                          
| k1| k2|      max(v)|
+---+---+------------+
|  0|  0|{9999999996}|
|  0|  1|{9999999994}|
|  0|  2|{9999999998}|
|  1|  0|{9999999999}|
|  1|  1|{9999999997}|
|  1|  2|{9999999995}|
+---+---+------------+

Time taken: 39208 ms
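
As a quick sanity check, explain() shows which aggregate path a query takes. The exact plan nodes vary by Spark version and by whether the RAPIDS plugin is enabled, but on CPU Spark the flat query compiles to HashAggregate while the struct query falls back to SortAggregate (a minimal probe; the small range is just to keep it cheap):

scala> spark.range(1000L).selectExpr("id % 2 as k1", "struct(id) as v").groupBy("k1").agg(max(col("v"))).explain()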

That said, the goal is performance, not implementing this for its own sake. I also realize that this is a very contrived case; a struct with a single value in it is not the goal. The goal is to speed up simple min/max aggregations for nested types. I also don't expect the performance of a nested min/max to match the performance of something that can use atomic operations as part of the aggregation.
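
For reference, and only for the single-field case above, a user-side rewrite can recover the hash-aggregate path today: Spark orders structs lexicographically field by field, so the max of a one-field struct equals the struct of the max of its field. A minimal sketch (the aliases m and id are just for illustration); it deliberately does not generalize to multi-field structs, where lexicographic order is not per-field max, which is exactly why a real hash-based nested min/max is wanted:

scala> import org.apache.spark.sql.functions.{col, max, struct}

scala> spark.time(
     |   spark.range(10000000000L)
     |     .selectExpr("id % 2 as k1", "CAST(id % 3 as STRING) as k2", "id as v")
     |     .groupBy("k1", "k2")
     |     .agg(max(col("v")).as("m"))                                           // hash aggregate over a flat LONG
     |     .select(col("k1"), col("k2"), struct(col("m").as("id")).as("max(v)")) // re-wrap as a struct
     |     .orderBy("k1", "k2")
     |     .show())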

revans2 added the feature request, Performance, and Spark labels on Nov 4, 2024