Support Hash-based group by aggregations for min/max with nesting #17241

Open · revans2 opened this issue Nov 4, 2024 · 0 comments

Labels: feature request (New feature or request), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)

revans2 (Contributor) commented Nov 4, 2024

This is a follow-on issue to #8974.

In Spark we currently support min and max aggregations for nested structs and lists. From a performance standpoint, sorting is very expensive, and it would be a big win if we could speed up these types of operations; we think a hash aggregate is the way to do it. As an example, the two Spark queries below are logically identical. The only difference is that a long value is stored as a top-level column vs. inside a struct. The performance difference is 9x.

scala> import org.apache.spark.sql.functions.{col, max}

scala> spark.time(spark.range(10000000000L).selectExpr("id % 2 as k1", "CAST(id % 3 as STRING) as k2", "id as v").groupBy("k1", "k2").agg(max(col("v"))).orderBy("k1", "k2").show())
+---+---+----------+                                                            
| k1| k2|    max(v)|
+---+---+----------+
|  0|  0|9999999996|
|  0|  1|9999999994|
|  0|  2|9999999998|
|  1|  0|9999999999|
|  1|  1|9999999997|
|  1|  2|9999999995|
+---+---+----------+

Time taken: 4252 ms

scala> spark.time(spark.range(10000000000L).selectExpr("id % 2 as k1", "CAST(id % 3 as STRING) as k2", "struct(id) as v").groupBy("k1", "k2").agg(max(col("v"))).orderBy("k1", "k2").show())
+---+---+------------+                                                          
| k1| k2|      max(v)|
+---+---+------------+
|  0|  0|{9999999996}|
|  0|  1|{9999999994}|
|  0|  2|{9999999998}|
|  1|  0|{9999999999}|
|  1|  1|{9999999997}|
|  1|  2|{9999999995}|
+---+---+------------+

Time taken: 39208 ms
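
As a quick sanity check, explain() shows which aggregate path a query takes. The exact plan nodes vary by Spark version and by whether the RAPIDS plugin is enabled, but on CPU Spark the flat query compiles to HashAggregate while the struct query falls back to SortAggregate (a minimal probe; the small range is just to keep it cheap):

scala> spark.range(1000L).selectExpr("id % 2 as k1", "struct(id) as v").groupBy("k1").agg(max(col("v"))).explain()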

That said, the goal is performance, not implementing this for its own sake. I also realize that this is a very contrived case; a struct with a single value in it is not the goal. The goal is to speed up simple min/max aggregations for nested types. I also don't expect the performance of a nested min/max to match the performance of something that can use atomic operations as part of the aggregation.
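
For reference, and only for the single-field case above, a user-side rewrite can recover the hash-aggregate path today: Spark orders structs lexicographically field by field, so the max of a one-field struct equals the struct of the max of its field. A minimal sketch (the aliases m and id are just for illustration); it deliberately does not generalize to multi-field structs, where lexicographic order is not per-field max, which is exactly why a real hash-based nested min/max is wanted:

scala> import org.apache.spark.sql.functions.{col, max, struct}

scala> spark.time(
     |   spark.range(10000000000L)
     |     .selectExpr("id % 2 as k1", "CAST(id % 3 as STRING) as k2", "id as v")
     |     .groupBy("k1", "k2")
     |     .agg(max(col("v")).as("m"))                                           // hash aggregate over a flat LONG
     |     .select(col("k1"), col("k2"), struct(col("m").as("id")).as("max(v)")) // re-wrap as a struct
     |     .orderBy("k1", "k2")
     |     .show())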

revans2 added the feature request, Performance, and Spark labels on Nov 4, 2024