[FEA] combine compatible scalar subqueries #4186
Labels
performance
A performance related task/issue
task
Work required that improves the product but is not user facing
This is something that we really should look into doing for Spark itself and not as a one off in the plugin.
TPC-DS query 9 will do a large number of scalar sub-queries to produce the final output. It actually ends up running about 16 jobs in total, 15 of which are scalar sub-queries.
A ScalarSubquery is a subquery that produces a single result (one column and one row). In most cases these are things like
select A from my_table where A > (select avg(A) from my_table)
Here the sub-query does a reduction on column A to find the average value, and the outer query then checks whether each row's value is above the average for the entire table. To do this Spark will run
select avg(A) from my_table
as a separate query, collect the result back to the driver, and insert it into a ScalarSubquery
expression that returns the result of running the sub-query. This is all great when there is just one sub-query, but if there are a lot of them, like with TPC-DS query 9, you can end up with a lot of overhead launching multiple jobs. This includes reading the footers on the input files, possibly reading in overlapping column data, etc.
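The two-job pattern described above can be sketched outside of Spark. This is a minimal plain-Python illustration with made-up data (the `rows` table and column names are hypothetical), showing the sub-query running as its own pass, its single-cell result being "collected to the driver", and the outer query using that result as a literal:

```python
# Hypothetical single-column table standing in for my_table.
rows = [{"A": 1.0}, {"A": 4.0}, {"A": 10.0}]

# Job 1: the scalar sub-query "select avg(A) from my_table".
# In Spark this runs as a separate job and the one-row, one-column
# result is collected back to the driver.
avg_a = sum(r["A"] for r in rows) / len(rows)

# Outer query: "select A from my_table where A > <scalar result>".
# The collected value is now just a literal in the filter expression.
above_avg = [r["A"] for r in rows if r["A"] > avg_a]
print(above_avg)  # [10.0]
```

Each additional scalar sub-query repeats the whole first pass, which is where the per-job overhead comes from.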
I have found that if I manually combine the sub-queries based on the input data and the where clause in the sub-query, I can reduce the number of sub-query jobs that need to run to 6 instead of 16. i.e.
Instead of running 3 separate sub-queries
select COUNT(A) from my_table
select AVG(B) from my_table
select AVG(C) from my_table
we would rewrite them as a single query
select COUNT(A), AVG(B), AVG(C) from my_table
and then we could have an alternative version of ScalarSubquery that knows how to access different columns of the returned data. I predict that this would cut the time of query 9 roughly in half.

There are more complex things we can do to combine even more sub-queries together when the where clause of a reduction is not the same, depending on how the clauses overlap. This can involve putting an IF expression in the reductions based on the where clause. The problem I ran into with this is that on the GPU we end up computing the where clause multiple times, and it can be very slow compared to the CPU. I will file a separate issue to understand that.
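The IF-expression trick for reductions with different where clauses can be illustrated without Spark. This is a plain-Python sketch with hypothetical data and predicates: two sub-queries such as select avg(A) from t where cat = 'x' and select avg(A) from t where cat = 'y' collapse into one pass, equivalent to select avg(IF(cat = 'x', A, NULL)), avg(IF(cat = 'y', A, NULL)) from t, relying on the fact that AVG ignores NULLs:

```python
# Hypothetical table; "cat" stands in for the differing where clauses.
rows = [{"A": 1.0, "cat": "x"}, {"A": 3.0, "cat": "x"}, {"A": 10.0, "cat": "y"}]

def cond_avg(rows, pred):
    # Equivalent to AVG(IF(pred, A, NULL)): rows failing the predicate
    # contribute NULL, which AVG skips.
    vals = [r["A"] for r in rows if pred(r)]
    return sum(vals) / len(vals) if vals else None

# Both averages computed in what would be a single scan of the table.
avg_x = cond_avg(rows, lambda r: r["cat"] == "x")
avg_y = cond_avg(rows, lambda r: r["cat"] == "y")
print(avg_x, avg_y)  # 2.0 10.0
```

Note that each conditional aggregate still evaluates its predicate, which is the repeated where-clause cost mentioned above for the GPU.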