[FEA] combine compatible scalar subqueries #4186
Labels
performance
A performance related task/issue
task
Work required that improves the product but is not user facing
This is something that we really should look into doing for Spark itself and not as a one off in the plugin.
TPC-DS query 9 will do a large number of scalar sub-queries to produce the final output. It actually ends up running about 16 jobs in total, 15 of which are scalar sub-queries.
A ScalarSubquery is a subquery that produces a single result (one column and one row). In most cases these are things like
select A from my_table where A > (select avg(A) from my_table)
Here the sub-query does a reduction on column A to find the average value, and the outer query then checks whether each row's value is above the average for the entire table. To do this Spark will run
select avg(A) from my_table
as a separate query, collect the result back to the driver, and insert it into a ScalarSubquery
expression that returns the result of running the sub-query. This is all great when there is just one sub-query, but if there are a lot of them, like with TPC-DS query 9, you can end up with a lot of overhead launching multiple jobs. This includes reading the footers on the input files, possibly reading in overlapping column data, etc.
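The two-job pattern described above can be sketched outside of Spark. This is a minimal plain-Python illustration with made-up data (the `rows` table and column names are hypothetical), showing the sub-query running as its own pass, its single-cell result being "collected to the driver", and the outer query using that result as a literal:

```python
# Hypothetical single-column table standing in for my_table.
rows = [{"A": 1.0}, {"A": 4.0}, {"A": 10.0}]

# Job 1: the scalar sub-query "select avg(A) from my_table".
# In Spark this runs as a separate job and the one-row, one-column
# result is collected back to the driver.
avg_a = sum(r["A"] for r in rows) / len(rows)

# Outer query: "select A from my_table where A > <scalar result>".
# The collected value is now just a literal in the filter expression.
above_avg = [r["A"] for r in rows if r["A"] > avg_a]
print(above_avg)  # [10.0]
```

Each additional scalar sub-query repeats the whole first pass, which is where the per-job overhead comes from.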
I have found that if I manually combine the sub-queries based on the input data and the where clause in the sub-query, I can reduce the number of sub-query jobs that need to run to 6 instead of 16. i.e.
Instead of running 3 separate sub-queries
select COUNT(A) from my_table
select AVG(B) from my_table
select AVG(C) from my_table
we would rewrite them as a single query
select COUNT(A), AVG(B), AVG(C) from my_table
and then we could have an alternative version of ScalarSubquery that knows how to access different columns of the returned data. I predict that this would cut the time of query 9 roughly in half.

There are more complex things we can do to combine even more sub-queries together when the where clause of a reduction is not the same, depending on how the clauses overlap. This can involve putting an IF expression in the reductions based on the where clause. The problem I ran into with this is that on the GPU we end up computing the where clause multiple times, and it can be very slow compared to the CPU. I will file a separate issue to understand that.
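The IF-expression trick for reductions with different where clauses can be illustrated without Spark. This is a plain-Python sketch with hypothetical data and predicates: two sub-queries such as select avg(A) from t where cat = 'x' and select avg(A) from t where cat = 'y' collapse into one pass, equivalent to select avg(IF(cat = 'x', A, NULL)), avg(IF(cat = 'y', A, NULL)) from t, relying on the fact that AVG ignores NULLs:

```python
# Hypothetical table; "cat" stands in for the differing where clauses.
rows = [{"A": 1.0, "cat": "x"}, {"A": 3.0, "cat": "x"}, {"A": 10.0, "cat": "y"}]

def cond_avg(rows, pred):
    # Equivalent to AVG(IF(pred, A, NULL)): rows failing the predicate
    # contribute NULL, which AVG skips.
    vals = [r["A"] for r in rows if pred(r)]
    return sum(vals) / len(vals) if vals else None

# Both averages computed in what would be a single scan of the table.
avg_x = cond_avg(rows, lambda r: r["cat"] == "x")
avg_y = cond_avg(rows, lambda r: r["cat"] == "y")
print(avg_x, avg_y)  # 2.0 10.0
```

Note that each conditional aggregate still evaluates its predicate, which is the repeated where-clause cost mentioned above for the GPU.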