
[FEA] combine compatible scalar subqueries #4186

Closed
revans2 opened this issue Nov 22, 2021 · 0 comments
Assignees
Labels
performance (A performance related task/issue), task (Work required that improves the product but is not user facing)

Comments


revans2 commented Nov 22, 2021

This is something that we really should look into doing for Spark itself, not as a one-off in the plugin.

TPC-DS query 9 will do a large number of scalar sub-queries to produce the final output. It actually ends up running about 16 jobs in total, 15 of which are scalar sub-queries.

A ScalarSubquery is a subquery that produces a single result (one column and one row). In most cases these look like:

select IF(A > (select avg(A) from my_table), "ABOVE AVERAGE", "NORMAL") from my_table

Here it is doing a reduction on column A to find the average value, then checking whether this particular row's value is above the average for the entire table. To do this Spark will run select avg(A) from my_table as a separate query, collect the result back to the driver, and insert it into a ScalarSubquery expression that returns the result of running the sub-query.
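To make that two-step execution model concrete, here is a small sketch using sqlite3 rather than Spark, with invented data; the table my_table and column A mirror the example above, and the run-the-reduction-then-substitute-the-scalar flow is the point, not the engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (A REAL)")
conn.executemany("INSERT INTO my_table VALUES (?)",
                 [(1.0,), (2.0,), (3.0,), (6.0,)])

# Step 1: run the scalar sub-query on its own, as the text describes,
# and collect its single-row, single-column result.
avg_a = conn.execute("SELECT AVG(A) FROM my_table").fetchone()[0]

# Step 2: substitute the collected scalar into the outer query, the way
# a ScalarSubquery expression hands the driver-side value to each row.
rows = conn.execute(
    "SELECT CASE WHEN A > ? THEN 'ABOVE AVERAGE' ELSE 'NORMAL' END "
    "FROM my_table", (avg_a,)).fetchall()
print(rows)
```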

This is all fine when there is just one sub-query. But if there are a lot of them, as with TPC-DS query 9, you can end up with a lot of overhead from launching multiple jobs. This includes reading the footers of the input files, possibly reading overlapping column data, etc.

I have found that if I manually combine the sub-queries based on their input data and the where clause in each sub-query, I can reduce the number of sub-query jobs that need to run to 6 instead of 16. e.g. in

select IF((select COUNT(A) from my_table) > 100, (select AVG(B) from my_table), (select AVG(c) from my_table)) from my_table

Instead of running 3 separate sub-queries, select COUNT(A) from my_table, select AVG(B) from my_table, and select AVG(c) from my_table, we would rewrite them as a single query, select COUNT(A), AVG(B), AVG(c) from my_table, and then provide an alternative version of ScalarSubquery that knows how to access different columns of the returned data. I predict that this would cut the runtime of query 9 roughly in half.

There are more complex things we can do to combine additional sub-queries when the where clauses of the reductions are not the same, depending on how the clauses overlap. This can involve putting an IF expression inside the reductions based on the where clause. The problem I run into with this is that on the GPU we end up computing the where clause multiple times, which can be very slow compared to the CPU. I will file a separate issue to understand that.
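One common shape of that trick (sketched here with sqlite3 and invented data, not the plugin's actual rewrite) is pushing each sub-query's where clause into a conditional expression inside the aggregate, so two differently filtered reductions become one unfiltered scan; the predicate is evaluated once per reduction, which is where the repeated-where-clause cost mentioned above comes from:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (A INTEGER, B REAL)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)])

# Two reductions with different where clauses: two separate scans.
lo = conn.execute(
    "SELECT AVG(B) FROM my_table WHERE A <= 2").fetchone()[0]
hi = conn.execute(
    "SELECT AVG(B) FROM my_table WHERE A > 2").fetchone()[0]

# Folded into a single scan: each predicate moves inside its aggregate
# as a CASE (IF) expression. Rows that fail the predicate yield NULL,
# and AVG ignores NULLs, so the results are unchanged.
row = conn.execute(
    "SELECT AVG(CASE WHEN A <= 2 THEN B END), "
    "       AVG(CASE WHEN A > 2 THEN B END) FROM my_table").fetchone()

assert row == (lo, hi)
print(row)
```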

@revans2 revans2 added the ? - Needs Triage (Need team to review and classify), performance (A performance related task/issue), and task (Work required that improves the product but is not user facing) labels Nov 22, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Nov 23, 2021
@sperlingxx sperlingxx self-assigned this Jan 24, 2022