Stop optimizing queries twice #2369

andygrove · 2022-04-28T15:19:17Z

Which issue does this PR close?

Closes #2368

Rationale for this change

Why do something twice when you can do it once?

I see a speedup of 8% to 11% in the included Criterion benchmark for SQL planning.

What changes are included in this PR?

SQL execution no longer optimizes the logical plan before creating the physical plan
DataFrame execution no longer optimizes the logical plan before creating the physical plan
Optimization happens once when creating physical plan
Criterion bench added for SQL planning
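As a rough illustration of the change listed above, here is a minimal, self-contained sketch of "optimize twice" versus "optimize once". It is a toy model only: `ToyPlan`, `optimize`, `plan_twice`, `plan_once`, and `create_physical_plan` are hypothetical names invented for this example and are not DataFusion's actual types or functions.

```rust
/// Toy stand-in for a logical plan. This is an assumption made for
/// illustration and is NOT DataFusion's `LogicalPlan`.
#[derive(Clone, Debug, PartialEq)]
pub struct ToyPlan {
    /// Filter predicates still present in the plan.
    pub filters: Vec<String>,
    /// How many times an optimizer pass has run over this plan.
    pub optimizer_runs: usize,
}

/// Hypothetical optimizer pass: drops trivially true predicates
/// and records that it ran.
pub fn optimize(plan: &ToyPlan) -> ToyPlan {
    ToyPlan {
        filters: plan
            .filters
            .iter()
            .filter(|f| f.as_str() != "TRUE")
            .cloned()
            .collect(),
        optimizer_runs: plan.optimizer_runs + 1,
    }
}

/// Physical planning always optimizes its input exactly once.
pub fn create_physical_plan(logical: &ToyPlan) -> ToyPlan {
    optimize(logical)
}

/// Old flow: the caller optimizes first, then physical planning
/// optimizes the already-optimized plan again.
pub fn plan_twice(logical: &ToyPlan) -> ToyPlan {
    let optimized = optimize(logical);
    create_physical_plan(&optimized)
}

/// New flow: the unoptimized plan goes straight to physical planning,
/// so the optimizer runs only once.
pub fn plan_once(logical: &ToyPlan) -> ToyPlan {
    create_physical_plan(logical)
}
```

In this toy model both flows produce the same filters, but `plan_twice` runs the optimizer two times and `plan_once` runs it one time, which is the redundant work this PR aims to remove.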

Are there any user-facing changes?

Yes. Users will now see unoptimized plans. We may need to change EXPLAIN before we merge this so that users still have a way to see the optimized plan before execution.

andygrove · 2022-04-28T15:20:44Z

@alamb @Dandandan @matthewmturner This might be a step backward in UX so I left this as a draft while we discuss.

alamb

Thanks @andygrove .

There were previous tickets / PRs on this topic:
#1182
#1183
#705

As I recall, the issue was that we were worried that some of the DataFrame APIs would allow running unoptimized plans. However, looking through this API, it seems like we always run an optimized plan 🤔 I can't remember what the problem was.

Only optimizing once sounds like a good idea to me.

cc @xudong963 and @houqp

alamb · 2022-04-28T21:02:34Z

datafusion/core/benches/sql_planner.rs

+            plan(
+                ctx.clone(),
+                "SELECT t1.a99, t2.b99  \
+                 FROM t1, t2 WHERE a199 = b199",


Maybe it would be nice to add a GROUP BY as well :)

It is great to start getting benchmarks on the planner 👍

alamb · 2022-04-28T21:11:58Z

datafusion/core/src/execution/context.rs

-            .sql("SELECT * FROM (SELECT 1) AS one WHERE TRUE AND TRUE")
-            .await?;
-
-        assert_eq!(


Maybe we could address the UX by making to_logical_plan call optimize?

andygrove · 2022-04-28

Ah, yes, I like that idea.

andygrove · 2022-04-29

@alamb I made that change and I think it worked out well. The last commit is quite large because to_logical_plan now returns a Result, so I had to update the call sites. I also stopped using this method internally within DataFrame and replaced those calls with self.plan.clone(), since that is what to_logical_plan was originally doing.
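The shape of that change can be sketched roughly as follows. This is a simplified toy model, not the actual DataFusion code: `ToyPlan`, `ToyError`, `ToyDataFrame`, and the `optimize` helper are hypothetical names chosen for illustration. Only the method name `to_logical_plan` and the use of `self.plan.clone()` come from the discussion above.

```rust
/// Toy stand-in for a logical plan (assumption, not DataFusion's type).
#[derive(Clone, Debug, PartialEq)]
pub struct ToyPlan {
    pub filters: Vec<String>,
}

/// Toy error type (assumption) so that optimization can fail.
#[derive(Debug, PartialEq)]
pub struct ToyError(pub String);

/// Hypothetical optimizer: rejects empty predicates and drops
/// trivially true ones.
fn optimize(plan: &ToyPlan) -> Result<ToyPlan, ToyError> {
    if plan.filters.iter().any(|f| f.is_empty()) {
        return Err(ToyError("empty predicate".to_string()));
    }
    Ok(ToyPlan {
        filters: plan
            .filters
            .iter()
            .filter(|f| f.as_str() != "TRUE")
            .cloned()
            .collect(),
    })
}

/// Toy stand-in for a DataFrame that wraps an unoptimized plan.
pub struct ToyDataFrame {
    plan: ToyPlan,
}

impl ToyDataFrame {
    pub fn new(plan: ToyPlan) -> Self {
        Self { plan }
    }

    /// Public accessor: returns the optimized plan. It is fallible
    /// because the optimizer can return an error, which is why the
    /// return type becomes a `Result`.
    pub fn to_logical_plan(&self) -> Result<ToyPlan, ToyError> {
        optimize(&self.plan)
    }

    /// Internal plan building works on the raw plan through
    /// `self.plan.clone()` and does not call `to_logical_plan`,
    /// so no optimization happens at this point.
    pub fn filter(&self, predicate: &str) -> ToyDataFrame {
        let mut plan = self.plan.clone();
        plan.filters.push(predicate.to_string());
        ToyDataFrame { plan }
    }
}
```

Under these assumptions, users who call `to_logical_plan` still see an optimized plan, while chained DataFrame operations keep building on the unoptimized plan and leave optimization to a single later step.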

alamb

This is great -- thank you @andygrove

The issue kept coming up, so I am glad you finally found a solution!

andygrove added 2 commits April 28, 2022 09:04

Stop optimizing SQL queries twice

a73f1bf

remove dupe bench

15b7c75

andygrove self-assigned this Apr 28, 2022

github-actions bot added the datafusion label (Changes in the datafusion crate) Apr 28, 2022

merge from master

e5ce498

alamb approved these changes Apr 28, 2022

to_logical_plan now returns the optimized plan

2d58094

github-actions bot added the ballista label Apr 29, 2022

add another benchmark

9a2631f

andygrove marked this pull request as ready for review April 29, 2022 14:07

revert change to release profile

458839c

andygrove merged commit 6a69f52 into apache:master Apr 29, 2022

andygrove deleted the sql-optimize-once branch April 29, 2022 15:12

alamb reviewed Apr 29, 2022

comphead pushed a commit to comphead/arrow-datafusion that referenced this pull request Apr 30, 2022

Stop optimizing queries twice (apache#2369)

e354dc3

ovr pushed a commit to cube-js/arrow-datafusion that referenced this pull request Jun 29, 2022

Stop optimizing queries twice (apache#2369)

45bc3ee

ovr pushed a commit to cube-js/arrow-datafusion that referenced this pull request Jun 29, 2022

Stop optimizing queries twice (apache#2369)

d73dfda
