-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop optimizing queries twice #2369
Conversation
@alamb @Dandandan @matthewmturner This might be a step backward in UX so I left this as a draft while we discuss. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove .
There were previous tickets / PRs on this topic:
#1182
#1183
#705
As I recall the issue was that we were worried that some of the DataFrame APIs would allow running unoptimized plans. However, when looking through this API it seems like we always run an optimized plan 🤔 I can't remember what the problem was
The idea of only optimizing once sounds like a good idea to me
cc @xudong963 and @houqp
plan( | ||
ctx.clone(), | ||
"SELECT t1.a99, t2.b99 \ | ||
FROM t1, t2 WHERE a199 = b199", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe would be nice to add a GROUP BY
as well :)
It is great to start getting benchmarks on the planner 👍
.sql("SELECT * FROM (SELECT 1) AS one WHERE TRUE AND TRUE") | ||
.await?; | ||
|
||
assert_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could address the UX by making to_logical_plan
call optimize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, I like that idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb I made that change and I think that worked out well. The last commit is quite large because I had to make to_logical_plan
return a Result
now so had to update the call sites and I also stopped using this method internally within DataFrame
and replaced those calls with self.plan.clone()
instead (since that is what to_logical_plan
was originally doing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great -- thank you @andygrove
the issue kept coming up so I am glad you finally found a solution!
Which issue does this PR close?
Closes #2368
Rationale for this change
Why do something twice when you can do it once.
I see speedup of 8% - 11% in the included criterion benchmark for SQL planning.
What changes are included in this PR?
Are there any user-facing changes?
Yes. Users will now see unoptimized plans and maybe we will need to make changes to EXPLAIN before we merge this so that they still have a way to see the optimized plan before execution?