Stop optimizing queries twice #2369

Merged
andygrove merged 6 commits into apache:master from sql-optimize-once
Apr 29, 2022
@andygrove andygrove commented Apr 28, 2022

Which issue does this PR close?

Closes #2368

Rationale for this change

Why do something twice when you can do it once?

I see a speedup of 8% to 11% in the included Criterion benchmark for SQL planning.

What changes are included in this PR?

  • SQL execution no longer optimizes the logical plan before creating the physical plan
  • DataFrame execution no longer optimizes the logical plan before creating the physical plan
  • Optimization happens once, when creating the physical plan
  • Criterion bench added for SQL planning
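
To make the change above concrete, here is a minimal toy sketch of the two flows. The `LogicalPlan`, `Planner`, `plan_old`, and `plan_new` names are hypothetical stand-ins invented for illustration, not the DataFusion API; the only assumption carried over from this PR is that physical planning always runs the optimizer itself.

```rust
use std::cell::Cell;

// Toy stand-ins for illustration only; these are NOT DataFusion types.
#[derive(Debug, Clone, PartialEq)]
struct LogicalPlan {
    steps: Vec<String>,
}

struct Planner {
    // Counts optimizer runs so the two flows can be compared.
    optimize_calls: Cell<u32>,
}

impl Planner {
    fn new() -> Self {
        Planner { optimize_calls: Cell::new(0) }
    }

    // Pretend optimizer: drops "noop" steps and records that it ran.
    fn optimize(&self, plan: &LogicalPlan) -> LogicalPlan {
        self.optimize_calls.set(self.optimize_calls.get() + 1);
        LogicalPlan {
            steps: plan
                .steps
                .iter()
                .filter(|s| s.as_str() != "noop")
                .cloned()
                .collect(),
        }
    }

    // Physical planning always optimizes its input first.
    fn create_physical_plan(&self, plan: &LogicalPlan) -> Vec<String> {
        self.optimize(plan)
            .steps
            .iter()
            .map(|s| format!("exec:{}", s))
            .collect()
    }
}

// Old flow: the caller optimizes, then physical planning optimizes again.
fn plan_old(planner: &Planner, plan: &LogicalPlan) -> Vec<String> {
    let optimized = planner.optimize(plan);
    planner.create_physical_plan(&optimized)
}

// New flow: hand the unoptimized plan straight to physical planning.
fn plan_new(planner: &Planner, plan: &LogicalPlan) -> Vec<String> {
    planner.create_physical_plan(plan)
}

fn main() {
    let plan = LogicalPlan {
        steps: vec!["scan".to_string(), "noop".to_string(), "filter".to_string()],
    };
    let old = Planner::new();
    let new = Planner::new();
    // Both flows produce the same physical plan; only the optimizer run count differs.
    assert_eq!(plan_old(&old, &plan), plan_new(&new, &plan));
    println!(
        "old flow optimizer runs: {}, new flow optimizer runs: {}",
        old.optimize_calls.get(),
        new.optimize_calls.get()
    );
}
```

In this toy model the result is identical either way, so the only thing the old flow adds is a second optimizer pass.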

Are there any user-facing changes?

Yes. Users will now see unoptimized plans. Maybe we will need to make changes to EXPLAIN before we merge this, so that they still have a way to see the optimized plan before execution?

@andygrove andygrove self-assigned this Apr 28, 2022
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Apr 28, 2022
@andygrove
Member Author

@alamb @Dandandan @matthewmturner This might be a step backward in UX so I left this as a draft while we discuss.

Contributor

@alamb alamb left a comment

Thanks @andygrove.

There were previous tickets / PRs on this topic:
#1182
#1183
#705

As I recall, the issue was that we were worried some of the DataFrame APIs would allow running unoptimized plans. However, looking through this API, it seems like we always run an optimized plan 🤔 I can't remember what the problem was.

Only optimizing once sounds like a good idea to me.

cc @xudong963 and @houqp

plan(
ctx.clone(),
"SELECT t1.a99, t2.b99 \
FROM t1, t2 WHERE a199 = b199",
Contributor

Maybe it would be nice to add a GROUP BY as well :)

It is great to start getting benchmarks on the planner 👍

.sql("SELECT * FROM (SELECT 1) AS one WHERE TRUE AND TRUE")
.await?;

assert_eq!(
Contributor

Maybe we could address the UX by making to_logical_plan call optimize?

Member Author

Ah, yes, I like that idea.

Member Author

@alamb I made that change and I think it worked out well. The last commit is quite large because to_logical_plan now returns a Result, so I had to update the call sites. I also stopped using this method internally within DataFrame and replaced those calls with self.plan.clone() (since that is what to_logical_plan was originally doing).
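
A rough sketch of the shape of that change, under stated assumptions: the `LogicalPlan`, `DataFrame`, and `optimize` definitions below are simplified, hypothetical stand-ins rather than the real DataFusion types, and the `String` error type is a placeholder. It only illustrates why the public accessor becomes fallible once it calls the optimizer, while internal methods keep building on `self.plan.clone()`.

```rust
// Hypothetical stand-ins for illustration; not the real DataFusion types.
#[derive(Debug, Clone, PartialEq)]
struct LogicalPlan {
    filters: Vec<String>,
}

// Toy optimizer: removes filters that are trivially TRUE.
// It can fail, which is why callers must handle a Result.
fn optimize(plan: &LogicalPlan) -> Result<LogicalPlan, String> {
    if plan.filters.iter().any(|f| f.is_empty()) {
        return Err("empty filter expression".to_string());
    }
    Ok(LogicalPlan {
        filters: plan
            .filters
            .iter()
            .filter(|f| f.as_str() != "TRUE")
            .cloned()
            .collect(),
    })
}

struct DataFrame {
    plan: LogicalPlan,
}

impl DataFrame {
    // Public accessor: returns the optimized plan, so users still have a
    // way to inspect what will run. Optimizing can fail, hence the Result.
    fn to_logical_plan(&self) -> Result<LogicalPlan, String> {
        optimize(&self.plan)
    }

    // Internal builder: works on the unoptimized plan via self.plan.clone(),
    // so no optimizer pass happens while the query is being assembled.
    fn filter(&self, predicate: &str) -> DataFrame {
        let mut plan = self.plan.clone();
        plan.filters.push(predicate.to_string());
        DataFrame { plan }
    }
}

fn main() {
    let df = DataFrame { plan: LogicalPlan { filters: vec![] } }
        .filter("TRUE")
        .filter("a > 1");
    // The stored plan is still unoptimized; the accessor optimizes on demand.
    println!("stored plan:    {:?}", df.plan);
    println!("optimized plan: {:?}", df.to_logical_plan());
}
```

The trade-off in this sketch is that every call to `to_logical_plan` pays for an optimizer pass, which is acceptable when the method is used for inspection rather than inside the planning path.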

@andygrove andygrove marked this pull request as ready for review April 29, 2022 14:07
@andygrove andygrove merged commit 6a69f52 into apache:master Apr 29, 2022
@andygrove andygrove deleted the sql-optimize-once branch April 29, 2022 15:12
Contributor

@alamb alamb left a comment

This is great -- thank you @andygrove

The issue kept coming up, so I am glad you finally found a solution!

comphead pushed a commit to comphead/arrow-datafusion that referenced this pull request Apr 30, 2022
ovr pushed a commit to cube-js/arrow-datafusion that referenced this pull request Jun 29, 2022
Development

Successfully merging this pull request may close these issues.

SQL queries are optimized twice
2 participants