-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup DFSchema::merge
using HashSet indices
#9020
Speedup DFSchema::merge
using HashSet indices
#9020
Conversation
Nice 😀 Trade off seems worth it. I'm curious what's going on in that TPCH benchmark though as the other physical select benchmark showed solid improvement so not sure it's all physical planning. |
Co-authored-by: comphead <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm thanks @simonvandel
I think this is an acceptable trade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @simonvandel 🙏
Anything I need to do before we can merge this? |
Thanks again @simonvandel and @matthewmturner and @comphead |
Which issue does this PR close?
Relates to #5637
Rationale for this change
Planning time of Datafusion is slow.
What changes are included in this PR?
Profiling shows that a large part of planning is spent in
DFSchema::merge
.Before this PR, merging was complexity
O(N * M)
, whereN
andM
are number of fields in the two schemas.With this PR, merging is complexity
O(N + M)
using a HashSet as an index.Speeds up benchmarks in sql_planner.rs by 18-75%. However, physical planning benchmarks seem to regress by 1.8%. I wonder if that is an acceptable trade?
CC @matthewmturner who seems interested in planning performance
Benchmark results
Are these changes tested?
Should be in existing tests.
Are there any user-facing changes?
Faster planning.