-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add optimize rule rewrite_disjunctive_predicate
#2858
Conversation
501fdb8
to
f690cf1
Compare
Codecov Report
@@ Coverage Diff @@
## master #2858 +/- ##
==========================================
+ Coverage 85.62% 85.69% +0.06%
==========================================
Files 279 280 +1
Lines 50965 51245 +280
==========================================
+ Hits 43641 43916 +275
- Misses 7324 7329 +5
Help us with your feedback. Take ten seconds to tell us how you rate us. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool @xudong963 - thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR also needs tests --
Perhaps some unit tests in datafusion/optimizer/src/rewrite_disjunctive_predicate.rs and then a explain test for q18 showing the inner join?
I think the very nice code structure in datafusion/optimizer/src/rewrite_disjunctive_predicate.rs would make it quite easy to write unit tests.
Currently, q19 can't be converted to inner join, because the logic of
Yes, added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963 -- I think this PR is correct and could be merged as is. Very nice 👌
I left several suggestions on how to improve the code and I think the testing coverage could also be improved, but that could be done in a follow on PR, and I would be happy to help do so too
fn delete_duplicate_predicates(or_predicates: &[Predicate]) -> Predicate { | ||
let mut shortest_exprs: Vec<Predicate> = vec![]; | ||
let mut shortest_exprs_len = 0; | ||
// choose the shortest AND predicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the need for checking the shortest AND predicate -- is there some test case that would show why picking this is important?
Or maybe another question is "why not check all elements?" Perhaps by keeping a set of expressions that were common (and checking each element in the set for its inclusion in all the arguments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the need for checking the shortest AND predicate
The shortest AND predicate could be the common expression to be extracted if each of its elements appears in all OR predicates.
why not check all elements
We don't need to check all elements, only the shortest could be the common expression.
keeping a set of expressions
Yes, shortest_exprs
should be a HashSet
, but to avoid implementing Eq
and Hash
(it'll be spread deeply), I use Vec
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to check all elements, only the shortest could be the common expression.
I don't understand why only the shortest could be the common expression. I am probably missing something.
For example, in the following predicate, the common sub expression(p_partkey = l_partkey OR p_partkey > 5)
is not the shortest
(
(p_partkey = l_partkey OR p_partkey > 5)
and p_brand = 'Brand#12'
)
or
(
(p_partkey = l_partkey OR p_partkey > 5)
and p_size between 1 and 10
)
or
(
(p_partkey = l_partkey OR p_partkey > 5)
and p_size between 1 and 15
)";
and yet it could be factored out
(p_partkey = l_partkey OR p_partkey > 5)
and
(
p_brand = 'Brand#12'
)
or
(
p_size between 1 and 10
)
or
(
p_size between 1 and 15
)";
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, I don't think it is incorrect to pick the shortest predicate, but I do think it may miss potential rewrites. We can always improve it in the future perhaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, in the following predicate, the common sub expression
(p_partkey = l_partkey OR p_partkey > 5)
is not the shortest
The common sub-expression is from the shortest AND predicate, but the shortest AND predicate is not equal to the common sub-expression(beside each element in the shortest AND predicate is in all the OR arguments.)
Thanks @alamb, learn much from your comments, and I'll address your comments in the PR later. We can keep it open. |
Marking as draft to signify more work is planned on this PR |
Sorry for the delay in updating, I'm busy with work recently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thank you @xudong963
fn delete_duplicate_predicates(or_predicates: &[Predicate]) -> Predicate { | ||
let mut shortest_exprs: Vec<Predicate> = vec![]; | ||
let mut shortest_exprs_len = 0; | ||
// choose the shortest AND predicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to check all elements, only the shortest could be the common expression.
I don't understand why only the shortest could be the common expression. I am probably missing something.
For example, in the following predicate, the common sub expression(p_partkey = l_partkey OR p_partkey > 5)
is not the shortest
(
(p_partkey = l_partkey OR p_partkey > 5)
and p_brand = 'Brand#12'
)
or
(
(p_partkey = l_partkey OR p_partkey > 5)
and p_size between 1 and 10
)
or
(
(p_partkey = l_partkey OR p_partkey > 5)
and p_size between 1 and 15
)";
and yet it could be factored out
(p_partkey = l_partkey OR p_partkey > 5)
and
(
p_brand = 'Brand#12'
)
or
(
p_size between 1 and 10
)
or
(
p_size between 1 and 15
)";
🤔
})) | ||
} | ||
_ => { | ||
let expr = plan.expressions(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 very nice
} | ||
); | ||
let rewritten_predicate = rewrite_predicate(predicate); | ||
assert_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
fn delete_duplicate_predicates(or_predicates: &[Predicate]) -> Predicate { | ||
let mut shortest_exprs: Vec<Predicate> = vec![]; | ||
let mut shortest_exprs_len = 0; | ||
// choose the shortest AND predicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, I don't think it is incorrect to pick the shortest predicate, but I do think it may miss potential rewrites. We can always improve it in the future perhaps
Co-authored-by: Andrew Lamb <[email protected]>
🚀 -- thanks again @xudong963 ! |
Benchmark runs are scheduled for baseline = 0f19990 and contender = 4005076. 4005076 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Closes #217
Query plan for tpch 18, please focus on filter plan
We need to migrate the
cross join -> inner join optimization
from the planner to the optimizer so that tpch 19 can be further optimized to inner join using the predicate extracted byrewrite_disjunctive_predicate
. #2859