-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support all equality predicates in equality join #4193
Conversation
What's the difference between
|
Thank you @mingmwang.
The
Yes, you are right. The first two was done by original |
The failed ci is not relative to this pr. @Dandandan @alamb @mingmwang PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ygf11 -- I reviewed this PR fairly carefully and it looks quite good to me as does the test coverage. I had a question about non INNER joins but otherwise this PR is ready to merge from my perspective
Thank you very much
datafusion/sql/src/planner.rs
Outdated
@@ -2837,36 +2861,75 @@ fn remove_join_expressions( | |||
/// foo = bar => accum=[(foo, bar)] accum_filter=[] | |||
/// foo = bar AND bar = baz => accum=[(foo, bar), (bar, baz)] accum_filter=[] | |||
/// foo = bar AND baz > 1 => accum=[(foo, bar)] accum_filter=[baz > 1] | |||
/// | |||
/// For normal expression join key, assume we have tables -- a(c0, c1 c2) and b(c0, c1, c2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this kind of join key is typically called an "equijoin" rather than "normal expression join key"
Thank you for these comments
datafusion/sql/src/planner.rs
Outdated
|
||
let expected = "Projection: person.id, orders.order_id\ | ||
\n Inner Join: person.id + Int64(10) = orders.customer_id * Int64(2)\ | ||
\n Projection: person.id, person.first_name, person.last_name, person.age, person.state, person.salary, person.birth_date, person.😀, person.id + Int64(10)\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, do not get chance to take a look at the PR.
@ygf11 Could you please add one more test case :
let sql = "SELECT *
FROM person \
INNER JOIN orders \
ON orders.customer_id * 2 = person.id + 10"
This is to verify the new added projection will not introduce additional columns in the final result and make sure those temp projected columns will be trimmed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see you have logic to trim the temp projected columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see you have logic to trim the temp projected columns.
Yes, for select *
, the final result will contains the additional columns.
Do you means we need to add a new projection to trim the temp projected columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they should be removed.
\n TableScan: orders"; | ||
quick_test(sql, expected); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice test !
As I recall this type of transformation can not always be done for all types of joins (though perhaps it is ok for equijoins)
For example, I wonder if it is correct to always pull expressions below the join for FULL OUTER JOINs 🤔
Can you add a test for something like
let sql = "SELECT id, order_id \
FROM person \
FULL OUTER JOIN orders \
ON person.id = 10";
if it doesn't already exist?
In that case the predicates need to be evaluated within the join (if it is done as a filter afterwards it may filter rows that should not be filtered)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let sql = "SELECT id, order_id
FROM person
FULL OUTER JOIN orders
ON person.id = 10";
currently I transform this sql
to cross join
as master does, but as you explained it may filter rows that should not be filtered, we should do same for this sql.
Unfortunately, one test case will fail caused by different data type of join keys, after I add the change.
https://github.com/apache/arrow-datafusion/blob/406c1087bc16f8d2a49e5a9b05d2a0e1b67f7aa5/datafusion/core/tests/sql/joins.rs#L369-L385
I think we can fix it after we add type_coercion
for join in optimizer rule.
Any way, I add the full join test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, I wonder if it is correct to always pull expressions below the join for FULL OUTER JOINs 🤔
Can you add a test for something like
let sql = "SELECT id, order_id \ FROM person \ FULL OUTER JOIN orders \ ON person.id = 10";
if it doesn't already exist?
In that case the predicates need to be evaluated within the join (if it is done as a filter afterwards it may filter rows that should not be filtered)
I think it is OK to projection such kind expressions, the type of the new projected column type is bool, the bool value is evaluated again during the join.
One thing that I am not clear now and whether we should cover in this PR and can do in the following PR is
Should we allow the following cases ?
// Join conditions include suspicious Exprs
SELECT id, order_id
FROM person
FULL OUTER JOIN orders
ON person.id = person.name = order.id
// Join conditions include non-deterministic Exprs
SELECT id, order_id
FROM person
FULL OUTER JOIN orders
ON person.id = person.name || random(xxx)
I restarted the failed CI job for this PR |
Please also add test case to verify this SQL to make sure the PR does not add unnecessary projection Exprs.
|
Added as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great -- than you @ygf11 and @mingmwang
|
||
#[test] | ||
fn test_one_side_constant_full_join() { | ||
// TODO: this sql should be parsed as join after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Benchmark runs are scheduled for baseline = 062acad and contender = 822022d. 822022d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
||
// Remove temporary projected columns if necessary. | ||
if left_projected || right_projected { | ||
let final_join_result = join_schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trim the temp projected columns here, and the test case is test_select_all_inner_join
.
cc @mingmwang
.iter() | ||
.flat_map(|expr| expr.try_into_col().is_err().then_some(expr)) | ||
.cloned() | ||
.collect::<HashSet<Expr>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When projection items contain duplicated expressions, the sql will failed.
So I do deduplication here, relative test case is also added.
cc @alamb @mingmwang
Sorry, I forget to submit the review. @alamb @mingmwang, any comment are welcome, I can fix them in another pr. Thank you. |
Thanks @ygf11 -- I think it looks good to me; Thanks again for all your help |
Which issue does this PR close?
Closes #4140.
Rationale for this change
Currently datafusion will parse some joins as
cross join
when left or right join key contain normal expression.For example,
select * from test0 INNER JOIN test1 ON test0.c0 + 1 = test1.c1 * 2
.To improve performance, these joins should also parse as
join
.What changes are included in this PR?
extract_join_keys
.Are these changes tested?
Yes
Are there any user-facing changes?