Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support all equality predicates in equality join #4193

Merged
merged 12 commits into from
Nov 17, 2022

Conversation

ygf11
Copy link
Contributor

@ygf11 ygf11 commented Nov 13, 2022

Which issue does this PR close?

Closes #4140.

Rationale for this change

Currently datafusion will parse some joins as cross join when left or right join key contain normal expression.
For example, select * from test0 INNER JOIN test1 ON test0.c0 + 1 = test1.c1 * 2.

To improve performance, these joins should also parse as join.

What changes are included in this PR?

  • Support extract expression join key in extract_join_keys.
  • Add projection for input if needs when create logical plan of join.

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions sql SQL Planner labels Nov 13, 2022
@mingmwang
Copy link
Contributor

What's the difference between extract_join_keys() and extract_possible_join_keys()?
Not sure whether the two methods can be combined or not.
From my understanding, I think the extract process should be

  1. Convert the expr/predicate to CNF (I think we already had the utils)
  2. Split the CNF to Vec of expr/predicate (I think we already had the utils)
  3. For each expr in the split exprs, check if is a BinaryExpr with Operator::Eq and the left and right are coming from different input schemas.

@alamb @jackwener

@ygf11
Copy link
Contributor Author

ygf11 commented Nov 14, 2022

Thank you @mingmwang.

Not sure whether the two methods can be combined or not.

The extract_join_keys is used to generate explicit join plan, and the extract_possible_join_keys is used to optimize selection clause from cross join to inner join.

I think the logics of generating two join plan are similar, so we can try to combine them next.

The extract_possible_join_keys will be removed in pr:

From my understanding, I think the extract process should be

Yes, you are right. The first two was done by original extract_join_keys, and I add the third to extract_join_keys in this pr.

@ygf11 ygf11 marked this pull request as ready for review November 14, 2022 09:01
@ygf11
Copy link
Contributor Author

ygf11 commented Nov 14, 2022

The failed ci is not relative to this pr.

@Dandandan @alamb @mingmwang PTAL.

@alamb alamb changed the title Support normal expressions in equality join Support all equality predicates in equality join Nov 15, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ygf11 -- I reviewed this PR fairly carefully and it looks quite good to me as does the test coverage. I had a question about non INNER joins but otherwise this PR is ready to merge from my perspective

Thank you very much

@@ -2837,36 +2861,75 @@ fn remove_join_expressions(
/// foo = bar => accum=[(foo, bar)] accum_filter=[]
/// foo = bar AND bar = baz => accum=[(foo, bar), (bar, baz)] accum_filter=[]
/// foo = bar AND baz > 1 => accum=[(foo, bar)] accum_filter=[baz > 1]
///
/// For normal expression join key, assume we have tables -- a(c0, c1 c2) and b(c0, c1, c2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kind of join key is typically called an "equijoin" rather than "normal expression join key"

Thank you for these comments

datafusion/sql/src/planner.rs Outdated Show resolved Hide resolved

let expected = "Projection: person.id, orders.order_id\
\n Inner Join: person.id + Int64(10) = orders.customer_id * Int64(2)\
\n Projection: person.id, person.first_name, person.last_name, person.age, person.state, person.salary, person.birth_date, person.😀, person.id + Int64(10)\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@mingmwang mingmwang Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, do not get chance to take a look at the PR.
@ygf11 Could you please add one more test case :

        let sql = "SELECT *
            FROM person \
            INNER JOIN orders \
            ON orders.customer_id * 2 = person.id + 10"

This is to verify the new added projection will not introduce additional columns in the final result and make sure those temp projected columns will be trimmed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see you have logic to trim the temp projected columns.

Copy link
Contributor Author

@ygf11 ygf11 Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see you have logic to trim the temp projected columns.

Yes, for select *, the final result will contains the additional columns.

Do you means we need to add a new projection to trim the temp projected columns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they should be removed.

\n TableScan: orders";
quick_test(sql, expected);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice test !

As I recall this type of transformation can not always be done for all types of joins (though perhaps it is ok for equijoins)

For example, I wonder if it is correct to always pull expressions below the join for FULL OUTER JOINs 🤔

Can you add a test for something like

 let sql = "SELECT id, order_id \
            FROM person \
            FULL OUTER JOIN orders \
            ON person.id = 10";

if it doesn't already exist?

In that case the predicates need to be evaluated within the join (if it is done as a filter afterwards it may filter rows that should not be filtered)

Copy link
Contributor Author

@ygf11 ygf11 Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let sql = "SELECT id, order_id
FROM person
FULL OUTER JOIN orders
ON person.id = 10";

currently I transform this sql to cross join as master does, but as you explained it may filter rows that should not be filtered, we should do same for this sql.

Unfortunately, one test case will fail caused by different data type of join keys, after I add the change.
https://github.com/apache/arrow-datafusion/blob/406c1087bc16f8d2a49e5a9b05d2a0e1b67f7aa5/datafusion/core/tests/sql/joins.rs#L369-L385

I think we can fix it after we add type_coercion for join in optimizer rule.

Any way, I add the full join test case.

Copy link
Contributor

@mingmwang mingmwang Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, I wonder if it is correct to always pull expressions below the join for FULL OUTER JOINs 🤔

Can you add a test for something like

 let sql = "SELECT id, order_id \
            FROM person \
            FULL OUTER JOIN orders \
            ON person.id = 10";

if it doesn't already exist?

In that case the predicates need to be evaluated within the join (if it is done as a filter afterwards it may filter rows that should not be filtered)

I think it is OK to projection such kind expressions, the type of the new projected column type is bool, the bool value is evaluated again during the join.

One thing that I am not clear now and whether we should cover in this PR and can do in the following PR is
Should we allow the following cases ?

// Join conditions include suspicious Exprs

SELECT id, order_id 
               FROM person 
               FULL OUTER JOIN orders 
               ON person.id = person.name = order.id

// Join conditions include non-deterministic Exprs

SELECT id, order_id 
               FROM person 
               FULL OUTER JOIN orders 
               ON person.id = person.name || random(xxx)

@alamb
Copy link
Contributor

alamb commented Nov 15, 2022

I restarted the failed CI job for this PR

@mingmwang
Copy link
Contributor

Please also add test case to verify this SQL to make sure the PR does not add unnecessary projection Exprs.

let sql = "SELECT  orders.customer_id * 2,  person.id + 10
            FROM person 
            INNER JOIN orders 
            ON orders.customer_id * 2 = person.id + 10"

@ygf11
Copy link
Contributor Author

ygf11 commented Nov 17, 2022

Please also add test case to verify this SQL to make sure the PR does not add unnecessary projection Exprs.

Added as test_select_join_key_inner_join.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great -- than you @ygf11 and @mingmwang


#[test]
fn test_one_side_constant_full_join() {
// TODO: this sql should be parsed as join after
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 822022d into apache:master Nov 17, 2022
@ursabot
Copy link

ursabot commented Nov 17, 2022

Benchmark runs are scheduled for baseline = 062acad and contender = 822022d. 822022d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


// Remove temporary projected columns if necessary.
if left_projected || right_projected {
let final_join_result = join_schema
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trim the temp projected columns here, and the test case is test_select_all_inner_join.
cc @mingmwang

.iter()
.flat_map(|expr| expr.try_into_col().is_err().then_some(expr))
.cloned()
.collect::<HashSet<Expr>>();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When projection items contain duplicated expressions, the sql will failed.
So I do deduplication here, relative test case is also added.

cc @alamb @mingmwang

@ygf11
Copy link
Contributor Author

ygf11 commented Nov 18, 2022

Sorry, I forget to submit the review.
Then some description are in pending status, you do not see them.

@alamb @mingmwang, any comment are welcome, I can fix them in another pr. Thank you.

@alamb
Copy link
Contributor

alamb commented Nov 18, 2022

Thanks @ygf11 -- I think it looks good to me; Thanks again for all your help

@ygf11 ygf11 deleted the expr-eq-join branch November 20, 2022 02:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support more expressions in equality join
4 participants