-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JOIN conditions are order dependent #778
Conversation
.map(|c| c.normalize(&self.plan)) | ||
.collect::<Result<_>>() | ||
.or_else(|_| { | ||
swap = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't "swapping" be done on an individual level, e.g. for left1=right1 and right2=left2
it will swap everything now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for you review. I will address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refactored this to operate at an individual condition level and I think the code is much more succinct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Welcome back @seddonm1 ! It is great to see a contribution from you. Thank you very much
@@ -1754,51 +1779,70 @@ async fn equijoin_and_unsupported_condition() -> Result<()> { | |||
#[tokio::test] | |||
async fn left_join() -> Result<()> { | |||
let mut ctx = create_join_context("t1_id", "t2_id")?; | |||
let sql = "SELECT t1_id, t1_name, t2_name FROM t1 LEFT JOIN t2 ON t1_id = t2_id ORDER BY t1_id"; | |||
let actual = execute(&mut ctx, sql).await; | |||
let equivalent_sql = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this test case and those below it add any additional coverage compared to
async fn equijoin_multiple_condition_ordering() -> Result<()> {
Like are there bugs that would cause equijoin_multiple_condition_ordering
to pass but the others to fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main objective was to ensure nothing specific to equijoin (INNER
) would break this implementation for the other join types (which occurred when developing it). This may be more relevant if any further optimisations occur.
.into_iter() | ||
.map(|c| c.into().normalize(right)) | ||
.collect::<Result<_>>()?; | ||
let (left_keys, right_keys): (Vec<Result<Column>>, Vec<Result<Column>>) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand this code -- if LogicalPlanBuilder::join
is called it has two inputs, a left (self
) and a right (right
) so the idea that left_keys
might contain references to right
is confusing to me.
I wonder if there might be a better place to fix up the errors -- perhaps somewhere further up in the call stack thatn LogicalPlanBuilder::join
-- for example extract_join_keys
in the SQL planner (especially since this bug seems related to SQL planning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Welcome back @seddonm1 ! It is great to see a contribution from you. Thank you very much
Thanks @alamb. Unfortunately my employment situation makes it hard to contribute too much but I am still very interested in seeing Datafusion succeed.
The LogicalPlanBuilder::join
has left
and right
keys but they are left
and right
relative to a join condition not to a left
or right
relation.
The left_keys
and right_keys
are where the join conditions are resolved to underlying relations. The Postgres behavior does not require a specific order of conditions for joining so we need to be able to handle both conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think the argument name makes sense with the old assumption that left_keys should only refer to columns from the left plan. With this fix, it can become confusing.
Perhaps a better signature for this function would be something along the lines of: join_keys: Vec<(impl Into<Column>, impl Into<Column>)>
.
If we want the dataframe API to also support order independent JOIN conditions, then it is indeed better to address the fix within this builder method. Otherwise I agree it would be better to fix it in the SQL planner instead. I am leaning towards supporting this in dataframe API as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing the signature to be join_keys: Vec<(impl Into<Column>, impl Into<Column>)>.
instead of left_keys:...
and right_keys: ...
makes sense to me
@alamb @houqp I have updated the PR with @alamb suggestion of a Regarding @houqp dataframe suggestion, I am not sure how we proceed; I would expect users of the dataframe API to be much more capable of adapting to logically specifying keys in the 'correct' order vs SQL users who just want it to behave like any other SQL engine? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks. @seddonm1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
Regarding @houqp dataframe suggestion, I am not sure how we proceed; I would expect users of the dataframe API to be much more capable of adapting to logically specifying keys in the 'correct' order vs SQL users who just want it to behave like any other SQL engine?
I agree. If it ever becomes a problem in the future, we can address it with minimal code change since this PR has already done all the heavy lifting.
🎉 thanks again @seddonm1 ! |
Which issue does this PR close?
Closes #777.
Rationale for this change
Currently the Logical Plan builder assumes that the join conditions fields are provided in the same order as the tables are specified. This is different behavior to how Postgres works which does not care about the order of the tables listed in the condition. This PR aims to rectify this discrepancy.
What changes are included in this PR?
This change allows the conditions to be reversed and will fail as previously if neither relation contains the column. A lot of the tests in
sql.rs
have been updated to test both conditions.Are there any user-facing changes?
This does not change public APIs, and should not be noticeable by any current users.