-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: allow duplicate field names in table join, fix output with duplicated names #1023
Conversation
"| 3 | 7 | 9 | | 7 | |", | ||
"| 1 | 4 | 7 | | | |", | ||
"| 2 | 5 | 8 | | | |", | ||
"| 3 | 7 | 9 | | | |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these test fixes demonstrate the incorrect join behavior that this PR also fixes.
@@ -188,7 +182,8 @@ impl HashJoinExec { | |||
let right_schema = right.schema(); | |||
check_join_is_valid(&left_schema, &right_schema, &on)?; | |||
|
|||
let schema = Arc::new(build_join_schema(&left_schema, &right_schema, join_type)); | |||
let (schema, column_indices) = | |||
build_join_schema(&left_schema, &right_schema, join_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column indices are now created only once when we create the HashJoinExec node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a nice improvement to me - tests clearly demonstrate the fixes (I totally missed them before!)
@@ -229,38 +225,6 @@ impl HashJoinExec { | |||
pub fn partition_mode(&self) -> &PartitionMode { | |||
&self.mode | |||
} | |||
|
|||
/// Calculates column indices and left/right placement on input / output schemas and jointype | |||
fn column_indices_from_schema(&self) -> ArrowResult<Vec<ColumnIndex>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed anymore
left: &Schema, | ||
right: &Schema, | ||
join_type: &JoinType, | ||
) -> (Schema, Vec<ColumnIndex>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the main change, we are now returning column index with schema in this function.
use std::sync::Arc; | ||
|
||
use crate::logical_plan::JoinType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved join specific code into join_utils.rs
@Dandandan pushed refactor based on your suggestions :) PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't review the code very carefully but I did review the test changes carefully and they all looked great to me.
Nice work @houqp ❤️
@@ -1375,7 +1344,7 @@ mod tests { | |||
"| 1 | 4 | 7 | 10 | 4 | 70 |", | |||
"| 2 | 5 | 8 | 20 | 5 | 80 |", | |||
"| 2 | 5 | 8 | 20 | 5 | 80 |", | |||
"| 3 | 7 | 9 | | 7 | |", | |||
"| 3 | 7 | 9 | | | |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked this and the answer after this PR appears to be correct (there is no value 7
in b1
on the right input 👍
|
||
let expected = vec![ | ||
"+---+---+---+----+---+----+", | ||
"| a | b | c | a | b | c |", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks so much @houqp |
Rationale for this change
The current join constraint check is too strict, which causes false positives for valid join queries. Our current outer/full/right/left join logic is also not correct because it generates row values for both of the join columns from only one side of the join.
What changes are included in this PR?
check_join_set_is_valid
build_join_schema
so it can support column with duplicated names. This should also result in minor performance improvement because we are not rebuilding the same column index when executing join on every single partition anymore.Are there any user-facing changes?
no