Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: allow duplicate field names in table join, fix output with duplicated names #1023

Merged
merged 2 commits into from
Sep 20, 2021

Conversation

houqp
Copy link
Member

@houqp houqp commented Sep 19, 2021

Rationale for this change

The current join constraint check is too strict, which causes false positives for valid join queries. Our current outer/full/right/left join logic is also not correct because it generates row values for both of the join columns from only one side of the join.

What changes are included in this PR?

  1. allow duplicate column names in check_join_set_is_valid
  2. move column index building logic into build_join_schema so it can support column with duplicated names. This should also result in minor performance improvement because we are not rebuilding the same column index when executing join on every single partition anymore.

Are there any user-facing changes?

no

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Sep 19, 2021
"| 3 | 7 | 9 | | 7 | |",
"| 1 | 4 | 7 | | | |",
"| 2 | 5 | 8 | | | |",
"| 3 | 7 | 9 | | | |",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these test fixes demonstrate the incorrect join behavior that this PR also fixes.

@@ -188,7 +182,8 @@ impl HashJoinExec {
let right_schema = right.schema();
check_join_is_valid(&left_schema, &right_schema, &on)?;

let schema = Arc::new(build_join_schema(&left_schema, &right_schema, join_type));
let (schema, column_indices) =
build_join_schema(&left_schema, &right_schema, join_type);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column indices are now created only once when we create the HashJoinExec node.

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice improvement to me - tests clearly demonstrate the fixes (I totally missed them before!)

@@ -229,38 +225,6 @@ impl HashJoinExec {
pub fn partition_mode(&self) -> &PartitionMode {
&self.mode
}

/// Calculates column indices and left/right placement on input / output schemas and jointype
fn column_indices_from_schema(&self) -> ArrowResult<Vec<ColumnIndex>> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed anymore

left: &Schema,
right: &Schema,
join_type: &JoinType,
) -> (Schema, Vec<ColumnIndex>) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the main change, we are now returning column index with schema in this function.

use std::sync::Arc;

use crate::logical_plan::JoinType;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved join specific code into join_utils.rs

@houqp
Copy link
Member Author

houqp commented Sep 20, 2021

@Dandandan pushed refactor based on your suggestions :) PTAL.

@alamb alamb changed the title fix: allow duplicate field names in table join fix: allow duplicate field names in table join, fix output with duplicated names Sep 20, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the code very carefully but I did review the test changes carefully and they all looked great to me.

Nice work @houqp ❤️

@@ -1375,7 +1344,7 @@ mod tests {
"| 1 | 4 | 7 | 10 | 4 | 70 |",
"| 2 | 5 | 8 | 20 | 5 | 80 |",
"| 2 | 5 | 8 | 20 | 5 | 80 |",
"| 3 | 7 | 9 | | 7 | |",
"| 3 | 7 | 9 | | | |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked this and the answer after this PR appears to be correct (there is no value 7 in b1 on the right input 👍


let expected = vec![
"+---+---+---+----+---+----+",
"| a | b | c | a | b | c |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Dandandan Dandandan merged commit 65483d3 into apache:master Sep 20, 2021
@Dandandan
Copy link
Contributor

Thanks so much @houqp

@houqp houqp deleted the qp_join branch September 20, 2021 22:31
@houqp houqp added api change Changes the API exposed to users of the crate bug Something isn't working labels Sep 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate bug Something isn't working datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants