fix: allow duplicate field names in table join, fix output with duplicated names #1023

houqp · 2021-09-19T23:47:29Z

Rationale for this change

The current join constraint check is too strict and rejects valid join queries. Our current left/right/full outer join logic is also incorrect: it fills in the row values for both join columns from only one side of the join, so an unmatched row still shows a join key value on the side that has no matching row.

What changes are included in this PR?

Allow duplicate column names in check_join_set_is_valid.
Move the column index building logic into build_join_schema so it can support columns with duplicated names. This should also give a minor performance improvement, because we no longer rebuild the same column index for every partition when executing a join.
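To illustrate why name-based lookups stop working once both join inputs may share field names, here is a minimal sketch. It uses plain string slices instead of Arrow schemas, and `index_of_name` and `joined_fields` are hypothetical helpers made up for this example, not DataFusion functions.

```rust
/// Hypothetical helper: position of the first field called `name`.
fn index_of_name(fields: &[&str], name: &str) -> Option<usize> {
    fields.iter().position(|f| *f == name)
}

/// Output field names for a join of two inputs that share every column name.
fn joined_fields() -> Vec<&'static str> {
    let left = ["a", "b", "c"];
    let right = ["a", "b", "c"];
    // The join output is the left fields followed by the right fields.
    left.iter().chain(right.iter()).copied().collect()
}
```

Looking up `"b"` in the six joined fields always resolves to position 1 (the left input), so the right-hand `b` at position 4 can never be reached by name. Tracking an explicit (index, side) pair per output column avoids that ambiguity.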

Are there any user-facing changes?

no

houqp · 2021-09-19T23:48:28Z

datafusion/src/physical_plan/hash_join.rs

-            "| 3  | 7  | 9  |    | 7  |    |",
+            "| 1  | 4  | 7  |    |    |    |",
+            "| 2  | 5  | 8  |    |    |    |",
+            "| 3  | 7  | 9  |    |    |    |",


These test fixes demonstrate the incorrect join behavior that this PR also fixes.

houqp · 2021-09-19T23:49:53Z

datafusion/src/physical_plan/hash_join.rs

@@ -188,7 +182,8 @@ impl HashJoinExec {
        let right_schema = right.schema();
        check_join_is_valid(&left_schema, &right_schema, &on)?;

-        let schema = Arc::new(build_join_schema(&left_schema, &right_schema, join_type));
+        let (schema, column_indices) =
+            build_join_schema(&left_schema, &right_schema, join_type);


Column indices are now created only once, when we create the HashJoinExec node.
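A rough sketch of the "compute once, reuse per partition" idea follows. `JoinNode` and this simplified `ColumnIndex` are hypothetical stand-ins written for illustration; they are not the actual HashJoinExec types.

```rust
/// Hypothetical, simplified column index: position in one input plus its side.
#[derive(Debug, Clone, PartialEq)]
struct ColumnIndex {
    index: usize,
    is_left: bool,
}

/// Hypothetical join node that keeps the indices computed at construction.
struct JoinNode {
    column_indices: Vec<ColumnIndex>,
}

impl JoinNode {
    /// Build the indices a single time, from the widths of the two inputs.
    fn new(left_width: usize, right_width: usize) -> Self {
        let left = (0..left_width).map(|index| ColumnIndex { index, is_left: true });
        let right = (0..right_width).map(|index| ColumnIndex { index, is_left: false });
        JoinNode {
            column_indices: left.chain(right).collect(),
        }
    }

    /// Every partition borrows the stored indices instead of rebuilding them.
    fn execute(&self, _partition: usize) -> &[ColumnIndex] {
        &self.column_indices
    }
}
```

Calling `execute` for different partitions hands back the same stored slice, so the per-partition cost of rebuilding the index goes away.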

Dandandan

Looks like a nice improvement to me - tests clearly demonstrate the fixes (I totally missed them before!)

houqp · 2021-09-20T16:41:08Z

datafusion/src/physical_plan/hash_join.rs

@@ -229,38 +225,6 @@ impl HashJoinExec {
    pub fn partition_mode(&self) -> &PartitionMode {
        &self.mode
    }
-
-    /// Calculates column indices and left/right placement on input / output schemas and jointype
-    fn column_indices_from_schema(&self) -> ArrowResult<Vec<ColumnIndex>> {


Not needed anymore, since the column indices are now built in build_join_schema.

houqp · 2021-09-20T16:41:52Z

datafusion/src/physical_plan/join_utils.rs

+    left: &Schema,
+    right: &Schema,
+    join_type: &JoinType,
+) -> (Schema, Vec<ColumnIndex>) {


This is the main change: the function now returns the column indices together with the schema.
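For readers who want the shape of that change without the Arrow types, here is a simplified sketch. `build_join_schema_sketch`, the string-based "schema", the `JoinType` enum and the `ColumnIndex` struct below are all assumptions made for this example; in particular, treating semi and anti joins as producing only the left columns is an assumption of the sketch and is not shown in the diff above.

```rust
/// Hypothetical, simplified join type enum for this sketch.
#[derive(Debug, Clone, Copy)]
enum JoinType {
    Inner,
    Left,
    Right,
    Full,
    Semi,
    Anti,
}

/// Hypothetical column index: position in one input plus which side it is from.
#[derive(Debug, Clone, PartialEq)]
struct ColumnIndex {
    index: usize,
    is_left: bool,
}

/// Returns the output field names together with one index entry per output column.
fn build_join_schema_sketch(
    left: &[&str],
    right: &[&str],
    join_type: JoinType,
) -> (Vec<String>, Vec<ColumnIndex>) {
    let left_cols = left
        .iter()
        .enumerate()
        .map(|(index, name)| (name.to_string(), ColumnIndex { index, is_left: true }));
    match join_type {
        JoinType::Inner | JoinType::Left | JoinType::Right | JoinType::Full => {
            let right_cols = right
                .iter()
                .enumerate()
                .map(|(index, name)| (name.to_string(), ColumnIndex { index, is_left: false }));
            // Left columns first, then right columns; duplicate names are fine
            // because each output column carries its own (index, side) pair.
            left_cols.chain(right_cols).unzip()
        }
        // Assumption of this sketch: semi/anti joins only emit left columns.
        JoinType::Semi | JoinType::Anti => left_cols.unzip(),
    }
}
```

With inputs `["a", "b", "c"]` on both sides and a left join, the fifth output column is named `b` but is unambiguously identified as index 1 of the right input.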

houqp · 2021-09-20T16:42:27Z

datafusion/src/physical_plan/hash_utils.rs

 use std::sync::Arc;

-use crate::logical_plan::JoinType;


Moved join-specific code into join_utils.rs.

houqp · 2021-09-20T16:43:11Z

@Dandandan I pushed a refactor based on your suggestions :) PTAL.

alamb

I didn't review the code very carefully but I did review the test changes carefully and they all looked great to me.

Nice work @houqp ❤️

alamb · 2021-09-20T18:22:35Z

datafusion/src/physical_plan/hash_join.rs

@@ -1375,7 +1344,7 @@ mod tests {
            "| 1  | 4  | 7  | 10 | 4  | 70 |",
            "| 2  | 5  | 8  | 20 | 5  | 80 |",
            "| 2  | 5  | 8  | 20 | 5  | 80 |",
-            "| 3  | 7  | 9  |    | 7  |    |",
+            "| 3  | 7  | 9  |    |    |    |",


I double-checked this and the answer after this PR appears to be correct (there is no value 7 in b1 on the right input) 👍
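The expected semantics can be checked with a small standalone sketch. The row values are taken from the expected output lines in the diffs above; `left_join` is a hypothetical helper written for this example, and it hashes the right input purely for brevity, which is not a claim about how HashJoinExec builds its hash table.

```rust
use std::collections::HashMap;

type Row = (i32, i32, i32);
type OutRow = (i32, i32, i32, Option<i32>, Option<i32>, Option<i32>);

/// Left join on the middle column of each row.
fn left_join(left: &[Row], right: &[Row]) -> Vec<OutRow> {
    // Group the right rows by their join key.
    let mut by_key: HashMap<i32, Vec<&Row>> = HashMap::new();
    for row in right {
        by_key.entry(row.1).or_default().push(row);
    }

    let mut out = Vec::new();
    for &(a, b, c) in left {
        match by_key.get(&b) {
            Some(matches) => {
                for &&(ra, rb, rc) in matches {
                    out.push((a, b, c, Some(ra), Some(rb), Some(rc)));
                }
            }
            // No match: every right-side column is null, including the join key.
            None => out.push((a, b, c, None, None, None)),
        }
    }
    out
}
```

Joining left rows `(1, 4, 7)`, `(2, 5, 8)`, `(3, 7, 9)` against right rows `(10, 4, 70)`, `(20, 5, 80)` leaves the third left row with nulls in all three right-hand columns, matching the corrected test line.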

alamb · 2021-09-20T18:23:00Z

datafusion/src/physical_plan/hash_join.rs

+
+        let expected = vec![
+            "+---+---+---+----+---+----+",
+            "| a | b | c | a  | b | c  |",


Dandandan · 2021-09-20T19:48:55Z

Thanks so much @houqp

fix: allow duplicate field names in table join

c0082d7

houqp requested review from Dandandan, alamb, andygrove and jorgecarleitao September 19, 2021 23:47

github-actions bot added the datafusion (Changes in the datafusion crate) label Sep 19, 2021

houqp commented Sep 19, 2021

houqp mentioned this pull request Sep 19, 2021

Alias issues on join roapi/roapi#78

Closed

Dandandan reviewed Sep 20, 2021

Dandandan approved these changes Sep 20, 2021

Dandandan reviewed Sep 20, 2021

Dandandan reviewed Sep 20, 2021

move join related code into join_utils.rs

5de761f

houqp commented Sep 20, 2021

Dandandan approved these changes Sep 20, 2021

alamb changed the title ~~fix: allow duplicate field names in table join~~ fix: allow duplicate field names in table join, fix output with duplicated names Sep 20, 2021

alamb approved these changes Sep 20, 2021

Dandandan merged commit 65483d3 into apache:master Sep 20, 2021

houqp deleted the qp_join branch September 20, 2021 22:31

houqp added the api change (Changes the API exposed to users of the crate) and bug (Something isn't working) labels Sep 20, 2021

alamb mentioned this pull request May 4, 2022

Fix bug in subquery join filters referencing outer query #2416

Merged
