Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix full outer result mismatch issue when output contains multiple matching rows #11068

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

JkSelf
Copy link
Collaborator

@JkSelf JkSelf commented Sep 23, 2024

Assume the left table has columns a and b:

a       b
2	100
2	1
2	1

The right table has columns c and d:

c       d
2	3
2	-1
2	-1
2	3

The two tables are joined using a full outer join on the condition a == c and b < d. During the doGetOutput phase, the result is matched using a left join, resulting in 3 * 4 = 12 records:

No      a        b       c      d
0	2	100	 2	3
1	2	100	 2	-1
2	2	100	 2	-1
3	2	100	 2	3
4	2	1	 2	3
5	2	1	 2	-1
6	2	1	 2	-1
7	2	1	 2	3
8	2	1	 2	3
9	2	1	 2	-1
10	2	1	 2	-1
11	2	1	 2	3

Then, in the filter method, the records are filtered based on the condition b < d, resulting in the following:

No	a	b	c	d	matched
0	2	100	2	3	FALSE
1	2	100	2	-1	FALSE
2	2	100	2	-1	FALSE
3	2	100	2	3	FALSE
4	2	1	2	3	TRUE
5	2	1	2	-1	FALSE
6	2	1	2	-1	FALSE
7	2	1	2	3	TRUE
8	2	1	2	3	TRUE
9	2	1	2	-1	FALSE
10	2	1	2	-1	FALSE
11	2	1	2	3	TRUE

Finally, records from the left table that do not have a match are filled with nulls, resulting in the following final output:

No	a	b	c	 d
0	2	100	null null
1	2	1	2	 3
2	2	1	2	 3
3	2	1	2	 3
4	2	1	2	 3

The above result is incorrect because it is missing rows from the right table that do not have a match. Among the 12 rows above, rows 0, 4, and 8 correspond to the first record (2, 3) from the right table, rows 1, 5, and 9 correspond to the second record (2, -1) from the right table, rows 2, 6, and 10 correspond to the third record (2, -1) from the right table, and rows 3, 7, and 11 correspond to the fourth record (2, 3) from the right table. From the matching results above, rows 1, 5, and 9, as well as rows 2, 6, and 10, are all false, meaning that the third and fourth records from the right table do not have matching rows. Therefore, the final result is missing rows from the right table that do not have matches. The correct final result should be:

No	a	b	c	 d
0	2	100	null   null
1	2	1	2	 3
2	2	1	2	 3
3	2	1	2	 3
4	2	1	2	 3
5       null    null    2       -1
6       null    null    2       -1

This PR calls the filter function when the keys are the same to filter out rows from the right table that do not have matches. If a row from the right table does not have a match, a new record is inserted with the corresponding columns from the left table set to null.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 23, 2024
Copy link

netlify bot commented Sep 23, 2024

Deploy Preview for meta-velox ready!

Name Link
🔨 Latest commit 2611ffb
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/672c28a726dd0d0008228ed6
😎 Deploy Preview https://deploy-preview-11068--meta-velox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@pedroerp
Copy link
Contributor

@JkSelf please let me know when it is ready for review. Also, please add summary explaining the exact problem and fix. Thanks!

@JkSelf JkSelf force-pushed the full-outer-result-mismatch branch 2 times, most recently from b13b661 to 30158db Compare September 25, 2024 07:48
@JkSelf
Copy link
Collaborator Author

JkSelf commented Sep 25, 2024

@JkSelf please let me know when it is ready for review. Also, please add summary explaining the exact problem and fix. Thanks!

@pedroerp This PR is ready to review. And i also add the problem and fix description. Can you help to review again? Thanks.

@JkSelf JkSelf changed the title [WIP] Fix full outer result mismatch Fix full outer result mismatch issue Sep 25, 2024
@pedroerp
Copy link
Contributor

pedroerp commented Oct 1, 2024

@JkSelf thanks for looking into this. I'm still having a hard time to follow the code, what exactly was broken and how the proposed change addresses it. Could you elaborate more and maybe provide a few examples in the summary?

We would also need to update the documentation of the structures and classes in the header file; if you update the documentation of the parts you are changing it may make it easier to review.

@JkSelf JkSelf changed the title Fix full outer result mismatch issue Fix full outer result mismatch issue when output contains multiple matching rows Oct 9, 2024
@JkSelf
Copy link
Collaborator Author

JkSelf commented Oct 9, 2024

@JkSelf thanks for looking into this. I'm still having a hard time to follow the code, what exactly was broken and how the proposed change addresses it. Could you elaborate more and maybe provide a few examples in the summary?

We would also need to update the documentation of the structures and classes in the header file; if you update the documentation of the parts you are changing it may make it easier to review.

@pedroerp Add examples to describe the issue in the summary. And also add the comments in header file. Can you help to review again? Thanks.

@pedroerp
Copy link
Contributor

Thanks @JkSelf for the detailed explanation. I think now I understand the issue.

It seems to me that the root of the problem is that we reuse the same JoinTracker pointer when doing both left or right joins. This works when you just need to keep track or hits/misses from one side, but not if you have a full joins and hence needs to keep track of both.

Wouldn't, in this case, the solution be that we need to keep two JoinTrackers, one for left and one for right joins, the at applyFilter() have a bit of a special logic to make sure records are included or skipped based on whether they are results of misses or hits on either side?

@JkSelf
Copy link
Collaborator Author

JkSelf commented Nov 22, 2024

@pedroerp Thanks for your reply.

Here, we only need to address the situation where the join key in a full outer join contains multiple duplicate records. In such cases, it either degenerates into a right join or a left join. If it is a left join, we can easily track the left join because the data is grouped sequentially. For example, 0-3 is the first group, 4-7 is the second group, and 8-11 is the third group.

No      a        b       c      d
0	2	100	 2	3
1	2	100	 2	-1
2	2	100	 2	-1
3	2	100	 2	3
4	2	1	 2	3
5	2	1	 2	-1
6	2	1	 2	-1
7	2	1	 2	3
8	2	1	 2	3
9	2	1	 2	-1
10	2	1	 2	-1
11	2	1	 2	3

When applying the filter, it is easy to add a new row for the missing rows and set all the columns on the right side to null. However, at this point, tracking the right join becomes more complex. First, based on the left join groups, we need to reshuffle the data. For example, (0, 4, 8) is the first group for the right join, (1, 5, 9) is the second group, (2, 6, 10) is the third group, and (3, 7, 11) is the fourth group.
With such grouping, it is difficult to traverse the output when applying the filter. Similarly, if we follow the right join, we will encounter the same issue with tracking the left join.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. merge-join
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge Join Full Outer Join with filter is not filtering on some equality clause matches
3 participants