-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix full outer result mismatch issue when output contains multiple matching rows #11068
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
28c8a14
to
8da6dd9
Compare
@JkSelf please let me know when it is ready for review. Also, please add summary explaining the exact problem and fix. Thanks! |
b13b661
to
30158db
Compare
@JkSelf thanks for looking into this. I'm still having a hard time to follow the code, what exactly was broken and how the proposed change addresses it. Could you elaborate more and maybe provide a few examples in the summary? We would also need to update the documentation of the structures and classes in the header file; if you update the documentation of the parts you are changing it may make it easier to review. |
30158db
to
c20a3f5
Compare
@pedroerp Add examples to describe the issue in the summary. And also add the comments in header file. Can you help to review again? Thanks. |
f634a8c
to
2611ffb
Compare
Thanks @JkSelf for the detailed explanation. I think now I understand the issue. It seems to me that the root of the problem is that we reuse the same JoinTracker pointer when doing both left or right joins. This works when you just need to keep track or hits/misses from one side, but not if you have a full joins and hence needs to keep track of both. Wouldn't, in this case, the solution be that we need to keep two JoinTrackers, one for left and one for right joins, the at applyFilter() have a bit of a special logic to make sure records are included or skipped based on whether they are results of misses or hits on either side? |
@pedroerp Thanks for your reply. Here, we only need to address the situation where the join key in a full outer join contains multiple duplicate records. In such cases, it either degenerates into a right join or a left join. If it is a left join, we can easily track the left join because the data is grouped sequentially. For example, 0-3 is the first group, 4-7 is the second group, and 8-11 is the third group.
When applying the filter, it is easy to add a new row for the missing rows and set all the columns on the right side to null. However, at this point, tracking the right join becomes more complex. First, based on the left join groups, we need to reshuffle the data. For example, (0, 4, 8) is the first group for the right join, (1, 5, 9) is the second group, (2, 6, 10) is the third group, and (3, 7, 11) is the fourth group. |
Assume the left table has columns a and b:
The right table has columns c and d:
The two tables are joined using a full outer join on the condition a == c and b < d. During the doGetOutput phase, the result is matched using a left join, resulting in 3 * 4 = 12 records:
Then, in the filter method, the records are filtered based on the condition b < d, resulting in the following:
Finally, records from the left table that do not have a match are filled with nulls, resulting in the following final output:
The above result is incorrect because it is missing rows from the right table that do not have a match. Among the 12 rows above, rows 0, 4, and 8 correspond to the first record (2, 3) from the right table, rows 1, 5, and 9 correspond to the second record (2, -1) from the right table, rows 2, 6, and 10 correspond to the third record (2, -1) from the right table, and rows 3, 7, and 11 correspond to the fourth record (2, 3) from the right table. From the matching results above, rows 1, 5, and 9, as well as rows 2, 6, and 10, are all false, meaning that the third and fourth records from the right table do not have matching rows. Therefore, the final result is missing rows from the right table that do not have matches. The correct final result should be:
This PR calls the filter function when the keys are the same to filter out rows from the right table that do not have matches. If a row from the right table does not have a match, a new record is inserted with the corresponding columns from the left table set to null.