Filter nulls from joins where possible to improve performance #754

Merged
revans2 merged 2 commits into NVIDIA:branch-0.2 from try_again on Sep 12, 2020

@revans2 (Collaborator) commented Sep 12, 2020

This fixes #569

I pulled out the performance fix #594 yesterday because I wanted to be sure that all of the queries were correct for the 0.2 release and I wasn't sure I could find the root cause of the issue before the release.

I found it. When doing a join, each incoming table has the join keys extracted from it, and then the join is performed. The null filtering was intended to go after the keys were extracted and before the actual join. On the build side I had done that, but on the stream side I had inserted the null filtering before the keys were extracted. In some cases that caused an unrelated column to have all of its nulls removed. fa1215b has the fix: I moved the null filtering to after the project on the stream side, which also cleans up the code a bit more.
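To make the ordering problem concrete, here is a minimal sketch in plain Python. It is not the plugin code. It assumes, purely for illustration, that the null filter refers to columns by their position in the projected key table; the helper names `project_keys` and `filter_nulls` and the sample rows are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[int], ...]


def project_keys(rows: List[Row], key_ordinals: Sequence[int]) -> List[Row]:
    """Extract the join-key columns, in order, from each row."""
    return [tuple(row[i] for i in key_ordinals) for row in rows]


def filter_nulls(rows: List[Row], ordinals: Sequence[int]) -> List[Row]:
    """Drop rows that hold a null (None) in any of the given column positions."""
    return [row for row in rows if all(row[i] is not None for i in ordinals)]


# Hypothetical stream-side table with columns (payload, key).
stream = [(None, 1), (10, None), (20, 2)]
key_ordinals = [1]     # the join key is column 1 of the incoming table
filter_ordinals = [0]  # assumed: the filter targets column 0 of the projected keys

# Wrong order: filtering before the projection checks the payload column,
# so a row with a valid key is dropped and a null key survives.
buggy_keys = project_keys(filter_nulls(stream, filter_ordinals), key_ordinals)

# Intended order: project the keys first, then filter the nulls out of them.
fixed_keys = filter_nulls(project_keys(stream, key_ordinals), filter_ordinals)

print(buggy_keys)  # [(None,), (2,)]
print(fixed_keys)  # [(1,), (2,)]
```

In the wrong order the row with key `1` disappears because its unrelated payload column is null, while the null key still reaches the join. That matches the symptom described above, though the real mechanism in the plugin may differ in detail.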

There are a lot of code changes in here, but some of them are done 3 times because of the shim layer.

…A#594)

* Filter nulls from joins where possible to improve performance.

Signed-off-by: Robert (Bobby) Evans <[email protected]>

* Addressed review comments

Signed-off-by: Robert (Bobby) Evans <[email protected]>

* Updated patch for other shims
@revans2 revans2 added the bug (Something isn't working) and performance (A performance related task/issue) labels Sep 12, 2020
@revans2 revans2 requested a review from jlowe September 12, 2020 18:48
@revans2 revans2 self-assigned this Sep 12, 2020
@revans2 (Collaborator, Author) commented Sep 12, 2020

build

@revans2 revans2 merged commit ea95ff3 into NVIDIA:branch-0.2 Sep 12, 2020
@revans2 revans2 deleted the try_again branch September 12, 2020 19:52
JustPlay pushed a commit to JustPlay/spark-rapids that referenced this pull request Sep 13, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#754)

Signed-off-by: spark-rapids automation <[email protected]>
Development

Successfully merging this pull request may close these issues.

[BUG] left_semi_join operation is abnormal and serious time-consuming