Filter nulls from joins where possible to improve performance #754

revans2 · 2020-09-12T18:48:13Z

This fixes #569

I pulled out the performance fix (#594) yesterday because I wanted to be sure that all of the queries were correct for the 0.2 release, and I wasn't sure I could find the root cause of the issue before the release.

I found it. When doing a join, each incoming table has the join keys extracted from it, and then the join is performed. The null filtering was intended to go after the keys were extracted and before the actual join. On the build side I had done that, but on the stream side I had inserted the null filtering before the keys were extracted. In some cases that caused an unrelated column to have all of its nulls removed. fa1215b has the fix: I moved the null filtering to after the project on the stream side, which also cleans up the code a bit more.
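To make the ordering issue concrete, here is a toy sketch in plain Python. It is not the plugin's code: `filter_nulls`, `project`, the sample rows, and the column ordinals are all hypothetical, and it assumes the misplaced filter addressed columns by their position in the projected key table, which is one way an unrelated column could end up being filtered.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[Optional[int], ...]


def filter_nulls(rows: List[Row], ordinals: Sequence[int]) -> List[Row]:
    """Keep only rows whose columns at the given ordinals are all non-null."""
    return [r for r in rows if all(r[i] is not None for i in ordinals)]


def project(rows: List[Row], source_ordinals: Sequence[int]) -> List[Row]:
    """Extract the listed columns from each row (the key 'project' step)."""
    return [tuple(r[i] for i in source_ordinals) for r in rows]


# Hypothetical stream-side table with columns (payload, key).
stream: List[Row] = [(None, 1), (10, None), (20, 2)]

key_source = [1]    # assumption: the join key is column 1 of the raw table
key_ordinals = [0]  # after projection, the key is column 0

# Misplaced order: filter first, then project. Ordinal 0 points at the
# payload column of the raw table, so the wrong rows are dropped.
buggy = project(filter_nulls(stream, key_ordinals), key_source)

# Intended order: project the keys first, then filter nulls on the keys.
fixed = filter_nulls(project(stream, key_source), key_ordinals)

print(buggy)  # the null key survives and a valid key is lost
print(fixed)  # only the non-null keys remain
```

In this model the misplaced order drops the row with key 1 because its payload is null, while the row with a null key passes through to the join. Filtering after the project removes exactly the null keys.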

There are a lot of code changes in here, but some of them are done three times because of the shim layer.

…A#594)
* Filter nulls from joins where possible to improve performance. Signed-off-by: Robert (Bobby) Evans <[email protected]>
* Addressed review comments Signed-off-by: Robert (Bobby) Evans <[email protected]>
* Updated patch for other shims

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2020-09-12T18:48:27Z

build

…#754) Signed-off-by: Robert (Bobby) Evans <[email protected]>

…IDIA#754) Signed-off-by: spark-rapids automation <[email protected]> Signed-off-by: spark-rapids automation <[email protected]>

revans2 added 2 commits September 12, 2020 09:19

Fixed issue with filtering before the key project and not after

fa1215b

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 added the bug and performance labels Sep 12, 2020

revans2 requested a review from jlowe September 12, 2020 18:48

revans2 self-assigned this Sep 12, 2020

jlowe approved these changes Sep 12, 2020

revans2 merged commit ea95ff3 into NVIDIA:branch-0.2 Sep 12, 2020

revans2 deleted the try_again branch September 12, 2020 19:52

JustPlay pushed a commit to JustPlay/spark-rapids that referenced this pull request Sep 13, 2020

Filter nulls from joins where possible to improve performance (NVIDIA…

274ec7c

…#754) Signed-off-by: Robert (Bobby) Evans <[email protected]>

chenrui17 mentioned this pull request Jan 14, 2021

[BUG] OutOfMemoryError - Maximum pool size exceeded while running 24 day criteo ETL Transform stage #1274

Closed

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Filter nulls from joins where possible to improve performance (NVIDIA…

9f8e33e

…#754) Signed-off-by: Robert (Bobby) Evans <[email protected]>

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021

Filter nulls from joins where possible to improve performance (NVIDIA…

0d989a1

…#754) Signed-off-by: Robert (Bobby) Evans <[email protected]>
