Filter rows with null keys when coalescing due to reaching cuDF row limits [databricks] #5531
Conversation
Commit "Filter rows with null keys when coalescing due to reaching cuDF row limits" (Signed-off-by: Alessandro Bellina <[email protected]>), force-pushed from eb08d44 to b9ca330.
val cb = if (inputFilterExpression.isDefined) {
  // If we have reached the cuDF limit once, proactively filter batches
  // after that first limit is reached.
  GpuFilter.apply(cbFromIter, inputFilterExpression.get)
nit: the `apply` is not needed.
GpuFilter(cbFromIter, inputFilterExpression.get)
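For context: in Scala, applying an object directly is just sugar for calling its apply method, so the two forms are equivalent. A tiny standalone illustration, using a hypothetical Greeter object rather than GpuFilter:

```scala
// Hypothetical object, used only to show Scala's apply sugar.
object Greeter {
  def apply(name: String): String = s"hello, $name"
}

val explicit = Greeter.apply("world") // explicit call
val sugared  = Greeter("world")       // equivalent, and the more idiomatic spelling
assert(explicit == sugared)
```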
this should be done.
@@ -351,7 +373,14 @@ abstract class AbstractGpuCoalesceIterator(

    // there is a hard limit of 2^31 rows
    while (numRows < Int.MaxValue && !hasOnDeck && iter.hasNext) {
-     closeOnExcept(iter.next()) { cb =>
+     closeOnExcept(iter.next()) { cbFromIter =>
The closing of this appears to be off by a bit in error cases. GpuFilter.apply closes the input batch (ideally we should rename it to make that clear). So if an exception is thrown after that happens, we will get a double close on cbFromIter. This is kind of minor, but the result of the filter, `cb`, is now left unprotected if an exception is thrown and would leak.
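A self-contained sketch of the hazard being described, using a plain AutoCloseable and local stand-ins for the plugin's closeOnExcept helper and for a GpuFilter-like call that closes its input (none of this is the real plugin code):

```scala
// FakeBatch stands in for ColumnarBatch; close() is not idempotent so a
// double close is easy to observe.
final class FakeBatch(name: String) extends AutoCloseable {
  private var closed = false
  override def close(): Unit = {
    require(!closed, s"double close on $name")
    closed = true
  }
}

// Local stand-in for a closeOnExcept helper: close the resource only if the
// body throws, then rethrow.
def closeOnExcept[T <: AutoCloseable, R](resource: T)(body: T => R): R =
  try body(resource) catch { case t: Throwable => resource.close(); throw t }

// Mimics a filter that closes its input and returns a brand new batch that
// nothing is tracking yet.
def filterAndCloseInput(in: FakeBatch): FakeBatch = { in.close(); new FakeBatch("cb") }

closeOnExcept(new FakeBatch("cbFromIter")) { cbFromIter =>
  val cb = filterAndCloseInput(cbFromIter) // cbFromIter is already closed here
  // If anything past this point throws, closeOnExcept closes cbFromIter a
  // second time (double close) and `cb` is never closed at all (leak).
  throw new RuntimeException("boom")
}
```

Run as a script, this fails on the double close of cbFromIter rather than on the original exception, which is roughly the failure mode described above.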
I have taken care of two of these cases, but I am still left with a potential double close on `cb`: 751dce0#diff-26bc5860b4878c986610d72135b63fedf0051e84e2a89c61a8df18aea942e139R417
I think the problem is GpuFilter and our reliance on it to close things for us. We should have a version that does not close the input, and then we can have clearly defined boundaries.
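A rough sketch of what those boundaries could look like once the filter no longer closes its input. All the names here (FakeBatch, filterNoClose, the withResource/closeOnExcept stand-ins) are illustrative, not the plugin's actual API:

```scala
final class FakeBatch extends AutoCloseable { override def close(): Unit = () }

// Stand-ins for the usual resource helpers.
def withResource[T <: AutoCloseable, R](resource: T)(body: T => R): R =
  try body(resource) finally resource.close()

def closeOnExcept[T <: AutoCloseable, R](resource: T)(body: T => R): R =
  try body(resource) catch { case t: Throwable => resource.close(); throw t }

// A filter that does NOT close its input: ownership stays with the caller.
def filterNoClose(in: FakeBatch): FakeBatch = new FakeBatch

withResource(new FakeBatch) { cbFromIter =>              // the input has exactly one owner
  closeOnExcept(filterNoClose(cbFromIter)) { filtered => // the output is protected until handed off
    // ... use or enqueue `filtered`; whoever receives it closes it ...
    filtered.close()
  }
}
```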
Yes, that makes sense. Will do.
Ok, I may need to cover one more case, and that is where we still have …; I am looking into it. Before I PRed this I had also changed …
Ok, the reason for what I am seeing with some …
build
triggering the databricks CI
build
build
Revert "Filter rows with null keys when coalescing due to reaching cuDF row limits [databricks] (NVIDIA#5531)". This reverts commit 5f33368.
This PR is a workaround to handle cases where we would go above the cuDF row limit when coalescing a build-side batch in the hash join, but we have a chance to potentially rescue things if the batch is built mostly of nulls. The real fix is to not materialize those rows with null keys (https://issues.apache.org/jira/browse/SPARK-39131), but that fix in the logical plan introduces other issues that we don't have an answer to yet.
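For intuition on why dropping build-side rows with null keys can be safe, here is a small, self-contained example at the plain Spark SQL level. It assumes an inner equi-join and is only an illustration of the semantics, not the plugin's coalesce code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("null-key-filter").getOrCreate()
import spark.implicits._

// The build side contains a row whose join key is null.
val build  = Seq((Some(1), "a"), (None, "b"), (Some(2), "c")).toDF("k", "v")
val stream = Seq((1, "x"), (2, "y")).toDF("k", "v2")

// With an equi-join, build rows whose key is null can never match, so
// filtering them only shrinks the build side; it does not change the result.
val joined   = stream.join(build, Seq("k"), "inner")
val filtered = stream.join(build.filter(col("k").isNotNull), Seq("k"), "inner")

assert(joined.collect().toSet == filtered.collect().toSet)
spark.stop()
```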
Posting this as a draft to get some comments on the implementation. Specifically, it would be nice to get another pair of eyes on the join types this applies to.
I'll look into adding a test for this.