
Add oom retry handling for createGatherer in gpu hash joins #7902

Merged
merged 15 commits into NVIDIA:branch-23.04 on Apr 3, 2023

Conversation

@jbrennan333 (Contributor) commented Mar 20, 2023

This adds OOM retry handling for GPU hash joins. It is part of the work described in #7255.

This uses the RmmRapidsRetryIterator framework to add retries for `createGatherer`.

For `createGatherer`, some refactoring was done to ensure the buffers are spillable before we go into a retry loop. There are two paths for the stream side: the original LazySpillableColumnarBatch from the stream-side iterator, and the pending list (from prior splits), which was a SpillableColumnarBatch. For this PR, I chose to make all of these LazySpillableColumnarBatches, because that is what we ultimately hand off to the gatherer.
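The overall shape of that retry loop can be sketched as follows. This is an illustrative Java sketch, not the plugin code (which is Scala and uses the RmmRapidsRetryIterator framework with a real device OOM type); `RetryOOM`, `withRetry`, and the retry limit are hypothetical stand-ins.

```java
import java.util.function.Supplier;

class RetrySketch {
    // Hypothetical stand-in for the OOM the device allocator can raise.
    static class RetryOOM extends RuntimeException {}

    // Retry the attempt up to maxRetries extra times when it OOMs.
    static <T> T withRetry(int maxRetries, Supplier<T> attempt) {
        for (int i = 0; ; i++) {
            try {
                return attempt.get();
            } catch (RetryOOM oom) {
                if (i >= maxRetries) throw oom;
                // The real framework spills the (already spillable) buffers
                // back to host memory here before the next attempt.
            }
        }
    }

    public static void main(String[] args) {
        // Fail twice, then succeed on the third attempt.
        int[] failuresLeft = {2};
        int result = withRetry(3, () -> {
            if (failuresLeft[0]-- > 0) throw new RetryOOM();
            return 42;
        });
        System.out.println(result); // prints 42
    }
}
```

The point mirrored from this PR is the ordering: the inputs are made spillable before the loop is entered, so that when an OOM is caught there is memory that can actually be released before the next attempt.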

I am putting this up as a draft because I have not added unit tests yet, and there is still a bug in this code that causes us to leak column vectors in a few of the broadcast-nested-loop-join integration tests (and when running nds).
I also need to do a performance impact check.

@jbrennan333 jbrennan333 self-assigned this Mar 20, 2023
@jbrennan333 jbrennan333 added labels feature request (New feature or request) and improve reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) Mar 20, 2023
@jbrennan333 (Contributor Author) commented Mar 20, 2023

These are the current integration test failures where this code leaks column vectors:

../../src/main/python/join_test.py::test_right_broadcast_nested_loop_join_with_ast_condition[Inner-Boolean-1g][IGNORE_ORDER({'local': True})] 23/03/20 15:13:57 ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 6506 7fae5c361f80)

../../src/main/python/join_test.py::test_right_broadcast_nested_loop_join_with_ast_condition[LeftSemi-Boolean-1g][IGNORE_ORDER({'local': True})] 23/03/20 15:14:03 ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 17027 7fae6814f480)

@jbrennan333 (Contributor Author):

build

}
opTime.ns {
withResource(cb) { cb =>
val numJoinRows = computeNumJoinRows(cb)
withResource(scb) { scb =>
Collaborator:

When I run tests I see double-close exceptions here. I would have to do some more testing to see why this is causing it. I think there has to be a situation where scb is getting closed on an exception in something that is being called within this range.

Contributor Author:

Thanks for testing it! I am trying to repro this.
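The hazard being discussed here can be shown with a minimal Java sketch, since Scala's withResource behaves like Java's try-with-resources: it closes its argument even when the body throws, so if an outer handler also closes the same batch, the batch gets closed twice. The Resource class and failure site below are hypothetical stand-ins, not the plugin code.

```java
class WithResourceSketch {
    static class Resource implements AutoCloseable {
        int closeCount = 0;
        @Override public void close() { closeCount++; }
    }

    public static void main(String[] args) {
        Resource scb = new Resource();
        try {
            try (Resource r = scb) {       // closed here even though the body throws
                throw new RuntimeException("simulated failure inside the range");
            }
        } catch (RuntimeException e) {
            scb.close();                   // outer cleanup: the second close
        }
        System.out.println(scb.closeCount); // prints 2 - a double close
    }
}
```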

}
withResource(splits) { splits =>
val schema = GpuColumnVector.extractTypes(cb)
val tables = splits.map(_.getTable)
Collaborator:

Don't we need to close the tables that are returned here? And if so, then this should be a safeMap too, right?

Contributor Author:

I think this is covered by the withResource(splits). These tables are closed when we close the ContiguousTables.
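For illustration only, that ownership argument can be made concrete with a simplified Java sketch, assuming (as stated above) that the Table returned by getTable is a view owned by its parent ContiguousTable; these classes are hypothetical stand-ins, not the cudf API.

```java
import java.util.ArrayList;
import java.util.List;

class OwnershipSketch {
    static class Table {
        boolean closed = false;
        void close() { closed = true; }
    }

    // The parent owns the buffer; getTable() hands out a view, not ownership.
    static class ContiguousTable implements AutoCloseable {
        private final Table table = new Table();
        Table getTable() { return table; }
        @Override public void close() { table.close(); }
    }

    public static void main(String[] args) {
        List<ContiguousTable> splits = new ArrayList<>();
        splits.add(new ContiguousTable());
        splits.add(new ContiguousTable());
        List<Table> tables = new ArrayList<>();
        try {
            for (ContiguousTable ct : splits) tables.add(ct.getTable());
            // ... use the tables while the parents are still open ...
        } finally {
            for (ContiguousTable ct : splits) ct.close(); // closes the views too
        }
        System.out.println(tables.get(0).closed && tables.get(1).closed); // prints true
    }
}
```

Under this model, closing each returned Table separately would be the double close; closing the parents once is the whole cleanup.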

Contributor Author:

There was a bug here - thanks!

@jbrennan333 (Contributor Author):

I am going to split this draft PR into two parts. I will use this one for the createGatherer and related changes.
I have filed #7930 to address the nextCbFromGatherer changes. I am going to focus on #7930 first, because it is a smaller change and I think it has more impact (it covers the ai.rapids.cudf.Table.gather case and the getRowsInNextBatch case).

@jbrennan333 jbrennan333 changed the title Add oom retry handling for gpu hash joins Add oom retry handling for createGatherer in gpu hash joins Mar 24, 2023
@jbrennan333 (Contributor Author):

Testing I have done so far includes:

  • I have verified correct results for a power run on my desktop at scale 100 with forced retries in both createGatherer implementations.
  • I have run the full join integration test suite with forced retry OOMs in both createGatherer implementations, and there were no failures.
  • I ran a performance check on spark2a and there was a small change:
    Name = benchmark
    Means = 487719.9972629547, 489457.8136444092
    Time diff = -1737.8163814544678
    Speedup = 0.9964495073262494

@jbrennan333 jbrennan333 marked this pull request as ready for review March 30, 2023 13:15
@jbrennan333 (Contributor Author):

build

revans2 previously approved these changes Mar 30, 2023

@revans2 (Collaborator) left a comment:

All of this looks great. I just want to run a few tests locally myself too.

@revans2 (Collaborator) commented Mar 30, 2023

I am seeing a few failures related to close being called too many times, specifically when running NDS query 40 with only 4 GiB of memory. I saw 9 failures related to this, and all of them had the exact same stack trace. If you need or want more help debugging this, please let me know.

23/03/30 16:40:43 WARN TaskSetManager: Lost task 8.0 in stage 900.0 (TID 3628) (10.28.9.123 executor 0): java.lang.IllegalStateException: Close called too many times ColumnVector{rows=22865493, type=INT32, nullCount=Optional.empty, offHeap=(ID: 957712 0)}
        at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:224)
        at com.nvidia.spark.rapids.GpuColumnVector.close(GpuColumnVector.java:1124)
        at org.apache.spark.sql.vectorized.ColumnarBatch.close(ColumnarBatch.java:44)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$close$1(JoinGatherer.scala:311)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$close$1$adapted(JoinGatherer.scala:311)
        at scala.Option.foreach(Option.scala:407)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.close(JoinGatherer.scala:311)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:56)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:54)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$1(AbstractGpuJoinIterator.scala:214)
        at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:131)
        at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:214)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:96)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)

@jbrennan333 (Contributor Author):

Thanks @revans2! I will see if I can reproduce this.

@jbrennan333 jbrennan333 linked an issue Mar 31, 2023 that may be closed by this pull request
@jbrennan333 (Contributor Author):

@revans2 I have merged up to include your latest changes, and added a possible fix for the double-close issue. Can you please try again with this version when you get a chance? I'm still having trouble reproducing locally.

@jbrennan333 (Contributor Author):

build

@jbrennan333 (Contributor Author):

build

@abellina abellina self-requested a review March 31, 2023 22:07
@jbrennan333 (Contributor Author):

@revans2 I was finally able to reproduce the inc-after-close and double-close by forcing an OOM during the checkpoint of the stream batch.

Moving the checkpoint calls (which call allowSpilling for LazySpillableColumnarBatch) outside of the try block was part of the fix - I no longer get the inc-after-close, but I was still hitting the double-close.

The other part of the fix is in LazySpillableColumnarBatchImpl.allowSpilling(). It needs to ensure it clears the cached value if we throw while creating the SpillableColumnarBatch. I made this change for LazySpillableGatherMapImpl.allowSpilling() as well.
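A simplified Java sketch of that second part of the fix, with Batch, makeSpillable, and the failure flag as hypothetical stand-ins for the Scala originals: if the conversion consumes the cached batch even when it throws, clearing the cached reference around the call that may throw guarantees a later close() cannot close the same batch twice.

```java
class AllowSpillingSketch {
    static class Batch {
        int closeCount = 0;
        void close() { closeCount++; }
    }

    Batch cached = new Batch();

    // Hypothetical stand-in for SpillableColumnarBatch construction: it
    // consumes (closes) the batch even when it fails partway through.
    static void makeSpillable(Batch b, boolean fail) {
        b.close();
        if (fail) throw new RuntimeException("simulated OOM during spill");
    }

    void allowSpilling(boolean fail) {
        if (cached == null) return;
        Batch b = cached;
        cached = null; // clear the cache BEFORE the call that may throw
        try {
            makeSpillable(b, fail);
        } catch (RuntimeException e) {
            // cached is already null, so close() below cannot touch b again
        }
    }

    void close() {
        if (cached != null) {
            cached.close();
            cached = null;
        }
    }

    public static void main(String[] args) {
        AllowSpillingSketch s = new AllowSpillingSketch();
        Batch b = s.cached;
        s.allowSpilling(true); // simulated failure while creating the spillable batch
        s.close();             // must not close b a second time
        System.out.println(b.closeCount); // prints 1
    }
}
```

Clearing the reference up front is one way to keep the invariant; clearing it when the exception is thrown, as described above, reaches the same end state.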

I think this might fix the double-close reported in #7581

Thanks again for testing this and providing logs to help me debug it.

@jbrennan333 (Contributor Author):

build

@jbrennan333 (Contributor Author):

build

@revans2 (Collaborator) commented Apr 3, 2023

Looks good. I am seeing a few leaked columns when I do my testing with really low memory, but none of them are even close to the join code, so I think we are good. I'll try to track the others down and file something separately for them.

@revans2 revans2 merged commit 0b6dd14 into NVIDIA:branch-23.04 Apr 3, 2023
@jbrennan333 (Contributor Author):

Thanks for the reviews and testing @revans2 and @abellina!

Labels
feature request (New feature or request), improve reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)
Development

Successfully merging this pull request may close these issues.

[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOM retry framework
3 participants