
Add oom retry handling for createGatherer in gpu hash joins #7902

Merged
merged 15 commits into NVIDIA:branch-23.04 on Apr 3, 2023

Conversation

@jbrennan333 (Contributor) commented Mar 20, 2023

This adds OOM retry handling for GPU hash joins. It is part of the work described in #7255.

This uses the RmmRapidsRetryIterator framework to add retries for `createGatherer`.

For `createGatherer`, some refactoring was done to ensure the buffers are spillable before we go into a retry loop. There are two paths for the stream side: the original LazySpillableColumnarBatch from the stream-side iterator, and the pending list (from prior splits), which was a SpillableColumnarBatch. For this PR, I chose to make all of these LazySpillableColumnarBatches, because that is what we ultimately hand off to the gatherer.
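The overall shape of that retry loop can be sketched as follows. This is an illustrative Java sketch, not the plugin code (which is Scala and uses the RmmRapidsRetryIterator framework with a real device OOM type); `RetryOOM`, `withRetry`, and the retry limit are hypothetical stand-ins.

```java
import java.util.function.Supplier;

class RetrySketch {
    // Hypothetical stand-in for the OOM the device allocator can raise.
    static class RetryOOM extends RuntimeException {}

    // Retry the attempt up to maxRetries extra times when it OOMs.
    static <T> T withRetry(int maxRetries, Supplier<T> attempt) {
        for (int i = 0; ; i++) {
            try {
                return attempt.get();
            } catch (RetryOOM oom) {
                if (i >= maxRetries) throw oom;
                // The real framework spills the (already spillable) buffers
                // back to host memory here before the next attempt.
            }
        }
    }

    public static void main(String[] args) {
        // Fail twice, then succeed on the third attempt.
        int[] failuresLeft = {2};
        int result = withRetry(3, () -> {
            if (failuresLeft[0]-- > 0) throw new RetryOOM();
            return 42;
        });
        System.out.println(result); // prints 42
    }
}
```

The point mirrored from this PR is the ordering: the inputs are made spillable before the loop is entered, so that when an OOM is caught there is memory that can actually be released before the next attempt.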

I am putting this up as a draft because I have not added unit tests yet, and there is still a bug in this code that causes us to leak column vectors in a few of the broadcast-nested-loop-join integration tests (and when running nds).
I also need to do a performance impact check.

@jbrennan333 jbrennan333 self-assigned this Mar 20, 2023
@jbrennan333 jbrennan333 added labels feature request (New feature or request) and improve reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) Mar 20, 2023
@jbrennan333 (Contributor Author) commented Mar 20, 2023

These are the current integration test failures where this code leaks column vectors:

../../src/main/python/join_test.py::test_right_broadcast_nested_loop_join_with_ast_condition[Inner-Boolean-1g][IGNORE_ORDER({'local': True})] 23/03/20 15:13:57 ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 6506 7fae5c361f80)

../../src/main/python/join_test.py::test_right_broadcast_nested_loop_join_with_ast_condition[LeftSemi-Boolean-1g][IGNORE_ORDER({'local': True})] 23/03/20 15:14:03 ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 17027 7fae6814f480)

@jbrennan333 (Contributor Author):

build

}
opTime.ns {
withResource(cb) { cb =>
val numJoinRows = computeNumJoinRows(cb)
withResource(scb) { scb =>
Collaborator:

When I run tests I see double-close exceptions here. I would have to do some more testing to see why this is causing it. I think there has to be a situation where scb is getting closed on an exception in something that is being called within this range.

Contributor Author:

Thanks for testing it! I am trying to repro this.
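The hazard being discussed here can be shown with a minimal Java sketch, since Scala's withResource behaves like Java's try-with-resources: it closes its argument even when the body throws, so if an outer handler also closes the same batch, the batch gets closed twice. The Resource class and failure site below are hypothetical stand-ins, not the plugin code.

```java
class WithResourceSketch {
    static class Resource implements AutoCloseable {
        int closeCount = 0;
        @Override public void close() { closeCount++; }
    }

    public static void main(String[] args) {
        Resource scb = new Resource();
        try {
            try (Resource r = scb) {       // closed here even though the body throws
                throw new RuntimeException("simulated failure inside the range");
            }
        } catch (RuntimeException e) {
            scb.close();                   // outer cleanup: the second close
        }
        System.out.println(scb.closeCount); // prints 2 - a double close
    }
}
```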

}
withResource(splits) { splits =>
val schema = GpuColumnVector.extractTypes(cb)
val tables = splits.map(_.getTable)
Collaborator:

Don't we need to close the tables that are returned here? And if so, then this should be a safeMap too, right?

Contributor Author:

I think this is covered by the withResource(splits). These tables are closed when we close the ContiguousTables.
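For illustration only, that ownership argument can be made concrete with a simplified Java sketch, assuming (as stated above) that the Table returned by getTable is a view owned by its parent ContiguousTable; these classes are hypothetical stand-ins, not the cudf API.

```java
import java.util.ArrayList;
import java.util.List;

class OwnershipSketch {
    static class Table {
        boolean closed = false;
        void close() { closed = true; }
    }

    // The parent owns the buffer; getTable() hands out a view, not ownership.
    static class ContiguousTable implements AutoCloseable {
        private final Table table = new Table();
        Table getTable() { return table; }
        @Override public void close() { table.close(); }
    }

    public static void main(String[] args) {
        List<ContiguousTable> splits = new ArrayList<>();
        splits.add(new ContiguousTable());
        splits.add(new ContiguousTable());
        List<Table> tables = new ArrayList<>();
        try {
            for (ContiguousTable ct : splits) tables.add(ct.getTable());
            // ... use the tables while the parents are still open ...
        } finally {
            for (ContiguousTable ct : splits) ct.close(); // closes the views too
        }
        System.out.println(tables.get(0).closed && tables.get(1).closed); // prints true
    }
}
```

Under this model, closing each returned Table separately would be the double close; closing the parents once is the whole cleanup.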

Contributor Author:

There was a bug here - thanks!

@jbrennan333 (Contributor Author):

I am going to split this draft PR into two parts. I will use this one for the createGatherer and related changes.
I have filed #7930 to address the nextCbFromGatherer changes. I am going to focus on #7930 first, because it is a smaller change and I think it has more impact (it covers the ai.rapids.cudf.Table.gather case and the getRowsInNextBatch case).

@jbrennan333 jbrennan333 changed the title Add oom retry handling for gpu hash joins Add oom retry handling for createGatherer in gpu hash joins Mar 24, 2023
@jbrennan333 (Contributor Author):

Testing I have done so far includes:

  • I have verified correct results for a power run on my desktop at scale 100 with forced retries in both createGatherer implementations.
  • I have run the full join integration test suite with forced retry OOMs in both createGatherer implementations, and there were no failures.
  • I ran a performance check on spark2a and there was a small change:
    Name = benchmark
    Means = 487719.9972629547, 489457.8136444092
    Time diff = -1737.8163814544678
    Speedup = 0.9964495073262494

@jbrennan333 jbrennan333 marked this pull request as ready for review March 30, 2023 13:15
@jbrennan333 (Contributor Author):

build

revans2 previously approved these changes Mar 30, 2023

@revans2 (Collaborator) left a comment:

All of this looks great. I just want to run a few tests locally myself too.

@revans2 (Collaborator) commented Mar 30, 2023

I am seeing a few failures related to close being called too many times, specifically when running NDS query 40 with only 4 GiB of memory. I saw 9 failures related to this, and all of them had the exact same stack trace. If you need or want more help debugging this, please let me know.

23/03/30 16:40:43 WARN TaskSetManager: Lost task 8.0 in stage 900.0 (TID 3628) (10.28.9.123 executor 0): java.lang.IllegalStateException: Close called too many times ColumnVector{rows=22865493, type=INT32, nullCount=Optional.empty, offHeap=(ID: 957712 0)}
        at ai.rapids.cudf.ColumnVector.close(ColumnVector.java:224)
        at com.nvidia.spark.rapids.GpuColumnVector.close(GpuColumnVector.java:1124)
        at org.apache.spark.sql.vectorized.ColumnarBatch.close(ColumnarBatch.java:44)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$close$1(JoinGatherer.scala:311)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.$anonfun$close$1$adapted(JoinGatherer.scala:311)
        at scala.Option.foreach(Option.scala:407)
        at com.nvidia.spark.rapids.LazySpillableColumnarBatchImpl.close(JoinGatherer.scala:311)
        at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableColumn.safeClose(implicits.scala:56)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:54)
        at com.nvidia.spark.rapids.SplittableJoinIterator.$anonfun$setupNextGatherer$1(AbstractGpuJoinIterator.scala:214)
        at com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:131)
        at com.nvidia.spark.rapids.SplittableJoinIterator.setupNextGatherer(AbstractGpuJoinIterator.scala:214)
        at com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:96)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)

@jbrennan333 (Contributor Author):

Thanks @revans2! I will see if I can reproduce this.

@jbrennan333 jbrennan333 linked an issue Mar 31, 2023 that may be closed by this pull request
@jbrennan333 (Contributor Author):

@revans2 I have merged up to include your latest changes, and added a possible fix for the double-close issue. Can you please try again with this version when you get a chance? I'm still having trouble reproducing locally.

@jbrennan333 (Contributor Author):

build

@jbrennan333 (Contributor Author):

build

@abellina abellina self-requested a review March 31, 2023 22:07
@jbrennan333 (Contributor Author):

@revans2 I was finally able to reproduce the inc-after-close and double-close by forcing an OOM during the checkpoint of the stream batch.

Moving the checkpoint calls (which call allowSpilling for LazySpillableColumnarBatch) outside of the try block was part of the fix - I no longer get the inc-after-close, but I was still hitting the double-close.

The other part of the fix is in LazySpillableColumnarBatchImpl.allowSpilling(). It needs to ensure it clears the cached value if we throw while creating the SpillableColumnarBatch. I made this change for LazySpillableGatherMapImpl.allowSpilling() as well.
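A simplified Java sketch of that second part of the fix, with Batch, makeSpillable, and the failure flag as hypothetical stand-ins for the Scala originals: if the conversion consumes the cached batch even when it throws, clearing the cached reference around the call that may throw guarantees a later close() cannot close the same batch twice.

```java
class AllowSpillingSketch {
    static class Batch {
        int closeCount = 0;
        void close() { closeCount++; }
    }

    Batch cached = new Batch();

    // Hypothetical stand-in for SpillableColumnarBatch construction: it
    // consumes (closes) the batch even when it fails partway through.
    static void makeSpillable(Batch b, boolean fail) {
        b.close();
        if (fail) throw new RuntimeException("simulated OOM during spill");
    }

    void allowSpilling(boolean fail) {
        if (cached == null) return;
        Batch b = cached;
        cached = null; // clear the cache BEFORE the call that may throw
        try {
            makeSpillable(b, fail);
        } catch (RuntimeException e) {
            // cached is already null, so close() below cannot touch b again
        }
    }

    void close() {
        if (cached != null) {
            cached.close();
            cached = null;
        }
    }

    public static void main(String[] args) {
        AllowSpillingSketch s = new AllowSpillingSketch();
        Batch b = s.cached;
        s.allowSpilling(true); // simulated failure while creating the spillable batch
        s.close();             // must not close b a second time
        System.out.println(b.closeCount); // prints 1
    }
}
```

Clearing the reference up front is one way to keep the invariant; clearing it when the exception is thrown, as described above, reaches the same end state.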

I think this might fix the double-close reported in #7581

Thanks again for testing this and providing logs to help me debug it.

@jbrennan333 (Contributor Author):

build

@jbrennan333 (Contributor Author):

build

@revans2 (Collaborator) commented Apr 3, 2023

Looks good. I am seeing a few leaked columns when I do my testing with really low memory, but none of them are even close to the join code, so I think we are good. I'll try to track the others down and file something separately for them.

@revans2 revans2 merged commit 0b6dd14 into NVIDIA:branch-23.04 Apr 3, 2023
@jbrennan333 (Contributor Author):

Thanks for the reviews and testing @revans2 and @abellina!

Labels
feature request (New feature or request), improve reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)
Development

Successfully merging this pull request may close these issues.

[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOM retry framework
3 participants