Add oom retry handling for createGatherer in gpu hash joins #7902
Conversation
Signed-off-by: Jim Brennan <[email protected]>
b450971 to 4b2fa30
These are the integration tests that leak column vectors with this code:
Resolved review threads (outdated):
- ...rc/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinExecBase.scala
- sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala
- ...rc/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinExecBase.scala
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/AbstractGpuJoinIterator.scala
build
}
opTime.ns {
  withResource(cb) { cb =>
    val numJoinRows = computeNumJoinRows(cb)
    withResource(scb) { scb =>
When I run tests I see double-close exceptions here. I would have to do some more testing to see why this causes it. I think there has to be a situation where scb is getting closed on an exception in something that is being called within this range.
Thanks for testing it! I am trying to repro this.
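A minimal, self-contained sketch of how this kind of double close can happen (toy types and names, not the plugin's actual code): if a callee closes a resource when it fails while an enclosing `withResource` also closes it on the way out, `close()` runs twice.

```scala
object DoubleCloseSketch {
  // Toy batch that detects a second close, like the ref-count checks in cudf.
  final class Batch extends AutoCloseable {
    private var closed = false
    override def close(): Unit = {
      if (closed) throw new IllegalStateException("double close!")
      closed = true
    }
  }

  // Simplified stand-in for the plugin's withResource helper.
  def withResource[A <: AutoCloseable, B](r: A)(f: A => B): B =
    try f(r) finally r.close()

  // A callee that (incorrectly) closes its input on failure even though the
  // caller still owns it -- the suspected shape of the bug discussed above.
  def riskyGather(b: Batch): Unit =
    try throw new RuntimeException("simulated OOM")
    catch { case t: Throwable => b.close(); throw t }

  def main(args: Array[String]): Unit =
    try withResource(new Batch)(riskyGather)
    catch { case t: Throwable => println(t) } // prints the double-close error
}
```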
}
withResource(splits) { splits =>
  val schema = GpuColumnVector.extractTypes(cb)
  val tables = splits.map(_.getTable)
Don't we need to close the tables that are returned here? And if so, this should be a `safeMap` too, right?
I think this is covered by the withResource(splits). These tables are closed when we close the ContiguousTables.
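For context, a minimal, self-contained sketch of the map-vs-safeMap distinction behind this question (toy stand-ins; in spark-rapids, `safeMap` is the helper that closes already-produced results if the mapping function throws partway through). It only matters when the caller owns what the map produces, which is why plain `map` is fine above: the tables stay owned by the `ContiguousTable`s in `splits`.

```scala
object SafeMapSketch {
  final class Resource(val id: Int) extends AutoCloseable {
    override def close(): Unit = println(s"closed $id")
  }

  // Toy stand-in for spark-rapids' safeMap: on failure, close everything
  // produced so far before rethrowing, so nothing leaks.
  def safeMap[A, B <: AutoCloseable](xs: Seq[A])(f: A => B): Seq[B] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[B]
    try {
      xs.foreach(x => out += f(x))
      out.toSeq
    } catch {
      case t: Throwable =>
        out.foreach(_.close()) // a plain .map would leak these
        throw t
    }
  }

  def main(args: Array[String]): Unit = {
    try {
      safeMap(Seq(1, 2, 3)) { i =>
        if (i == 3) throw new IllegalStateException("boom")
        new Resource(i)
      }
    } catch {
      case _: IllegalStateException => () // "closed 1" and "closed 2" were printed
    }
  }
}
```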
There was a bug here - thanks!
I am going to split this draft PR into two parts. I will use this one for the …
Testing I have done so far includes:
build
All of this looks great. I just want to run a few tests locally myself too.
I am seeing a few failures related to close being called too many times, specifically when running NDS query 40 with only 4 GiB of memory. I saw 9 failures related to this, and all of them have the exact same stack trace. If you need or want more help debugging this, please let me know.
Thanks @revans2! I will see if I can reproduce this.
@revans2 I have merged up to include your latest changes, and added a possible fix for the double-close issue. Can you please try again with this version when you get a chance? I'm still having trouble reproducing locally.
build
build
…ves a stale cached value
@revans2 I was finally able to reproduce the inc-after-close and double-close by forcing an oom during the checkpoint of the stream batch. Moving the … The other part of the fix is in …. I think this might fix the double-close reported in #7581. Thanks again for testing this and providing logs to help me debug it.
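A self-contained sketch of the checkpoint/restore idea this fix relies on (illustrative stand-ins, not the actual `RmmRapidsRetryIterator` API): checkpoint the input once before any attempt, and roll back to that checkpoint after a retryable OOM, so every attempt starts from the same state instead of re-checkpointing a partially mutated one.

```scala
object RetrySketch {
  // Toy state that can be checkpointed and rolled back, standing in for a
  // checkpointable spillable batch.
  final class Checkpointable {
    private var state = 0
    private var saved = 0
    def checkpoint(): Unit = saved = state
    def restore(): Unit = state = saved
    def mutate(): Unit = state += 1
    def value: Int = state
  }

  class RetryOOM extends RuntimeException("retry OOM")

  // Checkpoint once, *before* the first attempt; restore after each failure
  // so retries never see state leaked from a previous attempt.
  def withRetry[T](input: Checkpointable)(fn: Checkpointable => T): T = {
    input.checkpoint()
    var result: Option[T] = None
    while (result.isEmpty) {
      try result = Some(fn(input))
      catch { case _: RetryOOM => input.restore() }
    }
    result.get
  }

  def main(args: Array[String]): Unit = {
    var attempts = 0
    val out = withRetry(new Checkpointable) { c =>
      c.mutate() // would corrupt later attempts without restore()
      attempts += 1
      if (attempts < 3) throw new RetryOOM
      s"gathered with state ${c.value}"
    }
    println(s"$out after $attempts attempts") // state is 1 on every attempt
  }
}
```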
build
build
Looks good. I am seeing a few leaked columns when I do my testing with really low memory, but none of them are even close to the join code, so I think we are good. I'll try to track the others down and file something separately for them.
This adds oom retry handling for gpu hash joins. It is part of the work described in #7255.

This uses the `RmmRapidsRetryIterator` framework to add retries for `createGatherer`.

For `createGatherer`, some refactoring was done to ensure the buffers are spillable before we go into a retry loop. There are two paths for the stream side: the original `LazySpillableColumnarBatch` from the stream-side iterator, and the pending list (from prior splits), which was a `SpillableColumnarBatch`. For this PR, I chose to make all of these `LazySpillableColumnarBatch`es, because that is what we ultimately hand off to the gatherer.

I am putting this up as a draft because I have not added unit tests yet, and there is still a bug in this code that causes us to leak column vectors in a few of the broadcast-nested-loop-join integration tests (and when running NDS).

I also need to do a performance impact check.
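As a rough illustration of the "make the buffers spillable before the retry loop" point (toy types; `LazySpillableColumnarBatch` in spark-rapids is the real thing and registers buffers with the spill framework), a lazily-spillable wrapper defers that registration until it is actually needed, so the no-retry fast path pays nothing extra:

```scala
object LazySpillSketch {
  final class DeviceBatch(val rows: Int) extends AutoCloseable {
    override def close(): Unit = ()
  }

  // Toy lazily-spillable wrapper: the underlying batch is only made
  // spillable the first time allowSpilling() is called.
  final class LazySpillable(batch: DeviceBatch) extends AutoCloseable {
    private var spillable = false
    def allowSpilling(): Unit =
      if (!spillable) {
        // The real plugin would hand the device buffers to the spill
        // framework here; this sketch just records the transition.
        spillable = true
      }
    def isSpillable: Boolean = spillable
    def getBatch: DeviceBatch = batch
    override def close(): Unit = batch.close()
  }

  def main(args: Array[String]): Unit = {
    val stream = new LazySpillable(new DeviceBatch(1024))
    // Make the inputs spillable *before* entering the retry loop, so a
    // retried OOM can actually reclaim this memory by spilling it.
    stream.allowSpilling()
    println(s"spillable=${stream.isSpillable}, rows=${stream.getBatch.rows}")
  }
}
```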