-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use withRetry in GpuCoalesceBatches #7852
Conversation
Signed-off-by: Alessandro Bellina <[email protected]>
build |
1 similar comment
build |
build |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCoalesceBatches.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit and a question now.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCoalesceBatches.scala
Show resolved
Hide resolved
build |
I am looking into the build failure, it is the smoke test for UCX. |
So what is happening in the smoke test is we fail to come up with UCX because we fail to allocate bounce buffers. The cause is likely that a previous test is still on the GPU, taking all the memory and causing this failure. I don't have a solution yet... |
build |
build |
build |
@revans2 this should be ready to go in. Run is clean, and all the directories show no leaks or double closes. nvidia-smi shows that we ended up in a V100 @ 32GB so I think the multiple app theory causing a UCX OOM holds. |
Closes #7777
Closes #7855
This adds retry/retry+split semantics to the GpuCoalesceBatches iterators (two of them). It does not handle
HostToGpuCoalesceIterator
because the column builder code in cuDF needs some changes for it to be retriable. Note #7851.This introduces a
withRetryNoSplit
that doesn't take any arguments, it's simply a retry block:Additionally, this PR adjusts the parallelism for the UCX shuffle smoke test to 1, so we don't run the risk of OOMing.