
[FEA] Add synchronize and retry before OOM with async allocator #6768

Closed
abellina opened this issue Oct 12, 2022 · 0 comments · Fixed by #6849
Labels: reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

abellina (Collaborator) commented:

In a dev build, I added a `Cuda.deviceSynchronize` call in `DeviceMemoryEventHandler.onAllocFailure` for the case where we are unable to handle a failed allocation (we can't spill anymore), and a customer query then passed consistently.

Here is one example showing only 15GB allocated, yet we are unable to allocate 2.6GB in a 40GB pool:

Device store exhausted, unable to allocate 2570310640 bytes. Total RMM allocated is 15015998208 bytes

After the synchronize, a retry succeeded. This could be related to the internals of how the ASYNC allocator works, or it could be partly because the limiting memory resource and the async memory resource it wraps are not in sync. It is worth noting that I tried a similar experiment with ARENA: I still hit OOM, and a synchronize doesn't seem to help in that case.

So, I believe there is value in making a change either in the plugin or in RmmJni.cpp.

Essentially what I think we want is:

  1. Try to handle the failure via spilling the regular way (this could also be tweaked, see below).
  2. When we can't spill anymore (https://github.com/NVIDIA/spark-rapids/blob/branch-22.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/DeviceMemoryEventHandler.scala#L52):
    a. If we haven't retried this allocation more than N times (some sort of count or time-based limit), call `Cuda.deviceSynchronize` and then allow the calling code to retry.
    b. If we have gone above our retry limits, we can let the executor fail as before.

We'll need to reset the limit if the job starts succeeding again without help. That said, there could be many threads failing and spilling, with several concurrent retries, so one approach may be a time-based retry: if the last failure for any thread was more than, say, 1 or 2 seconds ago, we can retry again; otherwise we know we retried and are still retrying, and it is probably best to give up. Keeping a retry counter per task thread also seems like an option. I'd like to hear input on preferences among the above.
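
To make the shape of this concrete, here is a minimal sketch of the sync-and-retry bookkeeping with a time-based reset. The retry limit, the time window, and the `trySpill` helper are assumptions for illustration only, not the plugin's actual code:

```scala
// Sketch of the proposed sync-and-retry logic, not the actual plugin code.
import java.util.concurrent.atomic.AtomicLong
import ai.rapids.cudf.Cuda

object SyncAndRetrySketch {
  private val maxRetries = 3           // hypothetical retry limit
  private val retryWindowMs = 2000L    // hypothetical quiet period before resetting
  private val lastFailureMs = new AtomicLong(0L)
  private val retriesSoFar = new AtomicLong(0L)

  /** Returning true asks RMM to retry the failed allocation. */
  def onAllocFailure(sizeRequested: Long): Boolean = {
    if (trySpill(sizeRequested)) {
      true // step 1: spilling freed memory, retry as usual
    } else {
      val now = System.currentTimeMillis()
      // reset the retry budget if no thread has failed recently
      if (now - lastFailureMs.getAndSet(now) > retryWindowMs) {
        retriesSoFar.set(0L)
      }
      if (retriesSoFar.incrementAndGet() <= maxRetries) {
        // step 2a: nothing left to spill; synchronize so pending async frees
        // become visible, then let the caller retry
        Cuda.deviceSynchronize()
        true
      } else {
        false // step 2b: past the retry limit, let the OOM propagate
      }
    }
  }

  // Placeholder for the existing spill path.
  private def trySpill(needed: Long): Boolean = false
}
```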

I believe we should also experiment with a synchronize before we try to spill, to see if step (1) above could be avoided in some cases. If the failures are due to fragmentation that a sync can resolve, syncing is probably cheaper than spilling in some scenarios.

abellina added the `feature request`, `? - Needs Triage`, and `reliability` labels on Oct 12, 2022.
mattahrens removed the `? - Needs Triage` label on Oct 18, 2022.
rapids-bot pushed a commit to rapidsai/cudf that referenced this issue on Oct 18, 2022:
This adds the method `boolean onAllocFailure(long sizeRequested, int retryCount)` to `RmmEventHandler`, to help the handling code keep track of the number of times an allocation failure has been retried.

With this, callers can perform extra logic depending on whether the callback is for a brand-new allocation failure or for one that has failed in the past and is being retried.

This will be used here: NVIDIA/spark-rapids#6768

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11940
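
For reference, a handler built on the new two-argument callback might look roughly like the following. This is a minimal Scala sketch: the retry limit and the no-op threshold callbacks are assumptions, not the plugin's actual implementation.

```scala
// Sketch of a handler that uses the retryCount argument added in rapidsai/cudf#11940.
import ai.rapids.cudf.{Cuda, RmmEventHandler}

class SyncOnFailureHandler(maxRetries: Int = 2) extends RmmEventHandler {

  // Called by RMM on allocation failure; retryCount says how many times this
  // same allocation has already been retried.
  override def onAllocFailure(sizeRequested: Long, retryCount: Int): Boolean = {
    if (retryCount < maxRetries) {
      Cuda.deviceSynchronize() // let pending async frees complete
      true                     // ask RMM to retry the allocation
    } else {
      false                    // past the limit, surface the OOM
    }
  }

  // Older single-argument callback delegates to the new one.
  override def onAllocFailure(sizeRequested: Long): Boolean =
    onAllocFailure(sizeRequested, 0)

  // Threshold callbacks are unused in this sketch.
  override def getAllocThresholds: Array[Long] = null
  override def getDeallocThresholds: Array[Long] = null
  override def onAllocThreshold(totalAllocSize: Long): Unit = {}
  override def onDeallocThreshold(totalAllocSize: Long): Unit = {}
}
```

A handler along these lines would be registered with `Rmm.setEventHandler`; the real plugin logic would also fold in the spill path described above.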
sameerz removed the `feature request` label on Oct 20, 2022.