In a dev build, I added a `Cuda.deviceSynchronize` call in `DeviceMemoryEventHandler.onAllocFailure` for the case where we are unable to handle a failed allocation (we can't spill anymore), and I had a customer query pass consistently after doing this.
This is one such example, showing 15GB allocated and us unable to allocate 2.6GB in a 40GB pool:

```
Device store exhausted, unable to allocate 2570310640 bytes. Total RMM allocated is 15015998208 bytes
```
After the synchronize, a retry succeeded. This could be related to internals of how the ASYNC allocator works, or it could also be partly due to the limiting memory resource and the async memory resource (which is wrapped by the limiting memory resource) not being in sync. It is worth noting that I tried a similar change with ARENA: I still hit OOM, and a synchronize doesn't seem to help in that case.
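To make the experiment concrete, here is a rough Java sketch against the cudf Java API of the kind of tweak I mean. It is not the actual plugin code: `spillSomething` and the one-shot sync flag are made-up placeholders for what `DeviceMemoryEventHandler` really does.

```java
import ai.rapids.cudf.Cuda;
import ai.rapids.cudf.RmmEventHandler;

/**
 * Sketch only: once spilling can no longer free memory, synchronize the
 * device and let RMM retry the failed allocation one time before giving up.
 */
class SyncOnExhaustionHandler implements RmmEventHandler {
  private final ThreadLocal<Boolean> alreadySynced =
      ThreadLocal.withInitial(() -> false);

  @Override
  public boolean onAllocFailure(long sizeRequested) {
    if (spillSomething(sizeRequested)) {
      alreadySynced.set(false);
      return true;              // freed something, let RMM retry normally
    }
    if (!alreadySynced.get()) {
      Cuda.deviceSynchronize(); // give in-flight frees / the async pool a chance to catch up
      alreadySynced.set(true);
      return true;              // retry once after the sync
    }
    return false;               // out of options, let the allocation fail
  }

  /** Placeholder for the plugin's spill framework. */
  private boolean spillSomething(long sizeRequested) {
    return false;
  }

  @Override public long[] getAllocThresholds() { return null; }
  @Override public long[] getDeallocThresholds() { return null; }
  @Override public void onAllocThreshold(long totalAllocSize) {}
  @Override public void onDeallocThreshold(long totalAllocSize) {}
}
```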
So, I believe there is value in making a change either in the plugin or in `RmmJni.cpp`.
Essentially what I think we want is:

1. Try to handle the failure via spilling the regular way (this could also be tweaked, see below).
2. If spilling can no longer help:
   a. If we haven't retried this allocation more than N times (some sort of limit, or a time-based limit), `Cuda.deviceSynchronize` and then allow the calling code to retry.
   b. If we have gone above our retry limits, we can let the executor fail as before.
We'll need to reset the limit if the job starts succeeding again without help. That said, there could be many threads failing and spilling, and several concurrent retries, so one approach may be a time-based retry: if the last failure for any thread was more than, say, 1 or 2 seconds ago, we can retry again; otherwise, we know we retried and are still retrying, so it is probably best to give up. Perhaps we could keep a retry counter per task thread; that seems like an option as well. I'd like to hear input on preferences for the above.
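As a sketch of that bookkeeping (the window length, the cap, and the names are all assumptions, not a settled design): track the time of the last failure globally and a retry count per task thread, reset the count when things have been quiet for a while, and stop retrying once the count hits the cap.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical retry bookkeeping: a time-based reset plus a per-thread cap. */
class RetryTracker {
  private static final long RESET_WINDOW_NANOS = 2_000_000_000L; // ~2 seconds
  private static final int MAX_RETRIES = 3;                      // arbitrary cap

  private final AtomicLong lastFailureNanos = new AtomicLong(0);
  private final ThreadLocal<Integer> retriesOnThisThread =
      ThreadLocal.withInitial(() -> 0);

  /** Returns true if the caller should synchronize and retry the allocation. */
  boolean shouldRetry() {
    long now = System.nanoTime();
    long last = lastFailureNanos.getAndSet(now);
    if (now - last > RESET_WINDOW_NANOS) {
      // No failures for a while: assume earlier retries helped and start over.
      retriesOnThisThread.set(0);
    }
    int retries = retriesOnThisThread.get();
    if (retries >= MAX_RETRIES) {
      return false; // still failing after repeated retries, time to give up
    }
    retriesOnThisThread.set(retries + 1);
    return true;
  }
}
```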
I believe we should experiment with a synchronize before we try to spill, to see if (1) above could be prevented in some cases. If we are seeing fragmentation that can be handled via a sync, it's probably cheaper to sync than to spill in some scenarios.
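And a tiny sketch of what "synchronize before we try to spill" could look like, purely illustrative (the `Spiller` hook and the decision to sync only on the first attempt are assumptions):

```java
import ai.rapids.cudf.Cuda;

/** Illustrative ordering only: sync first, spill only if the sync wasn't enough. */
class SyncFirstPolicy {
  /** Hypothetical stand-in for the plugin's spill framework. */
  interface Spiller {
    boolean trySpill(long sizeRequested);
  }

  static boolean handleFailure(long sizeRequested, int retryCount, Spiller spiller) {
    if (retryCount == 0) {
      Cuda.deviceSynchronize();             // cheap first attempt: let pending frees land
      return true;                          // ask RMM to retry before spilling anything
    }
    return spiller.trySpill(sizeRequested); // only spill once a plain sync didn't help
  }
}
```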
This adds the method `boolean onAllocFailure(long sizeRequested, int retryCount)` to `RmmEventHandler`, to help handling code keep track of the number of times an allocation failure has been retried.
With this change, callers can perform extra logic that depends on whether the callback was triggered by a brand-new allocation failure or by one that has failed in the past and is being retried.
This will be used here: NVIDIA/spark-rapids#6768
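As a rough illustration (not code from this PR) of how handling code might use the new `retryCount` argument, with an arbitrary example limit; both overloads are implemented here in case the single-argument form is still required by the interface:

```java
import ai.rapids.cudf.RmmEventHandler;

/** Example only: use retryCount to distinguish fresh failures from retries. */
class RetryAwareHandler implements RmmEventHandler {
  private static final int MAX_RETRIES = 3; // arbitrary example limit

  @Override
  public boolean onAllocFailure(long sizeRequested) {
    return onAllocFailure(sizeRequested, 0); // treat as a first attempt
  }

  @Override
  public boolean onAllocFailure(long sizeRequested, int retryCount) {
    if (retryCount == 0) {
      // Brand new failure: try the normal recovery path (e.g. spilling).
      return true;
    }
    // Being retried: only keep going up to the example limit.
    return retryCount < MAX_RETRIES;
  }

  @Override public long[] getAllocThresholds() { return null; }
  @Override public long[] getDeallocThresholds() { return null; }
  @Override public void onAllocThreshold(long totalAllocSize) {}
  @Override public void onDeallocThreshold(long totalAllocSize) {}
}
```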
Authors:
- Alessandro Bellina (https://github.com/abellina)
Approvers:
- Jason Lowe (https://github.com/jlowe)
- Nghia Truong (https://github.com/ttnghia)
URL: #11940