[FEA] Implement OOM retry framework #7253
Labels
- feature request: New feature or request
- reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
Is your feature request related to a problem? Please describe.
Currently, memory on the GPU is managed mostly by convention and by the GpuSemaphore. The GpuSemaphore allows a configured number of tasks onto the GPU at any one point in time, but it does not explicitly track or hand out memory to these tasks. By convention, different execution paths assume that they can use 4x the target batch size without any issues, and also assume that the input batch size is <= the target batch size. There is also no way to request more memory if an operation knows that it will use more memory than is currently available.
Describe the solution you'd like
Create a GpuMemoryLeaseManager (GMLM), or update the GpuSemaphore, to provide the following APIs.
A MemoryLease would be AutoCloseable and would return the memory to the GMLM when it is closed. The GMLM is an arbitrator: it is not intended to actually allocate any memory, just to reduce the load on the GPU if multiple operations would need more memory than is currently available. This matters for cases like a join or a window operation, where today we cannot guarantee that the memory used will stay under the 4x batch size limit. The goal is to eventually update all operators so that the limit is not set by convention, but is a value that can dynamically change if needed.
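To make the proposed API concrete, here is a minimal, hypothetical sketch of what MemoryLease and the GMLM could look like. All names and signatures here are assumptions for illustration, not the final design, and the sketch simply grants or rejects instead of queuing:

```java
// Hypothetical sketch only: the manager tracks a byte budget, it never
// allocates GPU memory itself. Names/signatures are assumptions.
final class MemoryLease implements AutoCloseable {
    private final GpuMemoryLeaseManager manager;
    private final long bytes;
    private boolean closed = false;

    MemoryLease(GpuMemoryLeaseManager manager, long bytes) {
        this.manager = manager;
        this.bytes = bytes;
    }

    long leasedBytes() { return bytes; }

    @Override
    public void close() {
        // Returning the memory to the arbitrator when the lease is closed.
        if (!closed) {
            closed = true;
            manager.release(bytes);
        }
    }
}

final class GpuMemoryLeaseManager {
    private long availableBytes; // budget only; no real allocation happens here

    GpuMemoryLeaseManager(long totalBytes) { this.availableBytes = totalBytes; }

    synchronized MemoryLease requestLease(long bytes) {
        // A real implementation would queue the request; the sketch just fails.
        if (bytes > availableBytes) {
            throw new IllegalStateException("request would need to queue: " + bytes);
        }
        availableBytes -= bytes;
        return new MemoryLease(this, bytes);
    }

    synchronized void release(long bytes) { availableBytes += bytes; }

    synchronized long available() { return availableBytes; }
}
```

Because the lease is AutoCloseable, operators could hold it in a try-with-resources block so the memory is handed back even on failure paths.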
This is not intended to replace the efforts we have made for out of core algorithms. Those are still needed even on very large memory GPUs because CUDF still has column size limitations.
When a SparkPlan node wants to run on the GPU, it will ask the GMLM what its current budget is. It will also estimate how much memory it will need to complete the operation at hand. If the memory needed is more than the current lease, another lease for more memory will be requested. In order to make that request, the SparkPlan node must first make sure that all of the memory it is currently using is spillable.
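The flow an operator might follow is sketched below. Everything here is a placeholder for APIs that do not exist yet (the estimate, the spillable step, and the shortfall computation are all assumptions about how an operator would use the GMLM):

```java
// Hypothetical operator-side protocol: check budget, estimate, make memory
// spillable, then request only the shortfall. Sizes are example values.
final class OperatorFlowSketch {
    long currentLeaseBytes = 4L * 1024 * 1024 * 1024; // e.g. 4x a 1 GiB target batch
    long extraRequestedBytes = 0;
    boolean madeSpillable = false;

    long estimateNeededBytes() {
        // The operator's own estimate for the step at hand (e.g. a large join).
        return 6L * 1024 * 1024 * 1024;
    }

    void makeAllTaskMemorySpillable() {
        // Required before asking for more: while this task waits, the GMLM may
        // need its memory to satisfy higher-priority tasks, so everything it
        // holds must be able to spill.
        madeSpillable = true;
    }

    void acquireExtraIfNeeded() {
        long needed = estimateNeededBytes();
        if (needed > currentLeaseBytes) {
            makeAllTaskMemorySpillable();
            // In the real design this would be a GMLM.requestLease(...) call.
            extraRequestedBytes = needed - currentLeaseBytes;
        }
    }
}
```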
When the GMLM receives a request and there is enough memory to fulfill it, the GMLM should provide a lease to the task for the desired amount of memory, ideally without blocking.
When there are more requests for memory than there is memory to fulfill them, the GMLM will need to decide which tasks should be allowed to continue and which must wait. As this is not a simple problem to solve, for the time being I would propose a FIFO pattern: the first task to ask for memory is the first task allowed to run when enough memory becomes available. As new tasks arrive with requests that cannot be satisfied, all of their previously granted leases are made available to satisfy higher-priority tasks. This is why all task memory must be made spillable before requesting a lease. When a lease is closed, that memory also becomes available for pending tasks to use in FIFO/priority order. In the future we may have an explicit priority for a task, which would fit in well with this priority-queue model.
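The FIFO policy above could be sketched with a simple monitor: requests that cannot be satisfied wait in arrival order, and freed memory wakes waiters so the head of the queue goes first. This is an illustration of the ordering policy under assumed names, not production synchronization code:

```java
import java.util.ArrayDeque;

// Sketch of FIFO arbitration: only the oldest waiter may take memory,
// so a small request arriving later cannot jump ahead of a large one.
final class FifoLeaseManager {
    private long availableBytes;
    private final ArrayDeque<Thread> waiters = new ArrayDeque<>();

    FifoLeaseManager(long totalBytes) { this.availableBytes = totalBytes; }

    synchronized void acquire(long bytes) throws InterruptedException {
        Thread self = Thread.currentThread();
        waiters.addLast(self);
        // Wait until this thread is at the head of the queue AND the
        // request fits; this preserves strict FIFO order.
        while (waiters.peekFirst() != self || bytes > availableBytes) {
            wait();
        }
        waiters.removeFirst();
        availableBytes -= bytes;
        notifyAll(); // let the next waiter in line re-check
    }

    synchronized void release(long bytes) {
        availableBytes += bytes;
        notifyAll();
    }

    synchronized long available() { return availableBytes; }
}
```

A real implementation would also need the reclamation step described above (spilling the memory of queued tasks), which this sketch omits.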
If the total requested memory is more than the GPU could ever satisfy, then the GMLM should treat the request as if it were asking for the entire GPU, and warn loudly that it is doing so. This gives the task a chance to succeed in case it overestimated the amount of memory it would need.
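The clamping behavior is small enough to show directly; the names here are hypothetical:

```java
// Sketch: a request larger than the GPU could ever satisfy is clamped to
// the whole GPU, with a loud warning, rather than failing outright.
final class RequestClamp {
    static long clamp(long requestedBytes, long totalGpuBytes) {
        if (requestedBytes > totalGpuBytes) {
            System.err.println("WARNING: requested " + requestedBytes +
                " bytes but the GPU only has " + totalGpuBytes +
                "; granting the entire GPU in case the estimate was high");
            return totalGpuBytes;
        }
        return requestedBytes;
    }
}
```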
A task will automatically request a lease for 4 * target batch size when it acquires the GpuSemaphore. When the semaphore is released, that original lease will also be released. This amount is for backwards compatibility with existing code that has the assumption hard coded. In the future this amount may change, so the GMLM should be queried for the current value rather than deriving it from the target batch size.
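Coupling the default lease to the semaphore could be sketched as below. The class, the 4x multiplier's placement, and the accounting are assumptions about how the pieces would fit together, using a plain java.util.concurrent.Semaphore as a stand-in for the GpuSemaphore:

```java
import java.util.concurrent.Semaphore;

// Sketch: acquiring the GPU slot also takes a default 4x-target-batch-size
// lease, and releasing the slot gives the lease back. Callers should query
// defaultLeaseBytes() instead of hard-coding the 4x assumption themselves.
final class SemaphoreWithLease {
    private final Semaphore gpuSlots;
    private final long defaultLeaseBytes;
    private long leasedBytesOut = 0;

    SemaphoreWithLease(int concurrentTasks, long targetBatchBytes) {
        this.gpuSlots = new Semaphore(concurrentTasks);
        this.defaultLeaseBytes = 4 * targetBatchBytes; // may change in the future
    }

    long defaultLeaseBytes() { return defaultLeaseBytes; }

    void acquire() throws InterruptedException {
        gpuSlots.acquire();
        synchronized (this) { leasedBytesOut += defaultLeaseBytes; }
    }

    void release() {
        synchronized (this) { leasedBytesOut -= defaultLeaseBytes; }
        gpuSlots.release();
    }

    synchronized long leasedBytesOut() { return leasedBytesOut; }
}
```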