[FEA] Implement OOM retry framework #7253
Labels
- feature request: New feature or request
- reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
Is your feature request related to a problem? Please describe.
Currently, memory on the GPU is managed mostly by convention and by the GpuSemaphore. The GpuSemaphore allows a configured number of tasks onto the GPU at any one point in time, but it does not explicitly track or hand out memory to these tasks. By convention, different execution paths assume that they can use 4x the target batch size without any issues, and also assume that the input batch size is <= the target batch size. There is also no way to request more memory if an operation knows that it will use more memory than is currently available.
Describe the solution you'd like
Create a GpuMemoryLeaseManager (GMLM), or update the GpuSemaphore, to provide the following APIs.
A MemoryLease would be AutoCloseable and would return the memory to the GMLM when it is closed. The GMLM is an arbitrator: it is not intended to actually allocate any memory, just to reduce the load on the GPU if multiple operations would need more memory than is currently available. This matters for cases like a join or a window operation, where today we cannot guarantee that the memory used will stay under the 4x batch size limit. The goal is to eventually update all operators so that the limit is not set by convention, but is a value that can dynamically change if needed.
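To make the proposed API concrete, here is a minimal, hypothetical sketch of what MemoryLease and the GMLM could look like. All names and signatures here are assumptions for illustration, not the final design, and the sketch simply grants or rejects instead of queuing:

```java
// Hypothetical sketch only: the manager tracks a byte budget, it never
// allocates GPU memory itself. Names/signatures are assumptions.
final class MemoryLease implements AutoCloseable {
    private final GpuMemoryLeaseManager manager;
    private final long bytes;
    private boolean closed = false;

    MemoryLease(GpuMemoryLeaseManager manager, long bytes) {
        this.manager = manager;
        this.bytes = bytes;
    }

    long leasedBytes() { return bytes; }

    @Override
    public void close() {
        // Returning the memory to the arbitrator when the lease is closed.
        if (!closed) {
            closed = true;
            manager.release(bytes);
        }
    }
}

final class GpuMemoryLeaseManager {
    private long availableBytes; // budget only; no real allocation happens here

    GpuMemoryLeaseManager(long totalBytes) { this.availableBytes = totalBytes; }

    synchronized MemoryLease requestLease(long bytes) {
        // A real implementation would queue the request; the sketch just fails.
        if (bytes > availableBytes) {
            throw new IllegalStateException("request would need to queue: " + bytes);
        }
        availableBytes -= bytes;
        return new MemoryLease(this, bytes);
    }

    synchronized void release(long bytes) { availableBytes += bytes; }

    synchronized long available() { return availableBytes; }
}
```

Because the lease is AutoCloseable, operators could hold it in a try-with-resources block so the memory is handed back even on failure paths.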
This is not intended to replace the efforts we have made for out of core algorithms. Those are still needed even on very large memory GPUs because CUDF still has column size limitations.
When a SparkPlan node wants to run on the GPU, it will ask the GMLM what its current budget is. It will also estimate how much memory it will need to complete the operation at hand. If the memory needed is more than the current lease, another lease for more memory will be requested. In order to make that request, the SparkPlan node must first make sure that all of the memory it is currently using is spillable.
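The flow an operator might follow is sketched below. Everything here is a placeholder for APIs that do not exist yet (the estimate, the spillable step, and the shortfall computation are all assumptions about how an operator would use the GMLM):

```java
// Hypothetical operator-side protocol: check budget, estimate, make memory
// spillable, then request only the shortfall. Sizes are example values.
final class OperatorFlowSketch {
    long currentLeaseBytes = 4L * 1024 * 1024 * 1024; // e.g. 4x a 1 GiB target batch
    long extraRequestedBytes = 0;
    boolean madeSpillable = false;

    long estimateNeededBytes() {
        // The operator's own estimate for the step at hand (e.g. a large join).
        return 6L * 1024 * 1024 * 1024;
    }

    void makeAllTaskMemorySpillable() {
        // Required before asking for more: while this task waits, the GMLM may
        // need its memory to satisfy higher-priority tasks, so everything it
        // holds must be able to spill.
        madeSpillable = true;
    }

    void acquireExtraIfNeeded() {
        long needed = estimateNeededBytes();
        if (needed > currentLeaseBytes) {
            makeAllTaskMemorySpillable();
            // In the real design this would be a GMLM.requestLease(...) call.
            extraRequestedBytes = needed - currentLeaseBytes;
        }
    }
}
```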
When the GMLM receives a request and there is enough memory to fulfill it, the GMLM should provide a lease to the task for the desired amount of memory, ideally without blocking.
When there are more requests for memory than there is memory to fulfill them, the GMLM will need to decide which tasks should be allowed to continue and which must wait. As this is not a simple problem to solve, for the time being I would propose a FIFO pattern: the first task to ask for memory is the first task allowed to run when enough memory becomes available. As new tasks arrive with requests that cannot be satisfied, all of their previously granted leases are made available to satisfy higher-priority tasks. This is why all task memory must be made spillable before requesting a lease. When a lease is closed, that memory also becomes available for pending tasks to use in FIFO/priority order. In the future we may have an explicit priority for a task, which would fit in well with this priority-queue model.
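The FIFO policy above could be sketched with a simple monitor: requests that cannot be satisfied wait in arrival order, and freed memory wakes waiters so the head of the queue goes first. This is an illustration of the ordering policy under assumed names, not production synchronization code:

```java
import java.util.ArrayDeque;

// Sketch of FIFO arbitration: only the oldest waiter may take memory,
// so a small request arriving later cannot jump ahead of a large one.
final class FifoLeaseManager {
    private long availableBytes;
    private final ArrayDeque<Thread> waiters = new ArrayDeque<>();

    FifoLeaseManager(long totalBytes) { this.availableBytes = totalBytes; }

    synchronized void acquire(long bytes) throws InterruptedException {
        Thread self = Thread.currentThread();
        waiters.addLast(self);
        // Wait until this thread is at the head of the queue AND the
        // request fits; this preserves strict FIFO order.
        while (waiters.peekFirst() != self || bytes > availableBytes) {
            wait();
        }
        waiters.removeFirst();
        availableBytes -= bytes;
        notifyAll(); // let the next waiter in line re-check
    }

    synchronized void release(long bytes) {
        availableBytes += bytes;
        notifyAll();
    }

    synchronized long available() { return availableBytes; }
}
```

A real implementation would also need the reclamation step described above (spilling the memory of queued tasks), which this sketch omits.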
If the total requested memory is more than the GPU could ever satisfy, then the GMLM should treat the request as if it were asking for the entire GPU, and warn loudly that it is doing so. This gives the task a chance to succeed in case it overestimated the amount of memory it would need.
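The clamping behavior is small enough to show directly; the names here are hypothetical:

```java
// Sketch: a request larger than the GPU could ever satisfy is clamped to
// the whole GPU, with a loud warning, rather than failing outright.
final class RequestClamp {
    static long clamp(long requestedBytes, long totalGpuBytes) {
        if (requestedBytes > totalGpuBytes) {
            System.err.println("WARNING: requested " + requestedBytes +
                " bytes but the GPU only has " + totalGpuBytes +
                "; granting the entire GPU in case the estimate was high");
            return totalGpuBytes;
        }
        return requestedBytes;
    }
}
```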
A task will automatically request a lease for 4 * target batch size when it acquires the GpuSemaphore. When the semaphore is released, that original lease will also be released. This amount is for backwards compatibility with existing code that has the assumption hard coded. In the future this amount may change, so the GMLM should be queried for the current value rather than deriving it from the target batch size.
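Coupling the default lease to the semaphore could be sketched as below. The class, the 4x multiplier's placement, and the accounting are assumptions about how the pieces would fit together, using a plain java.util.concurrent.Semaphore as a stand-in for the GpuSemaphore:

```java
import java.util.concurrent.Semaphore;

// Sketch: acquiring the GPU slot also takes a default 4x-target-batch-size
// lease, and releasing the slot gives the lease back. Callers should query
// defaultLeaseBytes() instead of hard-coding the 4x assumption themselves.
final class SemaphoreWithLease {
    private final Semaphore gpuSlots;
    private final long defaultLeaseBytes;
    private long leasedBytesOut = 0;

    SemaphoreWithLease(int concurrentTasks, long targetBatchBytes) {
        this.gpuSlots = new Semaphore(concurrentTasks);
        this.defaultLeaseBytes = 4 * targetBatchBytes; // may change in the future
    }

    long defaultLeaseBytes() { return defaultLeaseBytes; }

    void acquire() throws InterruptedException {
        gpuSlots.acquire();
        synchronized (this) { leasedBytesOut += defaultLeaseBytes; }
    }

    void release() {
        synchronized (this) { leasedBytesOut -= defaultLeaseBytes; }
        gpuSlots.release();
    }

    synchronized long leasedBytesOut() { return leasedBytesOut; }
}
```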