
[BUG] ASYNC: the spill store needs to synchronize on spills against the allocating stream #4818

Closed
abellina opened this issue Feb 17, 2022 · 3 comments · Fixed by #5485
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

abellina (Collaborator) commented on Feb 17, 2022

When we register a buffer with the spill framework, the CPU is not necessarily synchronized with the GPU stream that produced the buffer, and the GPU stream that will eventually free it is not synchronized either. This is by design, and it is a good feature to have: we are tracking handles to GPU memory, and the state of the GPU streams is not something we are concerned with at registration time.

At spill time it is a different story. Since spills happen from a task thread that ran out of memory, the freeing CUDA stream may be (and often is) different from the allocating stream (the stream that registered the buffer to begin with). This means we may ask RMM or CUDA to free a pointer whose producing work has not yet completed.

With the recent move to the ASYNC allocator it looks like it can be even worse: in the extreme, we could allocate a buffer asynchronously and turn around and spill it before the allocation has even completed.

Given PTDS, we cannot reach the allocating stream to introduce a synchronize, since the application only sees a stream id that CUDA manages automatically (it is the same id for all threads from the application's point of view). A thread-local CUDA event could be used to manage this instead: at buffer registration, the event records the allocating stream's work up to that point, and if the buffer later needs to be spilled by a thread other than the producing thread, the spilling thread synchronizes on the producing thread's CUDA event (which the RapidsBuffer needs to hold a reference to). A sketch of this idea follows.
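
A minimal sketch of that event-based idea in CUDA C++; the type and function names are illustrative assumptions, not the actual spark-rapids implementation:

```cpp
#include <cuda_runtime.h>

// Hypothetical handle tracked by the spill store for a registered buffer.
struct RegisteredBuffer {
  void* ptr;
  cudaEvent_t ready_event;  // records allocating-stream work at registration time
};

// Called on the producing thread when the buffer is registered.
inline cudaError_t record_registration(RegisteredBuffer& buf, cudaStream_t allocating_stream) {
  cudaError_t err = cudaEventCreateWithFlags(&buf.ready_event, cudaEventDisableTiming);
  if (err != cudaSuccess) return err;
  // Capture all work issued on the allocating stream up to this point.
  return cudaEventRecord(buf.ready_event, allocating_stream);
}

// Called on the spilling thread before the buffer is copied out or freed.
inline cudaError_t wait_for_producer(const RegisteredBuffer& buf) {
  // Block the host until the producing stream has reached the recorded point.
  return cudaEventSynchronize(buf.ready_event);
}
```

A non-blocking variant could instead call `cudaStreamWaitEvent` on the spilling stream, making the spill work wait on the event without blocking the host thread.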

This is related to #4710.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Feb 17, 2022
@abellina abellina changed the title [BUG] the spill store needs to synchronize on frees against the allocating stream [BUG] the spill store needs to synchronize on spills against the allocating stream Feb 17, 2022
@abellina abellina added the P0 Must have for release label Feb 18, 2022
@abellina abellina self-assigned this Feb 18, 2022
@jlowe jlowe removed the ? - Needs Triage Need team to review and classify label Feb 22, 2022
@abellina abellina changed the title [BUG] the spill store needs to synchronize on spills against the allocating stream [BUG] ASYNC: the spill store needs to synchronize on spills against the allocating stream Mar 14, 2022
abellina (Collaborator, Author) commented:

This is specific to the ASYNC allocator and should otherwise be a non-issue. The ARENA allocator makes sure that frees wait for the freeing stream to finish its work before the memory is reused. It does so by calling `cudaStreamSynchronize` on the freeing stream before finishing the deallocation call.
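
For illustration, the sync-on-free pattern described above looks roughly like this; it is a sketch of the behavior, not RMM's actual arena code:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Sketch of a deallocation path that synchronizes the freeing stream before
// returning the block for reuse (illustrative, not RMM's implementation).
void arena_style_deallocate(void* ptr, std::size_t size, cudaStream_t freeing_stream) {
  // Wait for all work previously enqueued on the freeing stream, so the block
  // is truly idle before it goes back on the free list.
  cudaStreamSynchronize(freeing_stream);
  // ... return (ptr, size) to the pool's free list here ...
  (void)ptr;
  (void)size;
}
```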

I believe we should turn this issue into testing high-spill situations and double-checking with the CUDA team that our use case is covered by the reuse features in ASYNC (a sketch of enabling them follows this list):

  • `cudaMemPoolReuseAllowOpportunistic` for reuse after the driver detects that a freeing stream A is done, so the freed memory is available, and
  • `cudaMemPoolReuseAllowInternalDependencies` for when a freeing stream A is not yet done, in which case the driver imposes stream ordering so another stream can reuse the memory after stream A completes.
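
Roughly, enabling these on the default async pool looks like the following; this is a sketch assuming a CUDA version with the stream-ordered allocator, not our actual configuration code:

```cpp
#include <cuda_runtime.h>

// Enable both reuse policies on the device's default async memory pool.
cudaError_t enable_async_reuse(int device_id) {
  cudaMemPool_t pool;
  cudaError_t err = cudaDeviceGetDefaultMemPool(&pool, device_id);
  if (err != cudaSuccess) return err;

  int enable = 1;
  // Reuse freed memory once the driver observes the freeing stream has finished.
  err = cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &enable);
  if (err != cudaSuccess) return err;
  // Otherwise insert stream dependencies so reuse waits on the freeing stream.
  return cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowInternalDependencies, &enable);
}
```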

abellina (Collaborator, Author) commented:

We have run for a while now with ASYNC in high-spill, high-stream-count situations with UCX, where we end up running out of memory. I also cannot reproduce any issues without UCX on ASYNC on a constrained run (limited to 20GB of a 40GB GPU) of NDS q72 at 3TB with a single GPU. This is with CUDA 11.5, so we have all of the reuse capabilities enabled per RMM.

Other than this type of test, I do not know what else to try, so I am going to close this. We can reopen it if we see issues or if others disagree.

@abellina abellina reopened this Apr 29, 2022
abellina (Collaborator, Author) commented:

Chatting with @jlowe, he brought up questions that I had missed. He pointed out that we may still have outstanding work (kernel launches specifically) on either the allocating stream or the freeing stream that the allocator's guarantees cannot possibly cover.

@sameerz sameerz added this to the May 2 - May 20 milestone Apr 29, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue May 12, 2022
Running tests locally, but putting this up as WIP for now.

Discussing with @jlowe, a solution to NVIDIA/spark-rapids#4818 could involve `cudaDeviceSynchronize`. I noticed that is not among our JNI-exposed calls, so I am adding it here. (A rough sketch of such a binding follows below.)

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #10839
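
For reference, a bare-bones sketch of what such a JNI binding can look like, assuming the Java entry point is `ai.rapids.cudf.Cuda.deviceSynchronize`; the real change lives in the cudf JNI layer and may differ in naming and error handling:

```cpp
#include <cuda_runtime.h>
#include <jni.h>

// Illustrative JNI binding only; not the actual cudf implementation.
extern "C" JNIEXPORT void JNICALL
Java_ai_rapids_cudf_Cuda_deviceSynchronize(JNIEnv* env, jclass) {
  // Block the calling host thread until all previously issued GPU work on the
  // current device has completed, regardless of which stream issued it.
  if (cudaDeviceSynchronize() != cudaSuccess) {
    env->ThrowNew(env->FindClass("java/lang/RuntimeException"),
                  "cudaDeviceSynchronize failed");
  }
}
```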