
[BUG] ASYNC: the spill store needs to synchronize on spills against the allocating stream #4818

Closed
abellina opened this issue Feb 17, 2022 · 3 comments · Fixed by #5485
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

abellina (Collaborator) commented on Feb 17, 2022

When we register a buffer with the spill framework, the CPU is not necessarily synchronized with the GPU stream that produced the buffer, and the GPU stream that will eventually free it is not synchronized either. This is by design, and it is a good feature to have: we are tracking handles to GPU memory, and the state of the GPU streams is not something we are concerned with at registration time.

At spill time it is a different story. Since spills happen from a task thread that ran out of memory, the freeing CUDA stream may be (and often is) different from the allocating stream (the stream that registered the buffer to begin with). This means we may ask RMM or CUDA to free a pointer whose producing work has not yet completed.

With the recent move to the ASYNC allocator it looks like it can be even worse: in the extreme, we could allocate a buffer asynchronously and turn around and spill it before the allocation has even completed.

Given PTDS, we cannot reach the allocating stream to introduce a synchronize, since the application only sees a stream id that CUDA manages automatically (it is the same id for all threads from the application's point of view). A thread-local CUDA event could be used to manage this instead: at buffer registration, the event records the allocating stream's work up to that point, and if the buffer later needs to be spilled by a thread other than the producing thread, the spilling thread synchronizes on the producing thread's CUDA event (which the RapidsBuffer needs to hold a reference to). A sketch of this idea follows.
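
A minimal sketch of that event-based idea in CUDA C++; the type and function names are illustrative assumptions, not the actual spark-rapids implementation:

```cpp
#include <cuda_runtime.h>

// Hypothetical handle tracked by the spill store for a registered buffer.
struct RegisteredBuffer {
  void* ptr;
  cudaEvent_t ready_event;  // records allocating-stream work at registration time
};

// Called on the producing thread when the buffer is registered.
inline cudaError_t record_registration(RegisteredBuffer& buf, cudaStream_t allocating_stream) {
  cudaError_t err = cudaEventCreateWithFlags(&buf.ready_event, cudaEventDisableTiming);
  if (err != cudaSuccess) return err;
  // Capture all work issued on the allocating stream up to this point.
  return cudaEventRecord(buf.ready_event, allocating_stream);
}

// Called on the spilling thread before the buffer is copied out or freed.
inline cudaError_t wait_for_producer(const RegisteredBuffer& buf) {
  // Block the host until the producing stream has reached the recorded point.
  return cudaEventSynchronize(buf.ready_event);
}
```

A non-blocking variant could instead call `cudaStreamWaitEvent` on the spilling stream, making the spill work wait on the event without blocking the host thread.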

This is related to #4710.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Feb 17, 2022
@abellina abellina changed the title [BUG] the spill store needs to synchronize on frees against the allocating stream [BUG] the spill store needs to synchronize on spills against the allocating stream Feb 17, 2022
@abellina abellina added the P0 Must have for release label Feb 18, 2022
@abellina abellina self-assigned this Feb 18, 2022
@jlowe jlowe removed the ? - Needs Triage Need team to review and classify label Feb 22, 2022
@abellina abellina changed the title [BUG] the spill store needs to synchronize on spills against the allocating stream [BUG] ASYNC: the spill store needs to synchronize on spills against the allocating stream Mar 14, 2022
abellina (Collaborator, Author) commented:

This is specific to the ASYNC allocator and should otherwise be a non-issue. The ARENA allocator makes sure that frees wait for the freeing stream to finish its work before the memory is reused. It does so by calling `cudaStreamSynchronize` on the freeing stream before finishing the deallocation call.
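
For illustration, the sync-on-free pattern described above looks roughly like this; it is a sketch of the behavior, not RMM's actual arena code:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Sketch of a deallocation path that synchronizes the freeing stream before
// returning the block for reuse (illustrative, not RMM's implementation).
void arena_style_deallocate(void* ptr, std::size_t size, cudaStream_t freeing_stream) {
  // Wait for all work previously enqueued on the freeing stream, so the block
  // is truly idle before it goes back on the free list.
  cudaStreamSynchronize(freeing_stream);
  // ... return (ptr, size) to the pool's free list here ...
  (void)ptr;
  (void)size;
}
```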

I believe we should turn this issue into testing high-spill situations and double-checking with the CUDA team that our use case is covered by the reuse features in ASYNC (a sketch of enabling them follows this list):

  • `cudaMemPoolReuseAllowOpportunistic` for reuse after the driver detects that a freeing stream A is done, so the freed memory is available, and
  • `cudaMemPoolReuseAllowInternalDependencies` for when a freeing stream A is not yet done, in which case the driver imposes stream ordering so another stream can reuse the memory after stream A completes.
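
Roughly, enabling these on the default async pool looks like the following; this is a sketch assuming a CUDA version with the stream-ordered allocator, not our actual configuration code:

```cpp
#include <cuda_runtime.h>

// Enable both reuse policies on the device's default async memory pool.
cudaError_t enable_async_reuse(int device_id) {
  cudaMemPool_t pool;
  cudaError_t err = cudaDeviceGetDefaultMemPool(&pool, device_id);
  if (err != cudaSuccess) return err;

  int enable = 1;
  // Reuse freed memory once the driver observes the freeing stream has finished.
  err = cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &enable);
  if (err != cudaSuccess) return err;
  // Otherwise insert stream dependencies so reuse waits on the freeing stream.
  return cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowInternalDependencies, &enable);
}
```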

abellina (Collaborator, Author) commented:

We have run for a while now with ASYNC in high-spill, high-stream-count situations with UCX, where we end up running out of memory. I also cannot reproduce any issues without UCX on ASYNC on a constrained run (limited to 20GB of a 40GB GPU) of NDS q72 at 3TB with a single GPU. This is with CUDA 11.5, so we have all of the reuse capabilities enabled per RMM.

Other than this type of test, I do not know what else to try, so I am going to close this. We can reopen it if we see issues or if others disagree.

@abellina abellina reopened this Apr 29, 2022
abellina (Collaborator, Author) commented:

Chatting with @jlowe, he brought up questions that I had missed. He pointed out that we may still have outstanding work (kernel launches specifically) on either the allocating stream or the freeing stream that the allocator's guarantees cannot possibly cover.

@sameerz sameerz added this to the May 2 - May 20 milestone Apr 29, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue May 12, 2022
Running tests locally, but putting this up as WIP for now.

Discussing with @jlowe, a solution to NVIDIA/spark-rapids#4818 could involve `cudaDeviceSynchronize`. I noticed that is not among our JNI-exposed calls, so I am adding it here. (A rough sketch of such a binding follows below.)

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #10839
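
For reference, a bare-bones sketch of what such a JNI binding can look like, assuming the Java entry point is `ai.rapids.cudf.Cuda.deviceSynchronize`; the real change lives in the cudf JNI layer and may differ in naming and error handling:

```cpp
#include <cuda_runtime.h>
#include <jni.h>

// Illustrative JNI binding only; not the actual cudf implementation.
extern "C" JNIEXPORT void JNICALL
Java_ai_rapids_cudf_Cuda_deviceSynchronize(JNIEnv* env, jclass) {
  // Block the calling host thread until all previously issued GPU work on the
  // current device has completed, regardless of which stream issued it.
  if (cudaDeviceSynchronize() != cudaSuccess) {
    env->ThrowNew(env->FindClass("java/lang/RuntimeException"),
                  "cudaDeviceSynchronize failed");
  }
}
```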