[BUG] ASYNC: the spill store needs to synchronize on spills against the allocating stream #4818
This is specific to the ASYNC allocator and should become a non-issue. The ARENA allocator makes sure that frees wait for the freeing stream to finish its work before reusing the memory. I believe we need to turn this issue into just testing high-spill situations and double checking with the CUDA team that our use case here is covered by the reuse features in ASYNC:
> We have run for a while now with ASYNC in high-spill and high stream count situations with UCX, where we end up running out of memory. I also can't reproduce any issues without UCX on ASYNC on a constrained (limited to 20GB of a 40GB GPU) run of NDS q72 with a single GPU at 3TB. This is with CUDA 11.5, so we have all of the reuse capabilities enabled per RMM. Other than this type of test, I do not know what else to try, so I am going to close this. We can reopen it if we see issues or if others disagree.
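For context on the reuse capabilities mentioned above: with the ASYNC allocator they correspond to attributes on the cudaMallocAsync memory pool. The sketch below only shows where those knobs live (they are enabled by default on recent CUDA versions); it is illustrative and not the RMM or spark-rapids configuration code.

```cuda
// Sketch: the ASYNC (cudaMallocAsync) pool reuse attributes referenced above.
// Illustrative only; RMM's cuda_async_memory_resource manages its own pool.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  cudaMemPool_t pool;
  cudaDeviceGetDefaultMemPool(&pool, device);

  // Reuse freed memory only once event/stream dependencies guarantee the
  // freeing work has completed -- this is what makes cross-stream reuse safe.
  int enable = 1;
  cudaMemPoolSetAttribute(pool, cudaMemPoolReuseFollowEventDependencies, &enable);
  cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &enable);
  cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowInternalDependencies, &enable);

  printf("ASYNC pool reuse attributes enabled\n");
  return 0;
}
```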
Chatting with @jlowe, he brought up questions that I had missed. He pointed out that we may still have missing state in either the allocating stream or the freeing stream (kernel launches specifically) that the allocator's guarantees can't possibly satisfy.
Running tests locally, but putting this up as WIP for now. Discussing with @jlowe, a solution to NVIDIA/spark-rapids#4818 could involve `cudaDeviceSynchronize`. I noticed that's not in our JNI exposed calls, so I am adding it here.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #10839
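As a rough illustration (not the actual cudf binding; see the PR above for that), a JNI wrapper around `cudaDeviceSynchronize` is conceptually a thin shim like the following, with the Java class path and exception type assumed for the sake of the example:

```cuda
// Sketch of a JNI binding that exposes cudaDeviceSynchronize to Java.
// Class path and exception type are assumptions for illustration; the real
// binding added in the PR above may differ.
#include <jni.h>
#include <cuda_runtime.h>

extern "C" JNIEXPORT void JNICALL
Java_ai_rapids_cudf_Cuda_deviceSynchronize(JNIEnv* env, jclass) {
  // Block the calling host thread until all work previously issued to the
  // device -- on every stream, from every thread -- has completed.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    jclass ex = env->FindClass("java/lang/RuntimeException");
    if (ex != nullptr) {
      env->ThrowNew(ex, cudaGetErrorString(err));
    }
  }
}
```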
When we register a buffer against the spill framework, the CPU may not be synchronized with the GPU stream that is adding the buffer. The GPU stream that is doing the freeing is also not synchronized. This is by design, and it is a good feature to have. We are tracking handles to GPU memory, but the state of the GPU streams is not something we are as concerned with at registration time.
At spill time it's a different story. Since spills happen from a task thread that hit OOM, the freeing CUDA stream may be (and often is) different from the allocating stream (the stream that registered the buffer to begin with). This may mean we ask RMM or CUDA to free a pointer whose producing work hasn't finished yet.
In the recent move to the ASYNC allocator it looks like it can be even worse: in the extreme, we could allocate a buffer asynchronously and turn around and spill it before the allocation has even completed.
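To make the hazard concrete, here is a minimal standalone CUDA sketch (not spark-rapids code): the buffer is allocated and filled on one stream and freed on another with no ordering between them, so the pool may hand the memory back out while the producing kernel is still running.

```cuda
// Sketch of the cross-stream free hazard with cudaMallocAsync/cudaFreeAsync.
// Illustrative only; `produce` stands in for whatever work fills the buffer.
#include <cuda_runtime.h>

__global__ void produce(int* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = i;
}

int main() {
  cudaStream_t allocStream, spillStream;
  cudaStreamCreate(&allocStream);
  cudaStreamCreate(&spillStream);

  int n = 1 << 20;
  int* buf = nullptr;
  cudaMallocAsync(&buf, n * sizeof(int), allocStream);
  produce<<<(n + 255) / 256, 256, 0, allocStream>>>(buf, n);

  // HAZARD: the free is issued on a different stream with no dependency on
  // allocStream, so the pool is allowed to reuse this memory for another
  // allocation while `produce` may still be writing to it.
  cudaFreeAsync(buf, spillStream);

  cudaDeviceSynchronize();
  cudaStreamDestroy(allocStream);
  cudaStreamDestroy(spillStream);
  return 0;
}
```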
Given PTDS (per-thread default streams), we can't reach into the allocating stream to introduce a synchronization, since the application only knows about a stream ID that is automatically managed by CUDA (it is the same ID for all threads from the application's point of view). A thread-local CUDA event could be used to manage this: at buffer registration, the event records the producing stream's work up to that point. If the buffer needs to be spilled and the spilling thread is different from the producing thread, we could then synchronize on the producing thread's CUDA event (which the `RapidsBuffer` needs to point to).

This is related to #4710.
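A minimal sketch of that event idea in raw CUDA terms (the names below are illustrative, not the actual RapidsBuffer code): record an event on the producing stream when the buffer is registered, then have the spilling thread order its own stream after that event before it copies out or frees the memory.

```cuda
// Sketch of event-based ordering between the producing stream and a spill
// performed from another thread/stream. Names are illustrative only.
#include <cuda_runtime.h>

struct TrackedBuffer {
  void* data;
  cudaEvent_t readyEvent;  // captures the producing stream's work at registration
};

// Producing thread: call right after the buffer and the work that fills it
// have been enqueued on `producingStream` (the per-thread default stream).
void registerBuffer(TrackedBuffer& buf, cudaStream_t producingStream) {
  cudaEventCreateWithFlags(&buf.readyEvent, cudaEventDisableTiming);
  cudaEventRecord(buf.readyEvent, producingStream);
}

// Spilling thread: make `spillStream` wait on the recorded event so the copy
// and the free are ordered after the producer's work, then release the event.
void spillBuffer(TrackedBuffer& buf, cudaStream_t spillStream,
                 void* hostDst, size_t bytes) {
  cudaStreamWaitEvent(spillStream, buf.readyEvent, 0);
  cudaMemcpyAsync(hostDst, buf.data, bytes, cudaMemcpyDeviceToHost, spillStream);
  cudaFreeAsync(buf.data, spillStream);  // now stream-ordered after the producer
  cudaEventDestroy(buf.readyEvent);      // resources released once the event completes
}
```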