Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cudaStreamSynchronize when a new device buffer is added to the spill framework #5485

Merged
merged 6 commits into from
May 17, 2022

Conversation

abellina
Copy link
Collaborator

Closes #4818.

This is to fix the stream-ordered violation we could see currently in the spill framework: a buffer is allocated and acted on in stream A, and added to the store without any synchronization. Then stream B runs OOM and spills such buffer, not having waited on stream A.

The idea here is that if a device buffer is added to the store, we force in most cases a call to cudaStreamSynchronize, which should take care of the cases where we return from cuDF and have not stream synchronized there. There is a lot of stream synchronization in cuDF (via thrift usually) so it seems like this is an unlikely issue. That said, adding this synchronize, would be more robust as now the spill framework knows for a fact that the buffer can be freely copied (spilled).

Note that I added a flag to not synchronize in some cases, this is used for UCX since we synchronize on writes after compressing buffers, and also on reads after copying from the bounce buffer, so those two cases we can reliably say we have synchronized. I don't know of other cases where the synchronize in the spill framework can be skipped.

I have been measuring the impact of this in our benchmarks. I do not see any big differences so far in UCX or non-UCX runs. I am going to continue testing it today. Adding as WIP for comments.

@abellina abellina added the bug Something isn't working label May 13, 2022
@abellina abellina added this to the May 2 - May 20 milestone May 13, 2022
@abellina abellina requested a review from jlowe May 13, 2022 14:49
jlowe
jlowe previously approved these changes May 13, 2022
@abellina
Copy link
Collaborator Author

The fix 5ad4d43 helps with some queries I saw an obvious regression (NDS q20 specifically). This makes it so that contig split is followed by a sync in the write path for the RapidsShuffleManager. I am still not done assessing the performance impacts of this, I should have numbers Monday.

@abellina abellina marked this pull request as ready for review May 16, 2022 16:01
@abellina
Copy link
Collaborator Author

All tests I have executed with this PR show results in the noise, so I do not see blockers.

@abellina
Copy link
Collaborator Author

build

jlowe
jlowe previously approved these changes May 16, 2022
@abellina
Copy link
Collaborator Author

ok I have related test failures, due to some mock object matchers I had used. I'll fix that soon

@abellina
Copy link
Collaborator Author

build

@abellina abellina merged commit 10893db into NVIDIA:branch-22.06 May 17, 2022
@abellina abellina deleted the spill/async_alloc_fixes branch May 17, 2022 16:46
anthony-chang pushed a commit to anthony-chang/spark-rapids that referenced this pull request May 17, 2022
…ill framework (NVIDIA#5485)

* Adds a stream synchronize in addBuffer to ensure we safely spill

* Small cleanup in copyBuffer, add note about createBuffer synchronation requirements

Signed-off-by: Alessandro Bellina <[email protected]>

* Remove extra nvtx range

* When adding contiguous_split buffers in RapidsShuffleManager, synchronize once

* Fix RapidsShuffleTestHelper

* Fix RapidsShuffleClientSuite
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] ASYNC: the spill store needs to synchronize on spills against the allocating stream
2 participants