[BUG] Shouldn't need to synchronize on creation of DeviceBuffer even in the default stream #313
From my perspective this is the concerning situation:

```python
hb = np.ones((n,), dtype="u1")
hb_ptr = hb.__array_interface__["data"][0]
hb_size = hb.nbytes

rdb = rmm.DeviceBuffer(ptr=hb_ptr, size=hb_size)
hb[:] = 2
# what is `rdb` here?
```

Now it may be that synchronization already occurs here for other reasons (I know I'm not the expert here, so please weigh in). However, if it does already occur, it is no worse if we perform the synchronization ourselves. If not, we should either be synchronizing to ensure data validity, or we should be giving some thought to the APIs exposed to Python users (like is having a …)
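Concretely, `DeviceBuffer(ptr=..., size=...)` boils down to an asynchronous host-to-device copy enqueued on the default stream, so the question is whether the host array may be modified before that copy has finished. A minimal CUDA sketch of the conservative guard being discussed (names and sizes are illustrative, not RMM's internals):

```cpp
#include <cstddef>
#include <vector>

#include <cuda_runtime.h>

int main() {
  constexpr std::size_t n = 1 << 20;
  std::vector<unsigned char> hb(n, 1);  // pageable host memory, like the NumPy array

  void* db = nullptr;
  cudaMalloc(&db, n);

  // Roughly what the DeviceBuffer constructor does under the hood:
  // an async host-to-device copy enqueued on the default (null) stream.
  cudaMemcpyAsync(db, hb.data(), n, cudaMemcpyHostToDevice, /*stream=*/0);

  // For pageable host memory this copy is only *maybe* synchronous with
  // respect to the host, so writing to `hb` immediately would race with the
  // transfer. Synchronizing the stream first removes the ambiguity:
  cudaStreamSynchronize(0);
  hb.assign(n, 2);  // safe now: the device copy already holds the 1s

  cudaFree(db);
  return 0;
}
```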
I did some testing with the examples where we were seeing accuracy issues and wasn't able to reproduce them anymore. For my tests I used conda packages from today (0.13.0a200227), from two days ago (0.13.0a200225), and from ten days ago (0.13.0a200217). I wasn't able to reproduce the accuracy issues with any of the builds. I also didn't experience any performance regressions with #309, which is a good thing, but it mainly seems that the PR had no influence on the issue, which appears to have been fixed for some time already.
https://github.com/rapidsai/rmm/blob/branch-0.13/include/rmm/device_buffer.hpp#L120 The input is marked as `const`.
Not quite. When a C++ function marks its input as `const`, e.g.

```cpp
void do_stuff(foo const* f);
```

that's telling you that `do_stuff` won't modify what `f` points to, so you can pass it a pointer to either a `const` or a non-`const` object:

```cpp
foo const c_f;
foo f;

do_stuff(&c_f); // this is fine
do_stuff(&f);   // this is fine too
```

As a general rule (with some caveats) it's always okay to add `const`. Long story short, there's no reason to be casting to …
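In other words, a `foo*` converts to `foo const*` implicitly at the call site; only removing `const` requires an explicit cast. A tiny stand-alone illustration (the `foo`/`do_stuff` names are just the hypothetical ones from the comment above):

```cpp
struct foo {};

void do_stuff(foo const* f) { (void)f; /* would only read *f, never write it */ }

int main() {
  foo f;
  foo const c_f{};

  do_stuff(&f);    // foo*       -> foo const* : implicit, no cast needed
  do_stuff(&c_f);  // foo const* -> foo const* : fine

  // Only the reverse direction needs a cast, and it is almost always a mistake:
  // foo* p = const_cast<foo*>(&c_f);
  return 0;
}
```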
Thanks for the explanation @jrhemstad. Either way it seems like we're in a bit of a pickle where we're dependent on synchronous behavior for our … I would welcome ideas / suggestions here on a more elegant / performant solution.
To @jakirkham's point, this is a race condition.
Yes, which is what we're trying to guard against, ideally without synchronizing the entire device.
One option is to just always build libcudf with PTDS mode enabled.
That doesn't affect this Cython code that produces its own …
Can you specify the flag for Cython compilation?
Yes, we likely can, but does it work with CNMEM under the hood?
Not if you try to use CNMEM with multiple threads. If you're only using one thread then it's fine.
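For context, "PTDS mode" (per-thread default streams) is a compile-time switch: pass `--default-stream per-thread` to nvcc, or define `CUDA_API_PER_THREAD_DEFAULT_STREAM` before including the CUDA headers in host-compiled translation units. A minimal sketch of the effect, assuming nothing about libcudf's or RMM's actual build:

```cpp
// Compile with:  nvcc --default-stream per-thread ptds_demo.cu -o ptds_demo
// (nvcc-compiled .cu files need the flag; the macro only works for host-compiler
//  translation units because nvcc includes cuda_runtime.h implicitly.)
#include <thread>

#include <cuda_runtime.h>

__global__ void busy() {}  // placeholder kernel

int main() {
  auto work = [] {
    // With PTDS, an unspecified ("stream 0") launch goes to this thread's own
    // default stream, so launches from different threads can overlap instead
    // of serializing on the single legacy null stream.
    busy<<<1, 1>>>();
    cudaStreamSynchronize(cudaStreamPerThread);
  };
  std::thread t1(work), t2(work);
  t1.join();
  t2.join();
  return 0;
}
```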
Stepping back a moment, the real problem with using/syncing the default stream all over the place is that it is implicitly synchronous with other streams. However, there is a workaround for this that doesn't involve PTDS. If/when we want to create additional streams in libcudf or cuDF, we can specify on creation that they should not be synchronous with the default stream by using the `cudaStreamNonBlocking` flag:

```cpp
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);

kernelA<<<..., 0>>>(...);  // null stream
kernelB<<<..., s0>>>(...); // this cannot overlap with kernelA
kernelC<<<..., s1>>>(...); // this CAN overlap with kernelA
```
I believe ucx-py / distributed end up using some threads that may or may not create …
Thanks all for thinking about this. 🙂 What if we start relying on Numba/Dask-CUDA to provide a stream per worker, which we sync on instead? This came up a bit in issue (rapidsai/dask-cuda#96). Are there any challenges we foresee with this proposal?
We then need to be able to hand the Python stream abstraction through every library that we use, and that's not currently possible as far as I know. Additionally, it requires libcudf supporting explicitly passing a stream to its C++ APIs, which I'm not sure of the state of.
It's something we were hoping to avoid, but could consider. We were hoping to take the Thrust route and use futures instead of streams...
To the original question about the race condition in @jakirkham's snippet: what if we changed …
Thanks for the feedback. Yeah, it's good to think through the impact here. Agree that seems pretty large in scope. What if this is something that was tidily handled under-the-hood in RMM?

Yeah, that's a good point Mark. Using …
Well, @jrhemstad wrote …
I don't really think there's any observable difference in using `cudaMemcpy` vs. `cudaMemcpyAsync` on the default stream.
Yeah, as I expected, there's effectively no difference between `cudaMemcpy` and `cudaMemcpyAsync` on the default stream. See: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html#api-sync-behavior

So …
So it's not associating it with a non-default stream internally?
No.
@jrhemstad what if the …
Yep, …
Ok, so maybe I'm missing something. Is RMM thread-safe?
RMM is not really an entity that can or cannot be threadsafe. RMM is simply a facade over memory resources. All the resources we currently have are threadsafe, including … Though if/when we have an individual resource that is not threadsafe, it is simple enough to make it threadsafe through an adaptor.
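As a rough sketch of that adaptor idea (illustrative only: the `memory_resource` interface below is a stand-in, not RMM's actual base class), wrap any upstream resource and serialize every call through a mutex:

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical minimal interface standing in for a device memory resource.
struct memory_resource {
  virtual ~memory_resource() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void deallocate(void* p, std::size_t bytes) = 0;
};

// Adaptor that makes a possibly non-threadsafe upstream resource threadsafe
// by serializing every allocate/deallocate call through a mutex.
class thread_safe_adaptor final : public memory_resource {
 public:
  explicit thread_safe_adaptor(memory_resource* upstream) : upstream_{upstream} {}

  void* allocate(std::size_t bytes) override {
    std::lock_guard<std::mutex> lock{mtx_};
    return upstream_->allocate(bytes);
  }

  void deallocate(void* p, std::size_t bytes) override {
    std::lock_guard<std::mutex> lock{mtx_};
    upstream_->deallocate(p, bytes);
  }

 private:
  memory_resource* upstream_;
  std::mutex mtx_;
};
```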
Thanks for the clarifications, Jake. Sorry, I may have misunderstood something about RMM and multithreading previously.
@jakirkham You are probably remembering me saying that the memory_resources in #162 are not thread safe. They are not (we will wrap them in another resource that provides thread safety). But cnmem is internally thread safe, so the …
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
@kkraus14 still relevant?
I think @pentschev already fixed this in PR (#650).
We recently merged a PR that does a `cudaStreamSynchronize` on any `DeviceBuffer` creation on the default stream. This was primarily to fix a UCX-py issue we were seeing where we were sending memory before it was necessarily computed.

@jakirkham The more I think about this, the more I think we should be synchronizing before a send in UCX-py, but we shouldn't need to synchronize here. In general, calls should be enqueued onto the default stream, and we shouldn't need to synchronize here to guarantee correct results / behavior. Things like copying to host memory would synchronize before returning, and kernel launches wouldn't actually execute until the host --> device copy was done anyway.

I'm a bit surprised that UCX-py is doing something that ends up requiring the default stream to be synchronized. I'd expect the CUDA API calls to either implicitly synchronize things or be enqueued onto the stream.
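As a small illustration of the "synchronize right before the send, not on every `DeviceBuffer` creation" idea, here is a sketch; `transport_send` is a hypothetical stand-in for whatever UCX-py ultimately calls:

```cpp
#include <cstddef>
#include <cstdio>

#include <cuda_runtime.h>

// Hypothetical stand-in for the transport layer; here it only reports what it
// would send once the device data is known to be ready.
void transport_send(void const* device_ptr, std::size_t nbytes) {
  std::printf("sending %zu bytes from %p\n", nbytes, device_ptr);
}

// Synchronize the producing stream once, right before the send, instead of
// synchronizing on every buffer creation.
void send_buffer(void const* device_ptr, std::size_t nbytes, cudaStream_t stream) {
  cudaStreamSynchronize(stream);  // all work that produced device_ptr has finished
  transport_send(device_ptr, nbytes);
}
```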