[BUG] Shouldn't need to synchronize on creation of DeviceBuffer even in the default stream #313

Closed
kkraus14 opened this issue Feb 27, 2020 · 38 comments
Labels: bug (Something isn't working), inactive-30d, inactive-90d, Python (Related to RMM Python API)

Comments

@kkraus14 (Contributor)

We recently merged a PR that does a cudaStreamSynchronize on any DeviceBuffer creation on the default stream. This was primarily to fix a UCX-py issue we were seeing where we were sending memory before it was necessarily computed.

@jakirkham The more I think about this, the more I think we should be synchronizing before a send in UCX-py rather than here. In general, calls should be enqueued onto the default stream, and no synchronization should be needed in DeviceBuffer creation to guarantee correct results / behavior. Things like copying to host memory would synchronize before returning, and kernel launches wouldn't actually execute until the host --> device copy was done anyway.

I'm a bit surprised that UCX-py is doing something that ends up requiring the default stream to be synchronized. I'd expect the CUDA API calls to either implicitly synchronize things or be enqueued onto the stream.

@kkraus14 added the bug (Something isn't working) and Python (Related to RMM Python API) labels on Feb 27, 2020
@jakirkham (Member)

From my perspective this is the concerning situation.

hb = np.ones((n,), dtype="u1")

hb_ptr = hb.__array_interface__["data"][0]
hb_size = hb.nbytes

rdb = rmm.DeviceBuffer(ptr=hb_ptr, size=hb_size)

hb[:] = 2

# what is `rdb` here?

Now it may be that synchronization already occurs here for other reasons (I know I'm not the expert here, so please weigh in). However, if it does already occur, it is no worse if we perform the synchronization ourselves. If not, we should either be synchronizing to ensure data validity, or we should be giving some thought to the APIs exposed to Python users (e.g. is having a ptr in the constructor reasonable?).

@pentschev (Member)

I did some testing with the examples where we were seeing accuracy issues and wasn't able to reproduce them anymore. For my tests I used conda packages from today (0.13.0a200227), from two days ago (0.13.0a200225), and from ten days ago (0.13.0a200217). In none of the builds was I able to reproduce the accuracy issues. I also didn't experience any performance regressions with #309, which is a good thing, but mainly it seems that the PR had no influence on the issue, which appears to have been fixed already for some time.

@kkraus14 (Contributor, Author)

hb[:] = 2

# what is `rdb` here?

https://github.com/rapidsai/rmm/blob/branch-0.13/include/rmm/device_buffer.hpp#L120
https://github.com/rapidsai/rmm/blob/branch-0.13/python/rmm/_lib/device_buffer.pxd#L30

The input is marked as const here, which suggests this shouldn't be allowed, but because we're just casting an integer to a const void* we seem to be bypassing that safety and allowing this UB.

@jrhemstad (Contributor)

hb[:] = 2

# what is `rdb` here?

https://github.com/rapidsai/rmm/blob/branch-0.13/include/rmm/device_buffer.hpp#L120
https://github.com/rapidsai/rmm/blob/branch-0.13/python/rmm/_lib/device_buffer.pxd#L30

The input is marked as const here, which suggests this shouldn't be allowed, but because we're just casting an integer to a const void* we seem to be bypassing that safety and allowing this UB.

Not quite. When a C++ function marks its input as const, as in:

void do_stuff(foo const * f);

That's telling you that do_stuff will treat f as const and not modify it. It does not mean all fs passed into do_stuff need to be const.

foo const c_f;
foo f;

do_stuff(&c_f); // this is fine
do_stuff(&f); // this is fine too

As a general rule (with some caveats), it's always okay to add const to a function parameter. All it means is that the function won't try to modify it.

Long story short, there's no reason to be casting to void const* (that is technically UB); you can just cast to void* and pass it into the function and it will work.
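As a quick illustration of the point above (a minimal sketch; copy_from and the other names here are made up for illustration, not RMM's API), an integer address can be cast to a plain void* and passed to a parameter declared void const*:

#include <cstdint>

// Promises only that it won't modify *src.
void copy_from(void const* /*src*/) {}

void example(std::uintptr_t addr) {
  // No need to cast to void const*; a void* converts implicitly at the call.
  void* p = reinterpret_cast<void*>(addr);
  copy_from(p);
}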

@kkraus14 (Contributor, Author)

Thanks for the explanation @jrhemstad. Either way, it seems like we're in a bit of a pickle: we depend on synchronous behavior for our DeviceBuffer Cython class construction, but the only way to achieve that on the default stream is to synchronize the entire device.

I would welcome ideas / suggestions here on a more elegant / performant solution.

@jrhemstad (Contributor)

hb = np.ones((n,), dtype="u1")

hb_ptr = hb.__array_interface__["data"][0]
hb_size = hb.nbytes

rdb = rmm.DeviceBuffer(ptr=hb_ptr, size=hb_size)

hb[:] = 2

# what is `rdb` here?

To @jakirkham's point, this is a race condition. device_buffer uses a cudaMemcpyAsync which returns control to the host immediately. So you could be writing into hb concurrently while device_buffer is copying from it.
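A minimal CUDA sketch of the hazard described above and the synchronization that removes it (illustrative only; copy_then_modify is a made-up helper, not device_buffer's actual code):

#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>

void copy_then_modify(unsigned char* h_buf, std::size_t n, cudaStream_t stream) {
  unsigned char* d_buf = nullptr;
  cudaMalloc(&d_buf, n);

  // Returns control to the host immediately; the copy may still be in flight.
  cudaMemcpyAsync(d_buf, h_buf, n, cudaMemcpyHostToDevice, stream);

  // Writing to h_buf here would race with the in-flight copy.
  cudaStreamSynchronize(stream);  // after this the copy has completed
  std::memset(h_buf, 2, n);       // now safe to modify the host buffer

  cudaFree(d_buf);
}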

@kkraus14 (Contributor, Author)

Yes, which is what we're trying to guard against, ideally without synchronizing the entire device.

@jrhemstad (Contributor)

I would welcome ideas / suggestions here on a more elegant / performant solution.

One option is to just always build libcudf with PTDS mode enabled.
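For reference, PTDS is normally a build-time option; a minimal sketch of the usual ways to turn it on (assumed build setup, not libcudf's actual build system):

// Option 1: compile device code with nvcc's flag:
//   nvcc --default-stream per-thread ...
// Option 2: for host-compiled translation units, define the macro before
// including the CUDA runtime headers:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

// With PTDS enabled, work submitted to "stream 0" in each host thread goes to
// that thread's own default stream rather than the single, globally
// synchronizing null stream.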

@kkraus14 (Contributor, Author)

I would welcome ideas / suggestions here on a more elegant / performant solution.

One option is to just always build libcudf with PTDS mode enabled.

That doesn't affect this Cython code that produces its own .so though, right?

@jrhemstad (Contributor)

That doesn't affect this Cython code that produces its own .so though, right?

Can you specify the flag to Cython compilation?

@kkraus14 (Contributor, Author)

That doesn't affect this Cython code that produces its own .so though, right?

Can you specify the flag to Cython compilation?

Yes we likely can, but does it work with cnmem under the hood?

@jrhemstad (Contributor)

Yes we likely can, but does it work with cnmem under the hood?

Not if you try and use CNMEM w/ multiple threads. If you're only using one thread then it's fine.

@jrhemstad (Contributor)

Stepping back a moment, the real problem with using/syncing the default stream all over the place is that it is implicitly synchronous with other streams.

However, there is a workaround for this that doesn't involve PTDS. If/when we want to create additional streams in libcudf or cuDF, we can specify on creation that they should not be synchronous with the default stream by using the cudaStreamNonBlocking flag.

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);

kernelA<<<...,0>>>(...); // null stream
kernelB<<<...,s0>>>(...); // this cannot overlap with kernelA
kernelC<<<...,s1>>>(...); // this CAN overlap with kernelA

See https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1gb1e32aff9f59119e4d0a9858991c4ad3

@kkraus14 (Contributor, Author)

I believe ucx-py / distributed end up using some threads that may or may not create rmm.DeviceBuffer objects, so I'm not sure whether using PTDS is feasible. @jakirkham @pentschev @madsbk would know more than me.

@jakirkham (Member)

Thanks all for thinking about this. 🙂

What if we start relying on Numba/Dask-CUDA to provide a stream per worker which we sync on instead? This came up a bit in issue ( rapidsai/dask-cuda#96 ). Are there any challenges we foresee with this proposal?

@kkraus14 (Contributor, Author)

Are there any challenges we foresee with this proposal?

We would then need to be able to hand the Python stream abstraction through every library that we use, and that's not currently possible as far as I know. Additionally, it requires libcudf supporting explicitly passing a stream to its C++ APIs, and I'm not sure of the state of that.

@harrism (Member) commented Feb 27, 2020

Additionally, it requires libcudf supporting explicitly passing a stream to its C++ APIs, and I'm not sure of the state of that.

It's something we were hoping to avoid, but could consider. We were hoping to take the Thrust route and use futures instead of streams...

@harrism (Member) commented Feb 27, 2020

Back to the original question about the race condition in @jakirkham's snippet: what if we changed device_buffer to use cudaMemcpy instead of cudaMemcpyAsync when the stream is 0? I.e., make device_buffer fully synchronous on the default stream (or possibly only on the default stream when PTDS is NOT enabled).
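A rough sketch of what that could look like in the copy path (illustrative only; copy_to_device is a made-up helper, not device_buffer's actual code):

#include <cuda_runtime.h>
#include <cstddef>

void copy_to_device(void* dst, void const* src, std::size_t bytes, cudaStream_t stream) {
  bool const is_default_stream = (stream == nullptr);  // stream handle 0
  if (is_default_stream) {
    // Fully synchronous: safe to modify the source host buffer on return.
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
  } else {
    // Asynchronous with respect to the host; the caller must synchronize on
    // the stream before touching the source buffer again.
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
  }
}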

@jakirkham (Member)

Thanks for the feedback. Yeah, it's good to think through the impact here. Agreed, that seems pretty large in scope. What if this were something that was tidily handled under the hood in RMM's DeviceBuffer (meaning the Cython side), so as not to require changes to callers themselves?

Yeah that's a good point Mark. Using cudaMemcpy sounds like a good idea to me 🙂

@harrism (Member) commented Feb 28, 2020

Well, @jrhemstad wrote device_buffer, so he should cast an opinion. I was asking a "what if", not expressing an opinion on whether it's a good idea! :)

@jrhemstad (Contributor)

I don't really think there's any observable difference in using cudaMemcpy vs. a cudaMemcpyAsync(...,0) + cudaStreamSynchronize(0), but I'd have to think about it a little more.

@jrhemstad (Contributor)

Yeah, as I expected, there's effectively no difference between cudaMemcpy vs. cudaMemcpyAsync + cudaStreamSynchronize.

See: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html#api-sync-behavior

For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated.

So cudaMemcpy is internally doing a cudaStreamSynchronize(0).

@jakirkham (Member)

So it’s not associating it with a non-default stream internally?

@jrhemstad (Contributor)

No.

@kkraus14 (Contributor, Author)

@jrhemstad what if the cudaMemcpyAsync is using non-pinned host memory? I would assume that would have similar synchronization behavior? That would be the case for us about 99.9999% of the time in Python currently.

@jrhemstad (Contributor)

@jrhemstad what if the cudaMemcpyAsync is using non-pinned host memory? I would assume that would have similar synchronization behavior? That would be the case for us about 99.9999% of the time in Python currently.

Yep, cudaMemcpyAsync is always async with respect to the host. The use of pinned buffers affects its ability to be async with respect to the device and overlap copy/compute.
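For completeness, a small sketch of the pinned-memory case being discussed (illustrative only; pinned_copy is a made-up helper): with page-locked host memory the async copy can actually overlap with other device work, which pageable memory prevents.

#include <cuda_runtime.h>
#include <cstddef>

void pinned_copy(std::size_t n, cudaStream_t stream) {
  unsigned char* h_pinned = nullptr;
  unsigned char* d_buf = nullptr;
  cudaMallocHost(&h_pinned, n);  // page-locked host allocation
  cudaMalloc(&d_buf, n);

  // With pinned memory this copy can proceed via DMA and overlap with work in
  // other streams; with pageable memory it is staged through a bounce buffer.
  cudaMemcpyAsync(d_buf, h_pinned, n, cudaMemcpyHostToDevice, stream);

  cudaStreamSynchronize(stream);
  cudaFreeHost(h_pinned);
  cudaFree(d_buf);
}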

@jakirkham (Member)

Ok so maybe I’m missing something. Is RMM thread-safe?

@jrhemstad (Contributor) commented Feb 29, 2020

Ok so maybe I’m missing something. Is RMM thread-safe?

RMM is not really an entity that can or cannot be threadsafe.

RMM is simply a facade over device_memory_resources that are or are not threadsafe. All APIs related to setting/getting the default are threadsafe, but that's about the limit of APIs RMM has independent of resources.

All resources we currently have are threadsafe, including cnmem_memory_resource, which is the resource we currently use when you think of "pool" mode (though it is not the only pool resource we have in the works).

Though if/when we have an individual resource that is not threadsafe, it is simple enough to make it threadsafe through an adaptor.
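A minimal sketch of that adaptor idea (the class and member names below are illustrative, not RMM's actual interface): wrap the non-thread-safe resource and guard allocate/deallocate with a mutex.

#include <cstddef>
#include <mutex>

// Stand-in for any upstream memory resource interface.
class memory_resource {
 public:
  virtual ~memory_resource() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void deallocate(void* p, std::size_t bytes) = 0;
};

// Serializes all calls into an upstream resource that is not thread-safe.
class thread_safe_adaptor final : public memory_resource {
 public:
  explicit thread_safe_adaptor(memory_resource* upstream) : upstream_{upstream} {}

  void* allocate(std::size_t bytes) override {
    std::lock_guard<std::mutex> lock(mtx_);
    return upstream_->allocate(bytes);
  }

  void deallocate(void* p, std::size_t bytes) override {
    std::lock_guard<std::mutex> lock(mtx_);
    upstream_->deallocate(p, bytes);
  }

 private:
  memory_resource* upstream_;
  std::mutex mtx_;
};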

@jakirkham (Member)

Thanks for the clarifications Jake. Sorry I may have misunderstood something about RMM and multithreading previously.

@harrism (Member) commented Mar 2, 2020

@jakirkham You are probably remembering me saying that the memory_resources in #162 are not thread safe. They are not (we will wrap them in another resource that provides thread safety). But cnmem is internally thread safe, so the cnmem_memory_resource is as well.

@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@harrism (Member) commented Feb 16, 2021

@kkraus14 still relevant?

@jakirkham (Member)

I think @pentschev already fixed this in PR ( #650 )

@harrism closed this as completed on Feb 17, 2021