[BUG] Shouldn't need to synchronize on creation of DeviceBuffer even in the default stream #313

Closed
kkraus14 opened this issue Feb 27, 2020 · 38 comments
Labels: bug (Something isn't working), inactive-30d, inactive-90d, Python (Related to RMM Python API)

Comments

@kkraus14 (Contributor)

We recently merged a PR that does a cudaStreamSynchronize on any DeviceBuffer creation on the default stream. This was primarily to fix a UCX-py issue we were seeing where we were sending memory before it was necessarily computed.

@jakirkham The more I think about this, the more I think we should be synchronizing before a send in UCX-py rather than here. In general, calls should be enqueued onto the default stream, and no synchronization should be needed in DeviceBuffer creation to guarantee correct results / behavior. Things like copying to host memory would synchronize before returning, and kernel launches wouldn't actually execute until the host --> device copy was done anyway.

I'm a bit surprised that UCX-py is doing something that ends up requiring the default stream to be synchronized. I'd expect the CUDA API calls to either implicitly synchronize things or be enqueued onto the stream.

@kkraus14 added the bug (Something isn't working) and Python (Related to RMM Python API) labels on Feb 27, 2020
@jakirkham (Member)

From my perspective this is the concerning situation.

hb = np.ones((n,), dtype="u1")

hb_ptr = hb.__array_interface__["data"][0]
hb_size = hb.nbytes

rdb = rmm.DeviceBuffer(ptr=hb_ptr, size=hb_size)

hb[:] = 2

# what is `rdb` here?

Now it may be that synchronization already occurs here for other reasons (I know I'm not the expert here, so please weigh in). However, if it does already occur, it is no worse if we perform the synchronization ourselves. If not, we should either be synchronizing to ensure data validity, or we should be giving some thought to the APIs exposed to Python users (e.g. is having a ptr in the constructor reasonable?).

@pentschev (Member)

I did some testing with the examples where we were seeing accuracy issues and wasn't able to reproduce them anymore. For my tests I used conda packages from today (0.13.0a200227), from two days ago (0.13.0a200225), and from ten days ago (0.13.0a200217). In none of the builds was I able to reproduce the accuracy issues. I also didn't experience any performance regressions with #309, which is a good thing, but mainly it seems that the PR had no influence on the issue, which appears to have been fixed already for some time.

@kkraus14 (Contributor, Author)

hb[:] = 2

# what is `rdb` here?

https://github.com/rapidsai/rmm/blob/branch-0.13/include/rmm/device_buffer.hpp#L120
https://github.com/rapidsai/rmm/blob/branch-0.13/python/rmm/_lib/device_buffer.pxd#L30

The input is marked as const here, which suggests this shouldn't be allowed, but because we're just casting an integer to a const void* we seem to be bypassing that safety and allowing this UB.

@jrhemstad (Contributor)

hb[:] = 2

# what is `rdb` here?

https://github.com/rapidsai/rmm/blob/branch-0.13/include/rmm/device_buffer.hpp#L120
https://github.com/rapidsai/rmm/blob/branch-0.13/python/rmm/_lib/device_buffer.pxd#L30

The input is marked as const here, which suggests this shouldn't be allowed, but because we're just casting an integer to a const void* we seem to be bypassing that safety and allowing this UB.

Not quite. When a C++ function marks its input as const, as in:

void do_stuff(foo const * f);

That's telling you that do_stuff will treat f as const and not modify it. It does not mean all fs passed into do_stuff need to be const.

foo const c_f;
foo f;

do_stuff(&c_f); // this is fine
do_stuff(&f); // this is fine too

As a general rule (with some caveats), it's always okay to add const to a function parameter. All it means is that the function won't try to modify it.

Long story short, there's no reason to be casting to void const* (that is technically UB); you can just cast to void* and pass it into the function and it will work.
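As a quick illustration of the point above (a minimal sketch; copy_from and the other names here are made up for illustration, not RMM's API), an integer address can be cast to a plain void* and passed to a parameter declared void const*:

#include <cstdint>

// Promises only that it won't modify *src.
void copy_from(void const* /*src*/) {}

void example(std::uintptr_t addr) {
  // No need to cast to void const*; a void* converts implicitly at the call.
  void* p = reinterpret_cast<void*>(addr);
  copy_from(p);
}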

@kkraus14 (Contributor, Author)

Thanks for the explanation @jrhemstad. Either way, it seems like we're in a bit of a pickle: we depend on synchronous behavior for our DeviceBuffer Cython class construction, but the only way to achieve that on the default stream is to synchronize the entire device.

I would welcome ideas / suggestions here on a more elegant / performant solution.

@jrhemstad (Contributor)

hb = np.ones((n,), dtype="u1")

hb_ptr = hb.__array_interface__["data"][0]
hb_size = hb.nbytes

rdb = rmm.DeviceBuffer(ptr=hb_ptr, size=hb_size)

hb[:] = 2

# what is `rdb` here?

To @jakirkham's point, this is a race condition. device_buffer uses a cudaMemcpyAsync which returns control to the host immediately. So you could be writing into hb concurrently while device_buffer is copying from it.
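A minimal CUDA sketch of the hazard described above and the synchronization that removes it (illustrative only; copy_then_modify is a made-up helper, not device_buffer's actual code):

#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>

void copy_then_modify(unsigned char* h_buf, std::size_t n, cudaStream_t stream) {
  unsigned char* d_buf = nullptr;
  cudaMalloc(&d_buf, n);

  // Returns control to the host immediately; the copy may still be in flight.
  cudaMemcpyAsync(d_buf, h_buf, n, cudaMemcpyHostToDevice, stream);

  // Writing to h_buf here would race with the in-flight copy.
  cudaStreamSynchronize(stream);  // after this the copy has completed
  std::memset(h_buf, 2, n);       // now safe to modify the host buffer

  cudaFree(d_buf);
}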

@kkraus14 (Contributor, Author)

Yes, which is what we're trying to guard against, ideally without synchronizing the entire device.

@jrhemstad (Contributor)

I would welcome ideas / suggestions here on a more elegant / performant solution.

One option is to just always build libcudf with PTDS mode enabled.
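For reference, PTDS is normally a build-time option; a minimal sketch of the usual ways to turn it on (assumed build setup, not libcudf's actual build system):

// Option 1: compile device code with nvcc's flag:
//   nvcc --default-stream per-thread ...
// Option 2: for host-compiled translation units, define the macro before
// including the CUDA runtime headers:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

// With PTDS enabled, work submitted to "stream 0" in each host thread goes to
// that thread's own default stream rather than the single, globally
// synchronizing null stream.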

@kkraus14 (Contributor, Author)

I would welcome ideas / suggestions here on a more elegant / performant solution.

One option is to just always build libcudf with PTDS mode enabled.

That doesn't affect this Cython code that produces its own .so though, right?

@jrhemstad (Contributor)

That doesn't affect this Cython code that produces its own .so though, right?

Can you specify the flag to Cython compilation?

@kkraus14 (Contributor, Author)

That doesn't affect this Cython code that produces its own .so though, right?

Can you specify the flag to Cython compilation?

Yes we likely can, but does it work with cnmem under the hood?

@jrhemstad (Contributor)

Yes we likely can, but does it work with cnmem under the hood?

Not if you try and use CNMEM w/ multiple threads. If you're only using one thread then it's fine.

@jrhemstad (Contributor)

Stepping back a moment, the real problem with using/syncing the default stream all over the place is that it is implicitly synchronous with other streams.

However, there is a workaround for this that doesn't involve PTDS. If/when we want to create additional streams in libcudf or cuDF, we can specify on creation that they should not be synchronous with the default stream by using the cudaStreamNonBlocking flag.

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);

kernelA<<<...,0>>>(...); // null stream
kernelB<<<...,s0>>>(...); // this cannot overlap with kernelA
kernelC<<<...,s1>>>(...); // this CAN overlap with kernelA

See https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1gb1e32aff9f59119e4d0a9858991c4ad3

@kkraus14 (Contributor, Author)

I believe ucx-py / distributed end up using some threads that may or may not create rmm.DeviceBuffer objects, so I'm not sure whether using PTDS is feasible. @jakirkham @pentschev @madsbk would know more than me.

@jakirkham (Member)

Thanks all for thinking about this. 🙂

What if we start relying on Numba/Dask-CUDA to provide a stream per worker which we sync on instead? This came up a bit in issue ( rapidsai/dask-cuda#96 ). Are there any challenges we foresee with this proposal?

@kkraus14 (Contributor, Author)

Are there any challenges we foresee with this proposal?

We would then need to be able to hand the Python stream abstraction through every library that we use, and that's not currently possible as far as I know. Additionally, it requires libcudf supporting explicitly passing a stream to its C++ APIs, and I'm not sure of the state of that.

@harrism (Member) commented Feb 27, 2020

Additionally, it requires libcudf supporting explicitly passing a stream to its C++ APIs, and I'm not sure of the state of that.

It's something we were hoping to avoid, but could consider. We were hoping to take the Thrust route and use futures instead of streams...

@harrism (Member) commented Feb 27, 2020

Back to the original question about the race condition in @jakirkham's snippet: what if we changed device_buffer to use cudaMemcpy instead of cudaMemcpyAsync when the stream is 0? I.e., make device_buffer fully synchronous on the default stream (or possibly only on the default stream when PTDS is NOT enabled).
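A rough sketch of what that could look like in the copy path (illustrative only; copy_to_device is a made-up helper, not device_buffer's actual code):

#include <cuda_runtime.h>
#include <cstddef>

void copy_to_device(void* dst, void const* src, std::size_t bytes, cudaStream_t stream) {
  bool const is_default_stream = (stream == nullptr);  // stream handle 0
  if (is_default_stream) {
    // Fully synchronous: safe to modify the source host buffer on return.
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
  } else {
    // Asynchronous with respect to the host; the caller must synchronize on
    // the stream before touching the source buffer again.
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
  }
}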

@jakirkham (Member)

Thanks for the feedback. Yeah, it's good to think through the impact here. Agreed, that seems pretty large in scope. What if this were something that was tidily handled under the hood in RMM's DeviceBuffer (meaning the Cython side), so as not to require changes to callers themselves?

Yeah that's a good point Mark. Using cudaMemcpy sounds like a good idea to me 🙂

@harrism (Member) commented Feb 28, 2020

Well, @jrhemstad wrote device_buffer, so he should cast an opinion. I was asking a "what if", not expressing an opinion on whether it's a good idea! :)

@jrhemstad (Contributor)

I don't really think there's any observable difference in using cudaMemcpy vs. a cudaMemcpyAsync(...,0) + cudaStreamSynchronize(0), but I'd have to think about it a little more.

@jrhemstad (Contributor)

Yeah, as I expected, there's effectively no difference between cudaMemcpy vs. cudaMemcpyAsync + cudaStreamSynchronize.

See: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html#api-sync-behavior

For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated.

So cudaMemcpy is internally doing a cudaStreamSynchronize(0).

@jakirkham (Member)

So it’s not associating it with a non-default stream internally?

@jrhemstad (Contributor)

No.

@kkraus14 (Contributor, Author)

@jrhemstad what if the cudaMemcpyAsync is using non-pinned host memory? I would assume that would have similar synchronization behavior? That would be the case for us about 99.9999% of the time in Python currently.

@jrhemstad (Contributor)

@jrhemstad what if the cudaMemcpyAsync is using non-pinned host memory? I would assume that would have similar synchronization behavior? That would be the case for us about 99.9999% of the time in Python currently.

Yep, cudaMemcpyAsync is always async with respect to the host. The use of pinned buffers affects its ability to be async with respect to the device and overlap copy/compute.
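For completeness, a small sketch of the pinned-memory case being discussed (illustrative only; pinned_copy is a made-up helper): with page-locked host memory the async copy can actually overlap with other device work, which pageable memory prevents.

#include <cuda_runtime.h>
#include <cstddef>

void pinned_copy(std::size_t n, cudaStream_t stream) {
  unsigned char* h_pinned = nullptr;
  unsigned char* d_buf = nullptr;
  cudaMallocHost(&h_pinned, n);  // page-locked host allocation
  cudaMalloc(&d_buf, n);

  // With pinned memory this copy can proceed via DMA and overlap with work in
  // other streams; with pageable memory it is staged through a bounce buffer.
  cudaMemcpyAsync(d_buf, h_pinned, n, cudaMemcpyHostToDevice, stream);

  cudaStreamSynchronize(stream);
  cudaFreeHost(h_pinned);
  cudaFree(d_buf);
}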

@jakirkham (Member)

Ok so maybe I’m missing something. Is RMM thread-safe?

@jrhemstad (Contributor) commented Feb 29, 2020

Ok so maybe I’m missing something. Is RMM thread-safe?

RMM is not really an entity that can or cannot be threadsafe.

RMM is simply a facade over device_memory_resources that are or are not threadsafe. All APIs related to setting/getting the default are threadsafe, but that's about the limit of APIs RMM has independent of resources.

All resources we currently have are threadsafe, including cnmem_memory_resource, which is the resource we currently use when you think of "pool" mode (though it is not the only pool resource we have in the works).

Though if/when we have an individual resource that is not threadsafe, it is simple enough to make it threadsafe through an adaptor.
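A minimal sketch of that adaptor idea (the class and member names below are illustrative, not RMM's actual interface): wrap the non-thread-safe resource and guard allocate/deallocate with a mutex.

#include <cstddef>
#include <mutex>

// Stand-in for any upstream memory resource interface.
class memory_resource {
 public:
  virtual ~memory_resource() = default;
  virtual void* allocate(std::size_t bytes) = 0;
  virtual void deallocate(void* p, std::size_t bytes) = 0;
};

// Serializes all calls into an upstream resource that is not thread-safe.
class thread_safe_adaptor final : public memory_resource {
 public:
  explicit thread_safe_adaptor(memory_resource* upstream) : upstream_{upstream} {}

  void* allocate(std::size_t bytes) override {
    std::lock_guard<std::mutex> lock(mtx_);
    return upstream_->allocate(bytes);
  }

  void deallocate(void* p, std::size_t bytes) override {
    std::lock_guard<std::mutex> lock(mtx_);
    upstream_->deallocate(p, bytes);
  }

 private:
  memory_resource* upstream_;
  std::mutex mtx_;
};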

@jakirkham (Member)

Thanks for the clarifications Jake. Sorry I may have misunderstood something about RMM and multithreading previously.

@harrism (Member) commented Mar 2, 2020

@jakirkham You are probably remembering me saying that the memory_resources in #162 are not thread safe. They are not (we will wrap them in another resource that provides thread safety). But cnmem is internally thread safe, so the cnmem_memory_resource is as well.

@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@harrism (Member) commented Feb 16, 2021

@kkraus14 still relevant?

@jakirkham (Member)

I think @pentschev already fixed this in PR ( #650 )

@harrism closed this as completed on Feb 17, 2021