Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Discuss and document expected stream synchronization behavior of libcudf functions #4511

Closed
jrhemstad opened this issue Mar 14, 2020 · 21 comments · Fixed by #11853
Closed
Labels
doc Documentation libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS

Comments

@jrhemstad
Copy link
Contributor

jrhemstad commented Mar 14, 2020

Report needed documentation

Today, the (a)synchronous behavior of cuDF/libcudf functions is not formally documented (nor has it really been discussed to my knowledge).

The current state of affairs is that no libcudf function explicitly synchronizes before returning. This means that the result buffers from libcudf functions may not immediately be safe to use.

Thus far, this has been okay because of a few reasons:

  • We only use the default stream
  • Any operation we do on the result buffer is stream-ordered

This means that for any function:

result = cudf::function(...)

when cudf::function returns, the buffer in result may not be safe to immediately use. However, any operations like copying result to host or using it in another kernel are all safe because they are stream-ordered. Meaning, any stream ordered operations after cudf::function will be en-queued in the stream such that by the time those operations occur, previous operations in the stream will have completed (i.e., result is safe to use).

Where this breaks down:

  • Anyone that uses a non-blocking, non-default stream
  • Anything that assumes the result buffer is immediately safe to use (like UCX)

None of this information is really documented anywhere, so we should probably remedy that.

Describe the documentation you'd like

The synchronous behavior of the library is very important to get right. We should have a conversation about what we have now and if it is what we think is correct for the long term future of the library.

Compare to Thrust which takes a more pessimistic approach to synchronization and synchronizes the default stream before returning from functions like reductions or constructing a device_vector.

@jrhemstad jrhemstad added doc Documentation libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Mar 14, 2020
@jrhemstad
Copy link
Contributor Author

This conversation relates to #925 and if/when we ever want to make streams a part of the public interface.

@jrhemstad jrhemstad added the Spark Functionality that helps Spark RAPIDS label Mar 19, 2020
@lightsighter
Copy link

In Legate, we have two problems with using the current synchronization model in cuDF. First, all of our streams are non-blocking, non-default streams. This allows us to convert task parallelism discovered by the Legion runtime into kernel parallelism visible to the CUDA driver. This does make using cuDF currently unsafe unless we can understand the synchronization model for "effects" (kernel launches and asynchronous memcpy operations) performed during a call into cuDF. There's at least two possible solutions to this problem (maybe more):

  1. Most CUDA libraries today solve this by having a call to set a "current stream" for the library to use, like cublas
  2. An alternative option would be for every call to take an optional cudaEvent_t* which defaults to NULL and if the user passes in a non-NULL pointer, then the library will return a CUDA event that encapsulates the effects of that cuDF API call.

I think we would have a slight preference for the second option, as it keeps the library implementation functional and stateless, which means that it can be shared across multiple GPUs and can be called into simultaneously with different CUDA contexts from different threads.

The second problem that we have is that it is not obvious which cuDF API calls contain synchronization calls which are in violation of cudaStreamCapture semantics and therefore it is difficult to know when it might be safe for users to use cudaStreamCapture around cuDF calls.

The first problem has a significantly higher priority to us than the second, but for completeness I decided to include it.

@harrism
Copy link
Member

harrism commented Apr 22, 2020

Option 3. would be for each libcudf API to take a stream parameter, and to order the API's actions in that stream. This avoids the statefulness of option 1 (inconsistent with current libcudf design) and returning outputs via pointer parameter (inconsistent with current libcudf design) of option 2.

@lightsighter
Copy link

That would work for us as long as all "effects" performed by the cudf API call would be encapsulated on that stream. Some cursory investigation seems to show that there are a few API calls in cudf that make or use their own streams for internal work. I haven't checked to see whether all the kernels/copy operations placed on those internal streams make it back to the default stream or a potential external stream passed in by the user.

@harrism
Copy link
Member

harrism commented Apr 22, 2020

I agree, and I don't think it would be useable if libcudf forked onto internal streams without waiting on events in those streams in the externally specified stream.

I count only one call to cudaStreamCreate() in non-legacy code in libcudf currently (it's in groupby_without_aggregation.cu).

Correction, that's legacy too. There are no calls to cudaStreamCreate() in libcudf currently except in tests.

@jrhemstad
Copy link
Contributor Author

What makes me uncomfortable about Option 3 and just adding a stream parameter to every API is it's potential for being a foot gun. E.g.,

Option A:

column result = cudf::some_function(..., s1);

The user has to be very aware that they cannot touch result until you've synchronized with s1.

I think a futures based API would be much safer and explicit about the synchronization semantics.

Option B

future<column> result = cudf::some_function(...);

result.wait(); // performs necessary synchronization to ensure the result column is safe to use

Note that in this example an explicit stream wasn't used. The idea would be that libcudf would abstract away the use of a non-default stream by using something like a stream pool. Granted, that's a lot more work. Alternatively, we could mix futures and explicit streams:

Option C:

future<column> result = cudf::some_function(..., s1);
result.wait(); // performs necessary synchronization to ensure the result column is safe to use

C is a bit of a middle ground between A and B. It alleviates some of the potential foot gun of A by making the caller explicitly aware that the result is asynchronously returned, without the added libcudf complexity of managing a stream pool in B.

@jlowe
Copy link
Member

jlowe commented Apr 22, 2020

How would options B and C support avoiding synchronization when it is unnecessary? I can see how these approaches make it harder to avoid races, but it also seems like it makes it harder to avoid unnecessary synchronization. Chaining a series of libcudf calls would be painful if the caller needs to wait() on an output in order to pass it as an input to the next libcudf call.

Would libcudf methods start taking future<column> as input parameters and "peer under the hood" to find the underlying stream to know if they need to explicitly synchronize? Would applications using their own kernels or other CUDA-based libs always have to synchronize libcudf results using wait() or would there be a way to determine if synchronization is necessary given a stream to be used for further processing?

@jrhemstad
Copy link
Contributor Author

How would options B and C support avoiding synchronization when it is unnecessary?

Yeah, this has always been the roadblock for me in terms of figuring out how we could have a future-based API. It seems like we'd need to re-architect all the APIs to not only return futures, but also accept them as input parameters (which really messes things up like the separation between column and column_view).

Alternatively, one idea is to have future::get take a stream parameter:

future<column> result = cudf::some_function(..., s1);

column = result.get(s1); // detects that you're retrieving the object on the same stream as it was generated, no sync performed

@lightsighter
Copy link

I'm not sure about your implementation of the column type, but you could maybe make the column type wrap the future internally. It would add a layer of indirection inside your implementation in that you would need to check each column argument to see if it was the actual version or the future version and handle each appropriately, but it also means you wouldn't need to support a "split API" where you have future and non-future implementations of every API call. If users query values of the column that need to return actual values and there is a future inside then you could implicitly perform the synchronization for them to keep them safe.

@jrhemstad
Copy link
Contributor Author

you could maybe make the column type wrap the future internally.

It's a good idea, but unfortunately impossible.

All libcudf APIs are setup like this:

column some_function(column_view input);

The output is an owning cudf::column object, the input is a non-owning view object (similar to a std::string vs std::string_view).

We could hide the future inside column, but not in column_view since column_view is a simple, non-owning type.

@harrism
Copy link
Member

harrism commented Apr 24, 2020

Thrust solves this problem by having execution policies that can depend on futures:

auto x = thrust::reduce_async(...)
auto y = thrust::reduce_async(...) // independent of x
thrust::inclusive_scan(thrust_after(x, y), ...) // dependent on x and y

I may not have the syntax exactly right because I can't find an actual example of this, only a mention in the release notes: https://github.com/thrust/thrust/releases/tag/1.9.4

And probably not a great example since my inclusive_scan above wouldn't actually depend on the reduces before it. But you get the idea.

@trevorsm7
Copy link
Contributor

I went looking for documentation on the thrust::async stuff but wasn't able to find any either. The execution policy that uses these looks interesting, and it also seems you convert this to a thrust:::event which maybe exposes a cudaEvent that can be used with cudaStreamWaitEvent()? The release notes also mention that the thrust::future handles keeping temporary allocations alive until the result is ready, which may not be necessary with rmm allocations (deallocation is asynchronous and stream ordered, right?) but seems useful to consider.

I'm wondering though if it might make sense for streams to be added as members of both the column and column_view classes. This would of course mean that a function operating on multiple columns could potentially be receiving multiple input streams that would need to be synchronized with each other internally. For example:

// Do some async operation on data_a in stream_a
some_kernel<<<grid,block,0,stream_a>>>(data_a);

// Do another async operation on data_b in stream_b
another_kernel<<<grid,block,0,stream_b>>>(data_b);

// Create column_views with the streams they're ordered on
auto a = cudf::column_view{data_type{INT32}, size_a, data_a, stream_a};
auto b = cudf::column_view{data_type{INT32}, size_b, data_b, stream_b};

// Concatenate should sync stream_a and stream_b as appropriate (cudaStreamWaitEvent)
// An additional output stream parameter to use for the result
unique_ptr<column> c = cudf::concatenate({a, b}, stream_c);
assert(c->view().stream() == stream_c);

With an interface like this, I don't think column_view would need any ownership of streams, but we could still consider adding an owning stream type to column. For example if the user doesn't provide an output stream, rather than use the default stream, we could return one from an rmm-managed stream pool.

// Columns a and b are each allocated in their own streams from the rmm pool
unique_ptr<column> a = cudf::make_numeric_column(data_type{INT32}, 100);
unique_ptr<column> b = cudf::make_numeric_column(data_type{INT32}, 100);

// Column c is computed in its own stream, which is appropriately synchronized with the
// streams owned by a and b
unique_ptr<column> c = cudf::concatenate({a->view(), b->view()});

// The streams return to the pool when columns a, b, and c go out of scope

@harrism
Copy link
Member

harrism commented May 19, 2020

// Columns a and b are each allocated in their own streams from the rmm pool
unique_ptr<column> a = cudf::make_numeric_column(data_type{INT32}, 100);
unique_ptr<column> b = cudf::make_numeric_column(data_type{INT32}, 100);

I think unnecessarily creating new streams is dangerous, even with a stream pool that makes it cheap. The reason is that the larger the number of streams used to allocate memory asynchronously from a pool allocator, the greater the amount of temporal fragmentation of the pool and the greater the frequency of stream synchronization.

  • Allocations on a stream can reuse blocks freed on the same stream without synchronization, but must synchronize to use blocks freed on other streams
  • Blocks freed on stream A are not coalesced with blocks freed on other streams until one of them is synchronized (so having many streams causes temporal fragmentation)
  • Frequent allocation / freeing across multiple streams could cause "synchronization thrashing" because when stream A steals from stream B stream B is synchronized and all the blocks from stream B's list are moved to stream A's list. Don't want blocks to unnecessarily move from free list to free list...

These are all just current heuristics in the allocator, could change / optimize them as needed. But I want us to be aware of the impacts of automatically creating more streams.

@trevorsm7
Copy link
Contributor

That's a good point, I wasn't considering how pooled allocations would be affected. I'm not necessarily advocating for using a large number of streams, just trying to consider how the API could be adjusted so that it's possible to do so safely. But to your point, maybe it would be better to keep stream parameters explicit instead of having a default to a pool or default stream.

class managed_stream {
 public:
  managed_stream(cudaStream_t stream): _stream{stream} {} // for wrapping non-managed streams
  managed_stream(unique_ptr<stream_resource> owner):
    _stream{owner->stream()}, _owner{std::move(owner)} {}
 
 private:
  cudaStream_t _stream;
  unique_ptr<stream_resource> _owner; // some raii owning type
}

unique_ptr<column> a = cudf::make_numeric_column(data_type{INT32}, 100, managed_stream{stream_a});
unique_ptr<column> b = cudf::make_numeric_column(data_type{INT32}, 100, pool->allocate());
// column b owns the raii owning type, column a owns nothing

Allocations on a stream can reuse blocks freed on the same stream without synchronization, but must synchronize to use blocks freed on other streams

How does rmm currently handle the latter case? I'm thinking it could record an event at the time the block is freed so that subsequent work in the old stream isn't counted against it, and then use wait event in the new stream rather than stream synchronize.

Coalescing is a tricky point though... I suppose after coalescing, it could hold onto a vector of events to wait on when allocating from that block. Maybe it could keep them sorted and remember the address range they belonged to so it could wait only on the range of events overlapping with a fresh allocation from the coalesced block.

But... I'm not sure if having a potentially high volume of events (one per block returned to rmm) hanging around is a sane thing to do. Having a fat list of events after coalescing a bunch of small allocations could be worse than just dealing with stream synchronization. I'm just thinking out loud with only a vague idea of how rmm works internally 😅

@trevorsm7
Copy link
Contributor

@jrhemstad @harrism Have you had any more ideas about streams, either in public APIs or the stream types we discussed for internal APIs?

@jrhemstad
Copy link
Contributor Author

jrhemstad commented Jun 8, 2020

@jrhemstad @harrism Have you had any more ideas about streams, either in public APIs or the stream types we discussed for internal APIs?

Last time we talked about it we decided we would get per-thread default stream working and see how far that gets us.

We still need a strongly typed, non-owning cudaStream_t type.

@harrism
Copy link
Member

harrism commented Jun 9, 2020

I plan to start working on the CUDA event handling needed in pool_memory_resource to allow PTDS very soon.

@harrism
Copy link
Member

harrism commented Jul 1, 2020

Update, making progress: rapidsai/rmm#425

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@harrism
Copy link
Member

harrism commented Mar 16, 2021

This is still (more than ever) relevant.

@bdice bdice changed the title [DOC] Discuss and document expected synchronization behavior of libcudf functions [DOC] Discuss and document expected stream synchronization behavior of libcudf functions Aug 22, 2022
@vyasr vyasr added this to the Enable streams milestone Oct 17, 2022
rapids-bot bot pushed a commit that referenced this issue Oct 18, 2022
This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and then revisit a better place once we've tackled #11481.

Partly addresses #5505, #1781.

Resolves #4511.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #11853
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Documentation libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants