
Add a CUDA stream pool #659

Merged · 9 commits · Dec 16, 2020
Conversation

@harrism (Member) commented Dec 14, 2020

Closes #613. Adds a simple pool class containing a circular buffer of `cuda_stream` objects, with a `get_stream` method that returns a view of one of the streams. Also adds tests and a simple benchmark.
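The design described above can be sketched in a few lines. This is a simplified stand-alone illustration, not the actual RMM implementation: `stream_handle` here is a hypothetical stand-in for `rmm::cuda_stream_view`, and no CUDA calls are made.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Stand-in for rmm::cuda_stream_view; the real pool stores rmm::cuda_stream
// objects and hands out non-owning views of them.
using stream_handle = int;

class stream_pool {
 public:
  explicit stream_pool(std::size_t size = 16) : streams_(size) {
    for (std::size_t i = 0; i < streams_.size(); ++i) {
      streams_[i] = static_cast<stream_handle>(i);  // real code: create a cuda_stream
    }
  }

  // Round-robin over the fixed set of streams: successive calls cycle
  // through the circular buffer, so two calls may return the same stream.
  stream_handle get_stream() noexcept {
    auto const idx = next_.fetch_add(1, std::memory_order_relaxed);
    return streams_[idx % streams_.size()];
  }

 private:
  std::vector<stream_handle> streams_;
  std::atomic<std::size_t> next_{0};
};
```

Because `get_stream` only increments an atomic counter and indexes a preallocated vector, it avoids the `cudaStreamCreate()`/`cudaStreamDestroy()` round trip, which is the difference the benchmark results below measure.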

Benchmark results. The first line is getting a stream from the stream pool; the second is using the RAII `rmm::cuda_stream` class, which calls `cudaStreamCreate()` and `cudaStreamDestroy()`.

---------------------------------------------------------------------------------
Benchmark                       Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------
BM_StreamPoolGetStream      0.240 us        0.240 us      2918230 items_per_second=4.16296M/s
BM_CudaStreamClass           3.01 us         2.90 us       249073 items_per_second=344.521k/s

TODO in a follow up:

  • Cython/Python bindings
  • Stream priorities and non-blocking streams, once support is added to rmm::cuda_stream().

@harrism harrism added feature request New feature or request 3 - Ready for review Ready for review by team non-breaking Non-breaking change labels Dec 14, 2020
@harrism harrism self-assigned this Dec 14, 2020
@harrism harrism requested review from a team as code owners December 14, 2020 05:24
@kkraus14 (Contributor) left a comment:

cmake / conda LGTM

(Three review threads on include/rmm/cuda_stream_pool.hpp, resolved; comments outdated.)
 */
class cuda_stream_pool {
 public:
  static constexpr std::size_t default_size{16};  ///< Default stream pool size
A contributor commented:

Should this be private?

@harrism (Member, Author) replied:

It's a default parameter to a public constructor, so I assumed it must be public...

The contributor replied:

The constructor has access to private fields. There is no reason for any caller to change this default, right?

Another contributor commented:

From what I remember, Cython doesn't work well with default arguments, so it usually needs access to the definition of whatever is used as the default.

@harrism (Member, Author) replied:

This is `constexpr`, so a caller cannot change it. I made it public because I thought it might be useful for users to be able to access the default size at run time.
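The trade-off under discussion can be seen in a tiny stand-alone sketch (class and member names are simplified stand-ins, not the real RMM declarations): a public `static constexpr` member can serve both as the constructor's default argument and as a value callers can query at run time, while remaining immutable.

```cpp
#include <cstddef>

class pool {
 public:
  static constexpr std::size_t default_size{16};  // public: usable as a default
                                                  // argument and queryable by callers
  explicit pool(std::size_t size = default_size) : size_(size) {}

  std::size_t size() const noexcept { return size_; }

 private:
  std::size_t size_;
};

// pool{}.size() yields 16; pool::default_size can be read anywhere
// but, being constexpr, cannot be assigned to.
```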

 *
 * Provides efficient access to a collection of CUDA stream objects.
 *
 * Successive calls may return a `cuda_stream_view` of identical streams. For example, a possible
A reviewer commented:

This makes this pool unusable for DALI - unwittingly submitting new (and potentially urgent) work on a stream that already has hundreds of milliseconds worth of work scheduled is not an option.

@harrism (Member, Author) commented Dec 15, 2020:

You do know that there are hardware limits on concurrency, right? See Table 15 in the CUDA programming guide. The current high-end GPUs have at most 128 concurrent execution contexts, so using more than 128 streams is redundant; you will get false dependencies even with more streams.

The size of the pool is configurable. Set it to 128 if you want 128 streams.

We can also discuss what the best default size is before we merge this. I chose 16 because that is the lowest maximum concurrency among the GPUs that RAPIDS supports today. I didn't want to create hundreds of streams if the user doesn't need them.

We can also consider a pool that can grow, with a parameterized maximum size.

@jrhemstad (Contributor) commented Dec 15, 2020:

As we make the pool more complex, we could also add a notion of normal streams and high-priority streams, e.g., `get_stream` and `get_priority_stream`.

The reviewer replied:

I guess we have a very different idea of a stream pool, then.

My idea (which I'm going to pursue in DALI) was to have a pool from which you take streams and to which you explicitly return them. That's more or less equivalent to create/destroy, but without the ugly handle reuse.

The pool would have two lists:

  • ready list
  • free list

When a client requests a stream, it's taken from the ready list. When it's returned to the pool, it's placed in the free list. Then there are several options:

  1. Add a callback to the stream, so that it's returned to the ready list when all work is complete.
  2. Call cudaStreamQuery periodically in a worker thread and collect ready streams.
  3. Call cudaStreamQuery on the streams in the free list when there's no ready stream and return the ready ones to the ready list (this is the least preferred option, as it introduces a potentially large and unpredictable overhead).

In any case, if there are no ready streams and none of the free streams have completed their work, a new stream is created and returned.
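A minimal sketch of the take/return design described above, with the CUDA calls mocked out. All names here are hypothetical; a real implementation would hold `cudaStream_t` handles and call `cudaStreamQuery` where `work_done` is checked (option 3 of the three listed above).

```cpp
#include <deque>
#include <vector>

// Mock stream: `work_done` stands in for cudaStreamQuery() returning cudaSuccess.
struct stream {
  int id;
  bool work_done = true;
};

class borrowing_pool {
 public:
  // Take a stream: prefer the ready list; otherwise sweep the free list for
  // streams whose queued work has completed; else create a new one.
  stream take() {
    if (!ready_.empty()) {
      stream s = ready_.front();
      ready_.pop_front();
      return s;
    }
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->work_done) {
        stream s = *it;
        free_.erase(it);
        return s;
      }
    }
    return stream{next_id_++};  // nothing reusable: create a fresh stream
  }

  // Return a stream to the free list; it becomes reusable once its queued
  // work completes (detected by one of the three options listed above).
  void give_back(stream s) { free_.push_back(s); }

 private:
  std::deque<stream> ready_;
  std::vector<stream> free_;
  int next_id_ = 0;
};
```

Unlike the round-robin pool in this PR, each caller gets exclusive use of a stream until it is explicitly returned, at the cost of take/return bookkeeping and a completion-detection mechanism.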

@harrism (Member, Author) commented Dec 16, 2020:

I did consider a design with `get_stream` and `return_stream`, but I think this is premature optimization. The present design is extremely simple and will likely provide good performance for many use cases. If and when we find that the performance is insufficient, we can add complexity as needed. Certainly, a design that requires worker threads is much more complex, and I will not add that complexity to RMM without a real-world benchmark to drive it.

@harrism harrism removed the 3 - Ready for review Ready for review by team label Dec 16, 2020
@rapids-bot rapids-bot bot merged commit efd4c08 into rapidsai:branch-0.18 Dec 16, 2020
Labels: feature request (New feature or request), non-breaking (Non-breaking change)
Projects: none yet
Development: successfully merging this pull request may close issue [FEA] Add a CUDA stream pool
6 participants