Add a CUDA stream pool #659
Changes from all commits: 7bdda8e, 2701973, 83b400a, 8306dc1, 8f9ab33, 9903cb6, 9c7f957, a950c7d, 76825a0
New file: stream pool benchmark (+50 lines)

```cpp
/*
 * Copyright (c) 2020, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <benchmark/benchmark.h>

#include <rmm/cuda_stream_pool.hpp>
#include <rmm/detail/error.hpp>

#include <cuda_runtime_api.h>

#include <stdexcept>

static void BM_StreamPoolGetStream(benchmark::State& state)
{
  rmm::cuda_stream_pool stream_pool{};

  for (auto _ : state) {
    auto s = stream_pool.get_stream();
    auto e = cudaStreamQuery(s.value());
  }

  state.SetItemsProcessed(state.iterations());
}
BENCHMARK(BM_StreamPoolGetStream)->Unit(benchmark::kMicrosecond);

static void BM_CudaStreamClass(benchmark::State& state)
{
  for (auto _ : state) {
    auto s = rmm::cuda_stream{};
    auto e = cudaStreamQuery(s.view().value());
  }

  state.SetItemsProcessed(state.iterations());
}
BENCHMARK(BM_CudaStreamClass)->Unit(benchmark::kMicrosecond);

BENCHMARK_MAIN();
```
New file: the `rmm::cuda_stream_pool` header (+69 lines)

```cpp
/*
 * Copyright (c) 2020, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#pragma once

#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <atomic>
#include <vector>

namespace rmm {

/**
 * @brief A pool of CUDA streams.
 *
 * Provides efficient access to a collection of CUDA stream objects.
 *
 * Successive calls may return a `cuda_stream_view` of identical streams. For example, a possible
 * implementation is to maintain a circular buffer of `cuda_stream` objects.
 */
class cuda_stream_pool {
 public:
  static constexpr std::size_t default_size{16};  ///< Default stream pool size

  /**
   * @brief Construct a new cuda stream pool object of the given size
   *
   * @param pool_size The number of streams in the pool
   */
  explicit cuda_stream_pool(std::size_t pool_size = default_size) : streams_(pool_size) {}
  ~cuda_stream_pool() = default;

  cuda_stream_pool(cuda_stream_pool&&)                 = delete;
  cuda_stream_pool(cuda_stream_pool const&)            = delete;
  cuda_stream_pool& operator=(cuda_stream_pool&&)      = delete;
  cuda_stream_pool& operator=(cuda_stream_pool const&) = delete;

  /**
   * @brief Get a `cuda_stream_view` of a stream in the pool.
   *
   * This function is thread safe with respect to other calls to the same function.
   *
   * @return rmm::cuda_stream_view
   */
  rmm::cuda_stream_view get_stream() const noexcept
  {
    return streams_[(next_stream++) % streams_.size()].view();
  }

 private:
  std::vector<rmm::cuda_stream> streams_;
  mutable std::atomic_size_t next_stream{};
};

}  // namespace rmm
```

Review thread on `default_size`:

> Should this be private?

> It's a default parameter to a public constructor, so I assumed it must be public...

> The constructor has access to private fields. There is no reason for any caller to change this default, right?

> From what I remember, Cython doesn't really work well with default args, so they usually need access to the definition of what is being used as the default.

> This is constexpr, so a caller cannot change it. I made it public because I thought it might be useful for users to be able to access the default size at run time.
New file: stream pool tests (+60 lines)

```cpp
/*
 * Copyright (c) 2020, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <rmm/cuda_stream_pool.hpp>
#include <rmm/detail/error.hpp>
#include <rmm/device_uvector.hpp>

#include <gtest/gtest.h>

#include <cuda_runtime_api.h>

struct CudaStreamPoolTest : public ::testing::Test {
  rmm::cuda_stream_pool pool{};
};

TEST_F(CudaStreamPoolTest, Unequal)
{
  auto const stream_a = this->pool.get_stream();
  auto const stream_b = this->pool.get_stream();

  EXPECT_NE(stream_a, stream_b);
}

TEST_F(CudaStreamPoolTest, Nondefault)
{
  auto const stream_a = this->pool.get_stream();
  auto const stream_b = this->pool.get_stream();

  // pool streams are explicit, non-default streams
  EXPECT_FALSE(stream_a.is_default());
  EXPECT_FALSE(stream_a.is_per_thread_default());
}

TEST_F(CudaStreamPoolTest, ValidStreams)
{
  auto const stream_a = this->pool.get_stream();
  auto const stream_b = this->pool.get_stream();

  // Operations on the streams should work correctly and without throwing exceptions
  auto v = rmm::device_uvector<std::uint8_t>{100, stream_a};
  RMM_CUDA_TRY(cudaMemsetAsync(v.data(), 0xcc, 100, stream_a.value()));
  stream_a.synchronize();

  auto v2 = rmm::device_uvector<uint8_t>{v, stream_b};
  auto x  = v2.front_element(stream_b);
  EXPECT_EQ(x, 0xcc);
}
```
Review discussion:

> This makes this pool unusable for DALI - unwittingly submitting new (and potentially urgent) work on a stream that already has hundreds of milliseconds' worth of work scheduled is not an option.

> You do know that there are hardware limits to concurrency, right? See table 15 in the CUDA programming guide. The current high-end GPUs have at most 128 concurrent execution contexts, so using more than 128 streams is redundant -- you will get false dependencies even with more streams.
>
> The size of the pool is configurable. Set it to 128 if you want 128 streams.
>
> We can also discuss what the best default size is before we merge this. I chose 16 because that is the minimum maximum concurrency of GPUs that RAPIDS supports today. I didn't want to create hundreds of streams if the user doesn't need them.
>
> We can also consider a pool that can grow, with a parameterized maximum size.

> As we make the pool more complex, we can also have a notion of normal streams and high-priority streams, e.g., `get_stream` and `get_priority_stream`.

> I guess that we have a very different idea of a stream pool then.
>
> My idea (which I'm going to pursue in DALI) was to have a pool where you take streams from the pool and explicitly return them. That's more or less equivalent to create/destroy, but without the ugly handle reuse.
>
> The pool would have two lists: a ready list and a free list.
>
> When a client requests a stream, it's taken from the ready list. When it's returned to the pool, it's placed in the free list. Then there are several options.
>
> In any case, if there are no ready streams and no free streams, a new stream is created and returned.

> I did consider a design with get_stream and return_stream, but I think this is early optimization. The present design is extremely simple, and will likely provide good performance for many use cases. If and when we find that the performance is insufficient, we can add complexity as needed. Certainly, a design that requires worker threads is much more complex, and I will not add that complexity to RMM without a real-world benchmark to drive it.