[REVIEW] Add per-thread-default stream support to pool_memory_resource using thread-local CUDA events #425
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here.
```cpp
cudaEvent_t get_event(cudaStream_t stream)
{
#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
  if (stream == 0) {
```
I think we need to do something other than check against `0` for the default stream. Someone could pass `cudaStreamPerThread` explicitly, which is different from `0`.
Yeah, thought about that. Need to check for either `0` or `cudaStreamPerThread`, good point.
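For illustration, a minimal sketch of a check that treats both ways of naming the default stream the same (the helper name is hypothetical, not rmm's actual code):

```cpp
#include <cuda_runtime.h>

// Sketch only: a caller may pass the default stream as 0 or name it
// explicitly as cudaStreamPerThread; both must map to the same
// (thread-local) event in the PTDS build.
inline bool is_default_stream(cudaStream_t stream)
{
  return stream == 0 || stream == cudaStreamPerThread;
}

// e.g. in get_event():
//   if (is_default_stream(stream)) { /* return the thread-local default-stream event */ }
```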
Fixed.
Some optimization ideas that are orthogonal to PTDS, but which should improve performance in both PTDS and non-PTDS modes.
```cpp
stream_free_blocks_[stream_event].insert(blocks.begin(), blocks.end());
stream_free_blocks_.erase(blocks_event);
```
It feels like there is a potential optimization here. We're merging two sorted `std::list`s and coalescing adjacent blocks. If we ignored the fact that we're coalescing blocks, you could just use `std::list::merge`.

The current implementation is O(m*n), as it requires doing a linear search of the destination list for every element in the source list. It seems like we should be able to make that O(m + n).

At the very least, I think we should add a `free_list::merge(free_list&& other)` function and use it here. Then we can explore doing something more optimal that exploits the fact that both lists are already sorted by pointer.
One option would be to just use `std::list::merge` and ignore coalescing, then do a second pass through the merged list to coalesce any adjacent blocks. But that operation could likely be fused into a custom merge algorithm.
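A sketch of the two-pass idea (using a simplified block type, not rmm's actual free_list), relying on `std::list::merge` being O(m + n) for two sorted lists and then coalescing adjacent blocks in one sweep:

```cpp
#include <cstddef>
#include <iterator>
#include <list>

// Sketch only: simplified stand-in for a free-list block.
struct block {
  char* ptr{};
  std::size_t size{};
  bool is_contiguous_before(block const& b) const { return ptr + size == b.ptr; }
};

// Merge `other` into `dst` (both sorted by pointer), then coalesce neighbors.
inline void merge_and_coalesce(std::list<block>& dst, std::list<block>&& other)
{
  // O(m + n): splices nodes from `other` while keeping the result sorted.
  dst.merge(other, [](block const& a, block const& b) { return a.ptr < b.ptr; });

  // Second pass: fuse blocks that are adjacent in memory.
  for (auto it = dst.begin(); it != dst.end();) {
    auto next = std::next(it);
    if (next != dst.end() && it->is_contiguous_before(*next)) {
      it->size += next->size;
      dst.erase(next);  // keep `it`; it may absorb further neighbors
    } else {
      it = next;
    }
  }
}
```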
Not finished reviewing; flushing some comments I had so far.
```cpp
TYPED_TEST(MRTest_mt, AllocFreeDifferentThreadsSameStream)
{
  test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream);
```
I think this needs to be:

```diff
- test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream);
+ test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream, this->stream);
```
Otherwise `streamB` will default to 0, and this does not test different threads on the same stream as the test name implies. I think the default arguments for `test_allocate_free_different_threads` are causing more harm than good.
Good catch!
I patched in this PR and ran the TPCH job a bunch of times. I didn't see the occasional memory errors I was getting from CNMeM. Looks solid! One thing that threw me for a loop was that after I changed the JNI code to use …
Yup, once CNMeM is gone rmm will move to being a header-only C++ library 🎉
I noticed a copyright that can be updated. Looks good otherwise.
The new design inextricably links a single CUDA event to every stream (default or otherwise). However, there are a few places where those two pieces of state (the stream and its event) are allowed to be independent variables. I think it would make the code clearer and safer to keep those things tightly coupled when it's expected that they relate to each other, by adding a simple wrapper class for a stream and an event:

```cpp
struct stream_event {
  cudaStream_t stream;
  cudaEvent_t event;
};
```
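If such a wrapper were also used to key the per-stream free-list map (as `stream_free_blocks_[stream_event]` in the diff above suggests), it would need an ordering or hash; a hypothetical sketch building on the struct above:

```cpp
// Hypothetical: order the wrapper by stream so it can key a std::map.
// Assumes one event per stream, so the stream alone identifies the pair.
inline bool operator<(stream_event const& lhs, stream_event const& rhs)
{
  return lhs.stream < rhs.stream;
}

// e.g. std::map<stream_event, free_list> stream_free_blocks_;
```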
This closes #410, but in a different (better) way than #416. This PR uses thread-local storage to create a unique CUDA event for each thread's default stream in the PTDS case. It also attempts to unify the code paths for PTDS and non-PTDS as much as possible; however, the overhead of calling `cudaEventRecord` on every deallocation proves expensive, so it is currently disabled in the non-PTDS case. (#416 instead creates an event for every freed block and maintains lists of events that must be synced when memory is allocated from a coalesced block. These lists proved to have high performance and code-complexity overhead.)

This PR also makes `pool_memory_resource` thread safe (so `thread_safe_resource_adaptor` is no longer required to use this MR).

TODO: benchmark with a playback of a multithreaded/PTDS log.
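For context, the PTDS mechanism described above can be sketched roughly as follows (a simplified illustration with hypothetical names, not the actual rmm code):

```cpp
#include <cuda_runtime.h>

// Sketch only: one lazily created CUDA event per thread for that thread's
// per-thread default stream.
inline cudaEvent_t thread_default_stream_event()
{
  static thread_local cudaEvent_t event = [] {
    cudaEvent_t e{};
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
    return e;
  }();
  return event;
}

// On deallocation, record the event on the freeing stream so that a later
// allocation of the returned block can wait on this point instead of
// synchronizing the whole device.
inline void record_deallocation(cudaStream_t stream)
{
  cudaEventRecord(thread_default_stream_event(), stream);
}
```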
In benchmarks, I find that when compiled without PTDS support, this version performs nearly the same as before. When compiled with PTDS support but run on a single thread (using only stream 0), performance is degraded by at most 2.4x, and in most cases by at most 2x.
Compiled without `--default-stream-per-thread`, here we compare the performance of the original branch-0.15 `pool_memory_resource` to this PR (numbers are the % increase in run time, so mostly within 6%).

Compiled with `--default-stream-per-thread`:

In comparison to cnmem compiled with PTDS on, pool_mr can be up to 2.3x slower, but on the other hand it can be nearly 2x faster. And I think in real apps it will be better, because cnmem synchronizes the whole device on every allocation when PTDS is enabled, whereas pool_mr only inserts `cudaStreamWaitEvent` calls.
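To illustrate that last point, a minimal sketch of the two synchronization strategies, assuming the stream and event handles already exist:

```cpp
#include <cuda_runtime.h>

// cnmem-style (as described above, when PTDS is enabled): stall the host and
// every stream on the device before handing out memory.
inline void sync_whole_device()
{
  cudaDeviceSynchronize();
}

// pool_mr-style: make only the allocating stream wait, asynchronously, for the
// event recorded when the block was freed on another stream.
inline void wait_on_freeing_stream(cudaStream_t allocating_stream, cudaEvent_t free_event)
{
  cudaStreamWaitEvent(allocating_stream, free_event, 0);
}
```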