Kernel copy for pinned memory #15934
Conversation
 * @param kind Direction of the copy and type of host memory
 * @param stream CUDA stream used for the copy
 */
void cuda_memcpy_async(
Do we want another name for this, given that it does not always call cudaMemcpyAsync? Proposing: cudf_memcpy_async.
(Happy to go either way on this, the status quo is fine.)
I don't like to include cudf in the name when it's already in the cudf namespace. I named it this way to make it obvious that it replaces the use of cudaMemcpyAsync. That said, I could probably be convinced to rename it; I'm not tied to any specific name.
I'm inclined to agree, I don't like duplicating the namespace name in objects already within the namespace. That only encourages bad practices like using declarations to import the namespace members.
 * @param threshold The threshold size in bytes. If the size of the copy is less than this
 * threshold, the copy will be done using kernels. If the size is greater than or equal to this
 * threshold, the copy will be done using cudaMemcpyAsync.
Are there any "magic" sizes where we expect one strategy to outperform the other? (A page size, a multiple of 1 kiB or similar) Or is this purely empirical?
Fair to say that we don't know what the right value is for this (yet?). It's likely to be empirical, since the only goal is to avoid too many copies going through the copy engine.
Let’s do a sweep over threshold values for the next steps where we enable this more broadly. I would like something closer to a microbenchmark (copy back and forth for different sizes with different thresholds?) than the multithreaded Parquet benchmark.
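A minimal sketch of such a sweep, assuming the cuda_memcpy_async wrapper from this PR and a setter named set_kernel_pinned_copy_threshold (both names are assumptions about this PR's API, not confirmed signatures); timing is host-side around a stream sync:

#include <chrono>
#include <cstdio>

#include <rmm/cuda_stream_view.hpp>

// Sketch only: copy back and forth at several sizes for each threshold value.
void threshold_sweep(void* pinned, void* device, rmm::cuda_stream_view stream)
{
  for (std::size_t threshold : {0, 4096, 32768, 262144}) {
    set_kernel_pinned_copy_threshold(threshold);  // assumed setter from this PR
    for (std::size_t size = 1024; size <= (1u << 24); size *= 4) {
      auto const start = std::chrono::steady_clock::now();
      for (int rep = 0; rep < 100; ++rep) {  // H2D then D2H, both pinned
        cuda_memcpy_async(device, pinned, size, host_memory_kind::PINNED, stream);
        cuda_memcpy_async(pinned, device, size, host_memory_kind::PINNED, stream);
      }
      stream.synchronize();
      auto const us = std::chrono::duration_cast<std::chrono::microseconds>(
                        std::chrono::steady_clock::now() - start)
                        .count();
      std::printf("threshold=%zu size=%zu time=%lldus\n",
                  threshold, size, static_cast<long long>(us));
    }
  }
}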
namespace cudf::detail {

enum class copy_kind { PINNED_TO_DEVICE, DEVICE_TO_PINNED, PAGEABLE_TO_DEVICE, DEVICE_TO_PAGEABLE };
I assume we don't care here since I expect this will stay internal, but for user-facing enums we usually provide a storage class.
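For reference, giving the enum a fixed underlying type would look like this:

// Sketch: a fixed underlying type ("storage class") keeps the enum's size and
// ABI stable for user-facing code; <cstdint> provides std::uint8_t.
enum class copy_kind : std::uint8_t {
  PINNED_TO_DEVICE,
  DEVICE_TO_PINNED,
  PAGEABLE_TO_DEVICE,
  DEVICE_TO_PAGEABLE
};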
copy_kind seems somewhat generic, like something that could be in cudf/copying.hpp. Should we be more explicit with something like memcopy_kind?
Sure. It's equivalent to cudaMemcpyKind, so this naming matches better.
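For comparison, the CUDA runtime's own direction enum that memcopy_kind would echo:

// From the CUDA runtime API (driver_types.h)
enum cudaMemcpyKind {
  cudaMemcpyHostToHost     = 0,
  cudaMemcpyHostToDevice   = 1,
  cudaMemcpyDeviceToHost   = 2,
  cudaMemcpyDeviceToDevice = 3,
  cudaMemcpyDefault        = 4  // direction inferred from the pointer values
};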
Renamed to reflect that only the host memory type is specified now.
cpp/src/utilities/cuda_memcpy.cu
Outdated
void copy_pinned_to_device(void* dst,
                           void const* src,
                           std::size_t size,
                           rmm::cuda_stream_view stream)
{
  copy_pinned(dst, src, size, stream);
}

void copy_device_to_pinned(void* dst,
                           void const* src,
                           std::size_t size,
                           rmm::cuda_stream_view stream)
{
  copy_pinned(dst, src, size, stream);
}
Is the purpose of this transparent passthrough just to have a function name that clearly indicates the direction of the transfer? You still have to get the src/dst order correct, though, so does that really help much? It seems duplicative, especially for something in an anonymous namespace inside detail that you're only using internally.
Same for pageable below.
The reason was that I wanted to allow different behavior for h2d and d2h without changing the header. But now that the entire implementation is in the source file, we can simplify this and separate the implementations only when we actually need to.
Agree. I really think you only need one function, no dispatch.
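A minimal sketch of that collapsed single function (the threshold getter name is assumed, not confirmed by this PR):

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/copy.h>

namespace {
// Pinned memory is device-accessible, so thrust::copy_n can move it with a
// kernel in either direction; larger copies still use the copy engine.
void copy_pinned(void* dst, void const* src, std::size_t size, rmm::cuda_stream_view stream)
{
  if (size == 0) { return; }
  if (size < get_kernel_pinned_copy_threshold()) {  // assumed getter name
    thrust::copy_n(rmm::exec_policy_nosync(stream),
                   static_cast<char const*>(src),
                   size,
                   static_cast<char*>(dst));
  } else {
    CUDF_CUDA_TRY(cudaMemcpyAsync(dst, src, size, cudaMemcpyDefault, stream.value()));
  }
}
}  // anonymous namespace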
Can be simplified?
Thinking about this, shouldn't either thrust::copy or cudaMemcpy be responsible for deciding and implementing the fastest copy possible? If not, we should file bugs.
namespace cudf::detail {

enum class copy_kind { PINNED_TO_DEVICE, DEVICE_TO_PINNED, PAGEABLE_TO_DEVICE, DEVICE_TO_PAGEABLE };
Why is copy_kind needed at all? There is exactly one case (pinned, size less than threshold) where you do anything other than pass through to cudaMemcpyAsync. You can detect that case with cudaPointerGetAttributes and call Thrust for that one case, and just call cudaMemcpyAsync(cudaMemcpyDefault) for everything else.
It's possible that we'll eventually have a separate threshold for pageable copies, where we copy to a pinned buffer and then thrust::copy. @abellina had this in the POC implementation, and IIRC it was helpful even with the extra copy.
I understand the current implementation is just a wrapper; I just wanted to leave room for more complex behavior without future changes to the API.
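For context, a staged pageable copy along those lines might look like the sketch below (staging buffer allocated per call for simplicity; a real version would pool or reuse it):

#include <cstring>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/copy.h>

// Sketch only: stage pageable memory into a pinned buffer, then kernel-copy.
void copy_pageable_to_device_staged(void* dst, void const* src, std::size_t size,
                                    rmm::cuda_stream_view stream)
{
  void* staging = nullptr;
  CUDF_CUDA_TRY(cudaMallocHost(&staging, size));   // pinned staging buffer
  std::memcpy(staging, src, size);                 // CPU copy: pageable -> pinned
  thrust::copy_n(rmm::exec_policy_nosync(stream),  // kernel copy: pinned -> device
                 static_cast<char const*>(staging), size, static_cast<char*>(dst));
  stream.synchronize();                            // staging must outlive the kernel
  CUDF_CUDA_TRY(cudaFreeHost(staging));
}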
OK I see. Does direction affect the choice at all? Could reduce 4 to 2 cases?
Reduced to two cases; only the host memory type is specified now.
I can also add an AUTO/DEFAULT option that would call cudaPointerGetAttributes. Let me know what you think.
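The reduced enum with that optional detection mode could look like this (names illustrative, not taken from the PR):

enum class host_memory_kind : std::uint8_t {
  PINNED,    // kernel copy eligible below the threshold
  PAGEABLE,  // always goes through cudaMemcpyAsync (for now)
  DEFAULT    // hypothetical: detect via cudaPointerGetAttributes
};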
cpp/src/utilities/cuda_memcpy.cu
Outdated
if (kind == copy_kind::PINNED_TO_DEVICE) {
  copy_pinned_to_device(dst, src, size, stream);
} else if (kind == copy_kind::DEVICE_TO_PINNED) {
  copy_device_to_pinned(dst, src, size, stream);
} else if (kind == copy_kind::PAGEABLE_TO_DEVICE) {
  copy_pageable_to_device(dst, src, size, stream);
} else if (kind == copy_kind::DEVICE_TO_PAGEABLE) {
  copy_device_to_pageable(dst, src, size, stream);
}
Suggested change:

switch (kind) {
  case copy_kind::PINNED_TO_DEVICE:
  case copy_kind::DEVICE_TO_PINNED:
    copy_pinned(dst, src, size, stream);
    break;
  case copy_kind::PAGEABLE_TO_DEVICE:
  case copy_kind::DEVICE_TO_PAGEABLE:
  default:
    copy_pageable(dst, src, size, stream);
}
but better:

cudaPointerAttributes src_attribs;
CUDF_CUDA_TRY(cudaPointerGetAttributes(&src_attribs, src));
cudaPointerAttributes dst_attribs;
CUDF_CUDA_TRY(cudaPointerGetAttributes(&dst_attribs, dst));
// Kernel copies need device-accessible memory on both sides, so fall back to
// cudaMemcpyAsync whenever either pointer is unregistered (pageable) host memory.
bool const pageable = (src_attribs.type == cudaMemoryTypeUnregistered) or
                      (dst_attribs.type == cudaMemoryTypeUnregistered);
if (not pageable and size < get_kernel_pinned_copy_threshold()) {
  thrust::copy_n(rmm::exec_policy_nosync(stream),
                 static_cast<char const*>(src),
                 size,
                 static_cast<char*>(dst));
} else {
  CUDF_CUDA_TRY(cudaMemcpyAsync(dst, src, size, cudaMemcpyDefault, stream));
}
I was told that cudaPointerGetAttributes is not trivial, so I'm trying to avoid calling it for every copy. Also, FWIW, tying the strategy to the memory type prevents callers from manually overriding the strategy.
The current API is awkward to use when copying from an existing cudf::host_vector, so I'm not sure what the best option is here.
The fastest copy possible depends on the context. The goal here is not to implement a SOL copy, but to reduce the copy engine bottleneck in multi-threaded environments (e.g. Spark).
Thanks for simplifying!
/merge
Description
Issue #15620
Added an API that enables users to set the threshold under which we perform pinned memory copies using a kernel. The default threshold is zero, so there's no change in default behavior.
The API currently only impacts hostdevice_vector H<->D synchronization. The PR adds wrappers for cudaMemcpyAsync so we can implement configurable behavior for pageable copies as well (e.g. copy to pinned + kernel copy).
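A hypothetical usage sketch; the setter name and header path follow this description but are assumptions, not confirmed here:

#include <cudf/utilities/pinned_memory.hpp>  // header path assumed

int main()
{
  // Pinned-memory copies smaller than 32 KiB would go through a kernel instead
  // of the copy engine; the default threshold of 0 keeps the old behavior.
  cudf::set_kernel_pinned_copy_threshold(32 * 1024);  // name assumed from the description
  // ... hostdevice_vector-based I/O as usual ...
  return 0;
}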