Kernel copy for pinned memory #15934
Conversation
 * @param kind Direction of the copy and type of host memory
 * @param stream CUDA stream used for the copy
 */
void cuda_memcpy_async(
Do we want another name for this, given that it does not always call cudaMemcpyAsync? Proposing: cudf_memcpy_async.
(Happy to go either way on this, the status quo is fine.)
I don't like to include cudf in the name when it's already in the cudf namespace. I named it this way to make it obvious that it replaces the use of cudaMemcpyAsync. That said, I could probably be convinced to rename it; I'm not tied to any specific name.
I'm inclined to agree, I don't like duplicating the namespace name in objects already within the namespace. That only encourages bad practices like using declarations to import the namespace members.
 * @param threshold The threshold size in bytes. If the size of the copy is less than this
 * threshold, the copy will be done using kernels. If the size is greater than or equal to this
 * threshold, the copy will be done using cudaMemcpyAsync.
Are there any "magic" sizes where we expect one strategy to outperform the other? (A page size, a multiple of 1 kiB or similar) Or is this purely empirical?
Fair to say that we don't know what the right value is for this (yet?). It's likely to be empirical, since the only goal is to avoid too many copies going through the copy engine.
Let’s do a sweep over threshold values for the next steps where we enable this more broadly. I would like something closer to a microbenchmark (copy back and forth for different sizes with different thresholds?) than the multithreaded Parquet benchmark.
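A minimal sketch of such a sweep, assuming the cuda_memcpy_async wrapper from this PR and a setter named set_kernel_pinned_copy_threshold (both names are assumptions about this PR's API, not confirmed signatures); timing is host-side around a stream sync:

#include <chrono>
#include <cstdio>

#include <rmm/cuda_stream_view.hpp>

// Sketch only: copy back and forth at several sizes for each threshold value.
void threshold_sweep(void* pinned, void* device, rmm::cuda_stream_view stream)
{
  for (std::size_t threshold : {0, 4096, 32768, 262144}) {
    set_kernel_pinned_copy_threshold(threshold);  // assumed setter from this PR
    for (std::size_t size = 1024; size <= (1u << 24); size *= 4) {
      auto const start = std::chrono::steady_clock::now();
      for (int rep = 0; rep < 100; ++rep) {  // H2D then D2H, both pinned
        cuda_memcpy_async(device, pinned, size, host_memory_kind::PINNED, stream);
        cuda_memcpy_async(pinned, device, size, host_memory_kind::PINNED, stream);
      }
      stream.synchronize();
      auto const us = std::chrono::duration_cast<std::chrono::microseconds>(
                        std::chrono::steady_clock::now() - start)
                        .count();
      std::printf("threshold=%zu size=%zu time=%lldus\n",
                  threshold, size, static_cast<long long>(us));
    }
  }
}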
namespace cudf::detail {

enum class copy_kind { PINNED_TO_DEVICE, DEVICE_TO_PINNED, PAGEABLE_TO_DEVICE, DEVICE_TO_PAGEABLE };
I assume we don't care here since I expect this will stay internal, but for user-facing enums we usually provide a storage class.
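For reference, giving the enum a fixed underlying type would look like this:

// Sketch: a fixed underlying type ("storage class") keeps the enum's size and
// ABI stable for user-facing code; <cstdint> provides std::uint8_t.
enum class copy_kind : std::uint8_t {
  PINNED_TO_DEVICE,
  DEVICE_TO_PINNED,
  PAGEABLE_TO_DEVICE,
  DEVICE_TO_PAGEABLE
};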
copy_kind seems somewhat generic, like something that could be in cudf/copying.hpp. Should we be more explicit with something like memcopy_kind?
Sure. It's equivalent to cudaMemcpyKind, so this naming matches better.
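For comparison, the CUDA runtime's own direction enum that memcopy_kind would echo:

// From the CUDA runtime API (driver_types.h)
enum cudaMemcpyKind {
  cudaMemcpyHostToHost     = 0,
  cudaMemcpyHostToDevice   = 1,
  cudaMemcpyDeviceToHost   = 2,
  cudaMemcpyDeviceToDevice = 3,
  cudaMemcpyDefault        = 4  // direction inferred from the pointer values
};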
Renamed to reflect that only the host memory type is specified now.
cpp/src/utilities/cuda_memcpy.cu
Outdated
void copy_pinned_to_device(void* dst,
                           void const* src,
                           std::size_t size,
                           rmm::cuda_stream_view stream)
{
  copy_pinned(dst, src, size, stream);
}

void copy_device_to_pinned(void* dst,
                           void const* src,
                           std::size_t size,
                           rmm::cuda_stream_view stream)
{
  copy_pinned(dst, src, size, stream);
}
Is the purpose of this transparent passthrough just to have a function name that clearly indicates the direction of the transfer? You still have to get the src/dst order correct, though, so does that really help much? It seems duplicative, especially for something in an anonymous namespace inside detail that you're only using internally.
Same for pageable below.
The reason was that I wanted to allow different behavior for h2d and d2h without changing the header. But now that the entire implementation is in the source file, we can simplify this and separate the implementations only when we actually need to.
Agree. I really think you only need one function, no dispatch.
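A minimal sketch of that collapsed single function (the threshold getter name is assumed, not confirmed by this PR):

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/copy.h>

namespace {
// Pinned memory is device-accessible, so thrust::copy_n can move it with a
// kernel in either direction; larger copies still use the copy engine.
void copy_pinned(void* dst, void const* src, std::size_t size, rmm::cuda_stream_view stream)
{
  if (size == 0) { return; }
  if (size < get_kernel_pinned_copy_threshold()) {  // assumed getter name
    thrust::copy_n(rmm::exec_policy_nosync(stream),
                   static_cast<char const*>(src),
                   size,
                   static_cast<char*>(dst));
  } else {
    CUDF_CUDA_TRY(cudaMemcpyAsync(dst, src, size, cudaMemcpyDefault, stream.value()));
  }
}
}  // anonymous namespace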
Can be simplified?
Thinking about this, shouldn't either thrust::copy or cudaMemcpy be responsible for deciding and implementing the fastest copy possible? If not, we should file bugs.
namespace cudf::detail {

enum class copy_kind { PINNED_TO_DEVICE, DEVICE_TO_PINNED, PAGEABLE_TO_DEVICE, DEVICE_TO_PAGEABLE };
Why is copy_kind needed at all? There is exactly one case (pinned, size less than threshold) where you do anything other than pass through to cudaMemcpyAsync. You can detect that case with cudaPointerGetAttributes and call Thrust for that one case, and just call cudaMemcpyAsync(cudaMemcpyDefault) for everything else.
It's possible that we'll eventually have a separate threshold for pageable copies, where we copy to a pinned buffer and then thrust::copy. @abellina had this in the POC implementation, and IIRC it was helpful even with the extra copy.
I understand the current implementation is just a wrapper; I just wanted to leave room for more complex behavior without future changes to the API.
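For context, a staged pageable copy along those lines might look like the sketch below (staging buffer allocated per call for simplicity; a real version would pool or reuse it):

#include <cstring>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/copy.h>

// Sketch only: stage pageable memory into a pinned buffer, then kernel-copy.
void copy_pageable_to_device_staged(void* dst, void const* src, std::size_t size,
                                    rmm::cuda_stream_view stream)
{
  void* staging = nullptr;
  CUDF_CUDA_TRY(cudaMallocHost(&staging, size));   // pinned staging buffer
  std::memcpy(staging, src, size);                 // CPU copy: pageable -> pinned
  thrust::copy_n(rmm::exec_policy_nosync(stream),  // kernel copy: pinned -> device
                 static_cast<char const*>(staging), size, static_cast<char*>(dst));
  stream.synchronize();                            // staging must outlive the kernel
  CUDF_CUDA_TRY(cudaFreeHost(staging));
}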
OK I see. Does direction affect the choice at all? Could reduce 4 to 2 cases?
Reduced to two cases; only the host memory type is specified now.
I can also add an AUTO/DEFAULT option that would call cudaPointerGetAttributes. Let me know what you think.
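The reduced enum with that optional detection mode could look like this (names illustrative, not taken from the PR):

enum class host_memory_kind : std::uint8_t {
  PINNED,    // kernel copy eligible below the threshold
  PAGEABLE,  // always goes through cudaMemcpyAsync (for now)
  DEFAULT    // hypothetical: detect via cudaPointerGetAttributes
};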
cpp/src/utilities/cuda_memcpy.cu
Outdated
if (kind == copy_kind::PINNED_TO_DEVICE) {
  copy_pinned_to_device(dst, src, size, stream);
} else if (kind == copy_kind::DEVICE_TO_PINNED) {
  copy_device_to_pinned(dst, src, size, stream);
} else if (kind == copy_kind::PAGEABLE_TO_DEVICE) {
  copy_pageable_to_device(dst, src, size, stream);
} else if (kind == copy_kind::DEVICE_TO_PAGEABLE) {
  copy_device_to_pageable(dst, src, size, stream);
}
Suggested change:

switch (kind) {
  case copy_kind::PINNED_TO_DEVICE:
  case copy_kind::DEVICE_TO_PINNED:
    copy_pinned(dst, src, size, stream);
    break;
  case copy_kind::PAGEABLE_TO_DEVICE:
  case copy_kind::DEVICE_TO_PAGEABLE:
  default:
    copy_pageable(dst, src, size, stream);
}
but better:

cudaPointerAttributes src_attribs;
CUDF_CUDA_TRY(cudaPointerGetAttributes(&src_attribs, src));
cudaPointerAttributes dst_attribs;
CUDF_CUDA_TRY(cudaPointerGetAttributes(&dst_attribs, dst));
// Kernel copies need device-accessible memory on both sides, so fall back to
// cudaMemcpyAsync whenever either pointer is unregistered (pageable) host memory.
bool const pageable = (src_attribs.type == cudaMemoryTypeUnregistered) or
                      (dst_attribs.type == cudaMemoryTypeUnregistered);
if (not pageable and size < get_kernel_pinned_copy_threshold()) {
  thrust::copy_n(rmm::exec_policy_nosync(stream),
                 static_cast<char const*>(src),
                 size,
                 static_cast<char*>(dst));
} else {
  CUDF_CUDA_TRY(cudaMemcpyAsync(dst, src, size, cudaMemcpyDefault, stream));
}
I was told that cudaPointerGetAttributes is not trivial, so I'm trying to avoid calling it for every copy. Also, FWIW, tying the strategy to the memory type prevents callers from manually overriding the strategy.
The current API is awkward to use when copying from an existing cudf::host_vector, so I'm not sure what the best option is here.
The fastest copy possible depends on the context. The goal here is not to implement a SOL copy, but to reduce the copy engine bottleneck in multi-threaded environments (e.g. Spark).
Thanks for simplifying!
/merge
Description
Issue #15620
Added an API that enables users to set the threshold under which we perform pinned memory copies using a kernel. The default threshold is zero, so there's no change in default behavior.
The API currently only impacts hostdevice_vector H<->D synchronization. The PR adds wrappers for cudaMemcpyAsync so we can implement configurable behavior for pageable copies as well (e.g. copy to pinned + kernel copy).
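A hypothetical usage sketch; the setter name and header path follow this description but are assumptions, not confirmed here:

#include <cudf/utilities/pinned_memory.hpp>  // header path assumed

int main()
{
  // Pinned-memory copies smaller than 32 KiB would go through a kernel instead
  // of the copy engine; the default threshold of 0 keeps the old behavior.
  cudf::set_kernel_pinned_copy_threshold(32 * 1024);  // name assumed from the description
  // ... hostdevice_vector-based I/O as usual ...
  return 0;
}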