
Extend device_scalar to optionally use pinned bounce buffer #16947

Merged
merged 33 commits into from
Oct 18, 2024

Conversation

vuule
Contributor

@vuule vuule commented Sep 27, 2024

Description

Depends on #16945

Added cudf::detail::device_scalar, derived from rmm::device_scalar. The new class overrides the member functions that perform copies between host and device. The new implementation uses a cudf::detail::host_vector as a bounce buffer to avoid performing a pageable copy.

Replaced rmm::device_scalar with cudf::detail::device_scalar across libcudf.
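
The mechanism can be sketched with a host-only toy model (illustrative names only, not the actual libcudf code): a base scalar that copies directly, standing in for rmm::device_scalar's pageable path, and a derived class that stages writes through a one-element bounce buffer, as the real class does with a cudf::detail::host_vector.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy host-only model of the pattern. All names are illustrative.
template <typename T>
class base_scalar {
 public:
  explicit base_scalar(T const& v) : value_{v} {}
  virtual ~base_scalar() = default;
  // Stands in for rmm::device_scalar's direct ("pageable") copy.
  virtual void set_value_async(T v) { value_ = std::move(v); }
  T value() const { return value_; }

 protected:
  T value_;
};

template <typename T>
class bounced_scalar : public base_scalar<T> {
 public:
  explicit bounced_scalar(T const& v) : base_scalar<T>{v}, bounce_buffer_(1) {}
  // Override: stage through the bounce buffer first, then copy to the
  // destination. With a real pinned buffer this turns a pageable H2D
  // copy into a pinned one.
  void set_value_async(T v) override
  {
    bounce_buffer_[0] = std::move(v);       // buffer owns the bytes now
    this->value_      = bounce_buffer_[0];  // stands in for cudaMemcpyAsync
  }

 private:
  std::vector<T> bounce_buffer_;  // stands in for cudf::detail::host_vector<T>
};
```

In libcudf the second step is a cudaMemcpyAsync from the (possibly pinned) buffer; here it is a plain assignment so the sketch runs anywhere.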

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 27, 2024
@vuule vuule changed the title device_scalar that optionally uses pinned bounce buffer device_scalar that optionally uses pinned bounce buffer Sep 27, 2024
@vuule vuule changed the title device_scalar that optionally uses pinned bounce buffer Extended device_scalar to optionally uses pinned bounce buffer Sep 27, 2024
@vuule vuule changed the title Extended device_scalar to optionally uses pinned bounce buffer Extended device_scalar to optionally use pinned bounce buffer Sep 27, 2024
@vuule vuule self-assigned this Sep 27, 2024
@vuule vuule added non-breaking Non-breaking change feature request New feature or request Performance Performance related issue labels Sep 27, 2024
@vuule vuule changed the title Extended device_scalar to optionally use pinned bounce buffer Extend device_scalar to optionally use pinned bounce buffer Sep 28, 2024

void set_value_async(T&& value, rmm::cuda_stream_view stream)
{
  bounce_buffer[0] = std::move(value);
Contributor Author

@vuule vuule Sep 30, 2024


A bonus feature of having a bounce buffer: we don't need to worry about the value's lifetime. rmm::device_scalar prohibits passing an rvalue here, but we don't need to.
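
The lifetime point can be illustrated with a small host-only model (hedged: the names are invented, and a queued std::function stands in for work enqueued on a CUDA stream): because the bounce buffer takes ownership of the value immediately, the deferred copy never reads the caller's argument, so passing a temporary is safe even though the copy runs later.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Illustrative model, not libcudf code.
struct scalar_with_bounce {
  std::string bounce;            // owned staging storage (pinned in libcudf)
  std::string device;            // stands in for device memory
  std::function<void()> queued;  // stands in for work enqueued on a stream

  void set_value_async(std::string&& v)
  {
    bounce = std::move(v);                 // take ownership immediately
    queued = [this] { device = bounce; };  // deferred copy reads OUR buffer,
                                           // never the caller's argument
  }
  void stream_sync() { queued(); }
};
```

Without the staging copy, the deferred work would have to read the caller's argument, which may be a temporary that is long gone by the time the stream reaches the copy; that is why rmm::device_scalar has to forbid rvalues.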

@davidwendt
Contributor

Why the global change? Could we still use rmm::device_scalar when not using a pinned bounce buffer?

@vuule
Contributor Author

vuule commented Oct 2, 2024

Why the global change? Could we still use rmm::device_scalar when not using a pinned bounce buffer?

(discussed offline, posting here for visibility)
We're trying to avoid any pageable copies because they cause copy engine contention in multi-threaded use cases. Using pinned memory in the bounce_buffer would make this a pinned copy, which is slightly better in general. In addition, some users can choose to perform small copies using a kernel to further avoid the copy engine in their multi-threaded applications.
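
The three copy paths being discussed can be sketched as follows (a sketch under assumptions, not libcudf code: a 64-bit system with unified addressing, so cudaMallocHost memory is directly device-accessible; names and structure are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void copy_one(int const* src, int* dst) { *dst = *src; }

void set_scalar(int* d_value, int host_value, cudaStream_t stream)
{
  // 1) Pageable copy: &host_value is ordinary stack memory, so the driver
  //    must stage it internally; under multi-threaded use these copies
  //    contend for the copy engine.
  cudaMemcpyAsync(d_value, &host_value, sizeof(int),
                  cudaMemcpyHostToDevice, stream);

  // 2) Pinned bounce buffer: copy into page-locked memory first, then a
  //    true async copy; still uses the copy engine, but no hidden staging.
  static int* pinned = nullptr;
  if (pinned == nullptr) { cudaMallocHost(&pinned, sizeof(int)); }
  *pinned = host_value;
  cudaMemcpyAsync(d_value, pinned, sizeof(int),
                  cudaMemcpyHostToDevice, stream);

  // 3) Kernel copy: read the pinned value from a kernel, bypassing the
  //    copy engine entirely, which some multi-threaded users prefer for
  //    small copies.
  copy_one<<<1, 1, 0, stream>>>(pinned, d_value);
}
```

A real implementation would pick exactly one of these paths per copy; they are shown back-to-back only to contrast them.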

@vuule vuule marked this pull request as ready for review October 9, 2024 18:06
@vuule vuule requested a review from a team as a code owner October 9, 2024 18:06
@vuule vuule requested review from vyasr and lamarrr October 9, 2024 18:06
Resolved review threads: cpp/src/join/distinct_hash_join.cu (outdated), cpp/include/cudf/detail/device_scalar.hpp
Contributor

@vyasr vyasr left a comment


The new class seems fine, but do we really need to use it everywhere? Does this mean that any usage of rmm::device_scalar in cudf is now forbidden because it could introduce unexpected performance overheads? If so, should we include a pre-commit hook or something to that effect to enforce that? Also, if this implication is correct @davidwendt may want to weigh in.

The title says "optionally use pinned bounce buffer", but this usage looks unconditional. Is the optionality encoded in the host vector, or is it simply no longer optional?

@vuule
Contributor Author

vuule commented Oct 10, 2024

The title says "optionally use pinned bounce buffer", but this usage looks unconditional. Is the optionality encoded in the host vector, or is it simply no longer optional?

The use of a bounce buffer is unconditional, but host_vector, which is used as a bounce buffer, optionally uses pinned memory.
There's no perf impact when the bounce buffer is pageable. I'll evaluate the impact of pinned memory once we're closer to eliminating all "unconditional" pageable memory use (for memory that ends up on the GPU, or is copied into from the GPU).
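
How that optionality might be surfaced can be sketched with a toy (not the libcudf implementation; "pinned" is simulated by a flag on a stateful allocator so the sketch runs without CUDA): the vector's allocator is chosen once at construction, so callers of a factory function transparently get pinned staging memory when a pinned pool is configured and pageable memory otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stateful allocator; a real one would allocate page-locked memory
// when `pinned` is true. Names are illustrative.
template <typename T>
struct tracking_allocator {
  using value_type = T;
  bool pinned = false;

  tracking_allocator() = default;
  explicit tracking_allocator(bool p) : pinned{p} {}
  template <typename U>
  tracking_allocator(tracking_allocator<U> const& o) : pinned{o.pinned} {}

  T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
  void deallocate(T* p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }

  template <typename U>
  bool operator==(tracking_allocator<U> const&) const { return true; }
  template <typename U>
  bool operator!=(tracking_allocator<U> const&) const { return false; }
};

template <typename T>
using host_vector = std::vector<T, tracking_allocator<T>>;

// Picks the allocator in one place, so callers get pinned staging memory
// "for free" whenever it is available.
template <typename T>
host_vector<T> make_host_vector(std::size_t n, bool pinned_pool_available)
{
  return host_vector<T>(n, T{}, tracking_allocator<T>{pinned_pool_available});
}
```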

@vuule
Contributor Author

vuule commented Oct 10, 2024

The new class seems fine, but do we really need to use it everywhere? Does this mean that any usage of rmm::device_scalar in cudf is now forbidden because it could introduce unexpected performance overheads?

Pretty much. My aim is to stop using rmm::device_scalar outside of public APIs. Same goes for std::vector and thrust::host_vector (if copied to/from GPU). I don't know if we can automate this.

Contributor

@vyasr vyasr left a comment


Approving based on some extensive offline discussion. We're going to move forward with this and see how this type works in libcudf, then consider upstreaming it to rmm if it's generalizable; if it's not, we'll look into developing clear guidelines for when it should be used in cudf.

: rmm::device_scalar<T>{std::move(other)}, bounce_buffer{std::move(other.bounce_buffer)}
{
}
device_scalar& operator=(device_scalar&&) noexcept = default;
Contributor


Curious why the move ctor required code but this did not?

Contributor Author


Default implementations should be fine in both cases; they compiled fine on 12.5 🤷
I suspect it's an 11.8 compiler bug, but I really didn't want to dig into it when a handy workaround was available.
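
The pattern under discussion, in isolation (illustrative names, not the libcudf code): the move constructor is written out so the bounce buffer is moved alongside the base subobject, while move assignment stays defaulted. The explicit constructor here is semantically identical to `= default`; it mirrors the PR's workaround for the apparent 11.8 compiler issue.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stands in for rmm::device_scalar<T>.
struct base {
  int v = 0;
  base() = default;
  base(base&&) noexcept = default;
  base& operator=(base&&) noexcept = default;
};

// Stands in for cudf::detail::device_scalar<T>.
struct derived : base {
  std::vector<int> bounce_buffer;

  derived() = default;
  // Explicit, but does exactly what "= default" would: move the base
  // subobject, then move the bounce buffer.
  derived(derived&& other) noexcept
    : base{std::move(other)}, bounce_buffer{std::move(other.bounce_buffer)}
  {
  }
  // Defaulted move assignment performs the same member-wise moves.
  derived& operator=(derived&&) noexcept = default;
};
```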

@vuule
Contributor Author

vuule commented Oct 18, 2024

/merge

@rapids-bot rapids-bot bot merged commit 98eef67 into rapidsai:branch-24.12 Oct 18, 2024
102 checks passed
@vuule vuule deleted the fea-pinned-aware-device_scalar branch October 18, 2024 21:55
4 participants