
Extend device_scalar to optionally use pinned bounce buffer #16947

Merged
merged 33 commits into from
Oct 18, 2024

Conversation

vuule
Contributor

@vuule vuule commented Sep 27, 2024

Description

Depends on #16945

Added cudf::detail::device_scalar, derived from rmm::device_scalar. The new class overrides the member functions that perform copies between host and device. The new implementation uses a cudf::detail::host_vector as a bounce buffer to avoid performing a pageable copy.

Replaced rmm::device_scalar with cudf::detail::device_scalar across libcudf.
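
The mechanism can be sketched with a host-only toy model (illustrative names only, not the actual libcudf code): a base scalar that copies directly, standing in for rmm::device_scalar's pageable path, and a derived class that stages writes through a one-element bounce buffer, as the real class does with a cudf::detail::host_vector.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy host-only model of the pattern. All names are illustrative.
template <typename T>
class base_scalar {
 public:
  explicit base_scalar(T const& v) : value_{v} {}
  virtual ~base_scalar() = default;
  // Stands in for rmm::device_scalar's direct ("pageable") copy.
  virtual void set_value_async(T v) { value_ = std::move(v); }
  T value() const { return value_; }

 protected:
  T value_;
};

template <typename T>
class bounced_scalar : public base_scalar<T> {
 public:
  explicit bounced_scalar(T const& v) : base_scalar<T>{v}, bounce_buffer_(1) {}
  // Override: stage through the bounce buffer first, then copy to the
  // destination. With a real pinned buffer this turns a pageable H2D
  // copy into a pinned one.
  void set_value_async(T v) override
  {
    bounce_buffer_[0] = std::move(v);       // buffer owns the bytes now
    this->value_      = bounce_buffer_[0];  // stands in for cudaMemcpyAsync
  }

 private:
  std::vector<T> bounce_buffer_;  // stands in for cudf::detail::host_vector<T>
};
```

In libcudf the second step is a cudaMemcpyAsync from the (possibly pinned) buffer; here it is a plain assignment so the sketch runs anywhere.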

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 27, 2024
@vuule vuule changed the title device_scalar that optionally uses pinned bounce buffer device_scalar that optionally uses pinned bounce buffer Sep 27, 2024
@vuule vuule changed the title device_scalar that optionally uses pinned bounce buffer Extended device_scalar to optionally uses pinned bounce buffer Sep 27, 2024
@vuule vuule changed the title Extended device_scalar to optionally uses pinned bounce buffer Extended device_scalar to optionally use pinned bounce buffer Sep 27, 2024
@vuule vuule self-assigned this Sep 27, 2024
@vuule vuule added non-breaking Non-breaking change feature request New feature or request Performance Performance related issue labels Sep 27, 2024
@vuule vuule changed the title Extended device_scalar to optionally use pinned bounce buffer Extend device_scalar to optionally use pinned bounce buffer Sep 28, 2024

void set_value_async(T&& value, rmm::cuda_stream_view stream)
{
  bounce_buffer[0] = std::move(value);
Contributor Author

@vuule vuule Sep 30, 2024


A bonus feature of having a bounce buffer: we don't need to worry about the value's lifetime. rmm::device_scalar prohibits passing an rvalue here, but we don't need to.
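
The lifetime point can be illustrated with a small host-only model (hedged: the names are invented, and a queued std::function stands in for work enqueued on a CUDA stream): because the bounce buffer takes ownership of the value immediately, the deferred copy never reads the caller's argument, so passing a temporary is safe even though the copy runs later.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Illustrative model, not libcudf code.
struct scalar_with_bounce {
  std::string bounce;            // owned staging storage (pinned in libcudf)
  std::string device;            // stands in for device memory
  std::function<void()> queued;  // stands in for work enqueued on a stream

  void set_value_async(std::string&& v)
  {
    bounce = std::move(v);                 // take ownership immediately
    queued = [this] { device = bounce; };  // deferred copy reads OUR buffer,
                                           // never the caller's argument
  }
  void stream_sync() { queued(); }
};
```

Without the staging copy, the deferred work would have to read the caller's argument, which may be a temporary that is long gone by the time the stream reaches the copy; that is why rmm::device_scalar has to forbid rvalues.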

@davidwendt
Contributor

Why the global change? Could we still use rmm::device_scalar when not using a pinned bounce buffer?

@vuule
Contributor Author

vuule commented Oct 2, 2024

Why the global change? Could we still use rmm::device_scalar when not using a pinned bounce buffer?

(discussed offline, posting here for visibility)
We're trying to avoid any pageable copies because they cause copy engine contention in multi-threaded use cases. Using pinned memory in the bounce_buffer would make this a pinned copy, which is slightly better in general. In addition, some users can choose to perform small copies using a kernel to further avoid the copy engine in their multi-threaded applications.
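
The three copy paths being discussed can be sketched as follows (a sketch under assumptions, not libcudf code: a 64-bit system with unified addressing, so cudaMallocHost memory is directly device-accessible; names and structure are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void copy_one(int const* src, int* dst) { *dst = *src; }

void set_scalar(int* d_value, int host_value, cudaStream_t stream)
{
  // 1) Pageable copy: &host_value is ordinary stack memory, so the driver
  //    must stage it internally; under multi-threaded use these copies
  //    contend for the copy engine.
  cudaMemcpyAsync(d_value, &host_value, sizeof(int),
                  cudaMemcpyHostToDevice, stream);

  // 2) Pinned bounce buffer: copy into page-locked memory first, then a
  //    true async copy; still uses the copy engine, but no hidden staging.
  static int* pinned = nullptr;
  if (pinned == nullptr) { cudaMallocHost(&pinned, sizeof(int)); }
  *pinned = host_value;
  cudaMemcpyAsync(d_value, pinned, sizeof(int),
                  cudaMemcpyHostToDevice, stream);

  // 3) Kernel copy: read the pinned value from a kernel, bypassing the
  //    copy engine entirely, which some multi-threaded users prefer for
  //    small copies.
  copy_one<<<1, 1, 0, stream>>>(pinned, d_value);
}
```

A real implementation would pick exactly one of these paths per copy; they are shown back-to-back only to contrast them.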

@vuule vuule marked this pull request as ready for review October 9, 2024 18:06
@vuule vuule requested a review from a team as a code owner October 9, 2024 18:06
@vuule vuule requested review from vyasr and lamarrr October 9, 2024 18:06
Resolved review threads: cpp/src/join/distinct_hash_join.cu (outdated), cpp/include/cudf/detail/device_scalar.hpp
Contributor

@vyasr vyasr left a comment


The new class seems fine, but do we really need to use it everywhere? Does this mean that any usage of rmm::device_scalar in cudf is now forbidden because it could introduce unexpected performance overheads? If so, should we include a pre-commit hook or something to that effect to enforce that? Also, if this implication is correct @davidwendt may want to weigh in.

The title says "optionally use pinned bounce buffer", but this usage looks unconditional. Is the optionality encoded in the host vector, or is it simply no longer optional?

@vuule
Contributor Author

vuule commented Oct 10, 2024

The title says "optionally use pinned bounce buffer", but this usage looks unconditional. Is the optionality encoded in the host vector, or is it simply no longer optional?

The use of a bounce buffer is unconditional, but host_vector, which is used as a bounce buffer, optionally uses pinned memory.
There's no perf impact when the bounce buffer is pageable. I'll evaluate the impact of pinned memory once we're closer to eliminating all "unconditional" pageable memory use (for memory that ends up on the GPU, or is copied into from the GPU).
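
How that optionality might be surfaced can be sketched with a toy (not the libcudf implementation; "pinned" is simulated by a flag on a stateful allocator so the sketch runs without CUDA): the vector's allocator is chosen once at construction, so callers of a factory function transparently get pinned staging memory when a pinned pool is configured and pageable memory otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stateful allocator; a real one would allocate page-locked memory
// when `pinned` is true. Names are illustrative.
template <typename T>
struct tracking_allocator {
  using value_type = T;
  bool pinned = false;

  tracking_allocator() = default;
  explicit tracking_allocator(bool p) : pinned{p} {}
  template <typename U>
  tracking_allocator(tracking_allocator<U> const& o) : pinned{o.pinned} {}

  T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
  void deallocate(T* p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }

  template <typename U>
  bool operator==(tracking_allocator<U> const&) const { return true; }
  template <typename U>
  bool operator!=(tracking_allocator<U> const&) const { return false; }
};

template <typename T>
using host_vector = std::vector<T, tracking_allocator<T>>;

// Picks the allocator in one place, so callers get pinned staging memory
// "for free" whenever it is available.
template <typename T>
host_vector<T> make_host_vector(std::size_t n, bool pinned_pool_available)
{
  return host_vector<T>(n, T{}, tracking_allocator<T>{pinned_pool_available});
}
```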

@vuule
Contributor Author

vuule commented Oct 10, 2024

The new class seems fine, but do we really need to use it everywhere? Does this mean that any usage of rmm::device_scalar in cudf is now forbidden because it could introduce unexpected performance overheads?

Pretty much. My aim is to stop using rmm::device_scalar outside of public APIs. Same goes for std::vector and thrust::host_vector (if copied to/from GPU). I don't know if we can automate this.

Contributor

@vyasr vyasr left a comment


Approving based on some extensive offline discussion. We're going to move forward with this and see how this type works in libcudf, then consider upstreaming it to rmm if it's generalizable; if it's not, we'll look into developing clear guidelines for when it should be used in cudf.

: rmm::device_scalar<T>{std::move(other)}, bounce_buffer{std::move(other.bounce_buffer)}
{
}
device_scalar& operator=(device_scalar&&) noexcept = default;
Contributor


Curious why the move ctor required code but this did not?

Contributor Author


Default implementations should be fine in both cases; they compiled fine on 12.5 🤷
I suspect it's an 11.8 compiler bug, but I really didn't want to dig into it when a handy workaround was available.
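
The pattern under discussion, in isolation (illustrative names, not the libcudf code): the move constructor is written out so the bounce buffer is moved alongside the base subobject, while move assignment stays defaulted. The explicit constructor here is semantically identical to `= default`; it mirrors the PR's workaround for the apparent 11.8 compiler issue.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stands in for rmm::device_scalar<T>.
struct base {
  int v = 0;
  base() = default;
  base(base&&) noexcept = default;
  base& operator=(base&&) noexcept = default;
};

// Stands in for cudf::detail::device_scalar<T>.
struct derived : base {
  std::vector<int> bounce_buffer;

  derived() = default;
  // Explicit, but does exactly what "= default" would: move the base
  // subobject, then move the bounce buffer.
  derived(derived&& other) noexcept
    : base{std::move(other)}, bounce_buffer{std::move(other.bounce_buffer)}
  {
  }
  // Defaulted move assignment performs the same member-wise moves.
  derived& operator=(derived&&) noexcept = default;
};
```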

@vuule
Contributor Author

vuule commented Oct 18, 2024

/merge

@rapids-bot rapids-bot bot merged commit 98eef67 into rapidsai:branch-24.12 Oct 18, 2024
102 checks passed
@vuule vuule deleted the fea-pinned-aware-device_scalar branch October 18, 2024 21:55
4 participants