
Replace direct cudaMemcpyAsync calls with utility functions (limited to cudf::io) #17132

Merged
merged 13 commits into rapidsai:branch-24.12 on Oct 23, 2024

Conversation

@vuule (Contributor) commented Oct 18, 2024

Description

Issue #15620

Replaced the calls to cudaMemcpyAsync with the new cuda_memcpy/cuda_memcpy_async utility, which optionally avoids using the copy engine. Changes are limited to cuIO to make the PR easier to review (repetitive enough as-is!).

Also took the opportunity to use cudf::detail::host_vector and its factories to enable wider pinned memory use.

Skipped a few instances of cudaMemcpyAsync; a few are under io::comp, where we don't want to invest further (if possible). The other cudaMemcpyAsync instances are D2D copies, which cuda_memcpy/cuda_memcpy_async don't support. Perhaps they should, just to make their use ubiquitous.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 18, 2024
@vuule vuule self-assigned this Oct 18, 2024
@vuule vuule added Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 18, 2024
@@ -218,7 +218,7 @@ void generate_depth_remappings(
*/
[[nodiscard]] std::future<void> read_column_chunks_async(
std::vector<std::unique_ptr<datasource>> const& sources,
std::vector<std::unique_ptr<datasource::buffer>>& page_data,
cudf::host_span<rmm::device_buffer> page_data,
@vuule (Contributor, author) commented:
simplified outdated complexity

@vuule vuule changed the title Replace cudaMemcpyAsync calls with cuda_memcpy_async Replace direct cudaMemcpyAsync calls with utility functions Oct 18, 2024
@vuule vuule changed the title Replace direct cudaMemcpyAsync calls with utility functions Replace direct cudaMemcpyAsync calls with utility functions (limited to cudf::io) Oct 18, 2024
std::pair(source_ptr->device_read_async(
read_info.offset, read_info.length, dst_base + read_info.dst_pos, _stream),
read_info.length));
device_read_tasks.emplace_back(
@vuule (Contributor, author) commented:
Unrelated change; noticed clang-tidy complaining that we used to make an unnecessary move here :)

@vuule vuule marked this pull request as ready for review October 22, 2024 22:11
@vuule vuule requested a review from a team as a code owner October 22, 2024 22:11
@ttnghia (Contributor) commented Oct 23, 2024

Have you run any benchmarks to make sure there is no regression?

@vuule (Contributor, author) commented Oct 23, 2024

Have you run any benchmarks to make sure there is no regression?

I haven't, because we currently don't do anything differently - we still end up calling cudaMemcpyAsync on a pageable buffer.
I'll run all benchmarks once we actually move toward setting allocate_host_as_pinned_threshold and/or kernel_pinned_copy_threshold.

@@ -87,8 +87,10 @@ class datasource_chunk_reader : public data_chunk_reader {
_source->host_read(_offset, read_size, reinterpret_cast<uint8_t*>(h_ticket.buffer.data()));

// copy the host-pinned data on to device
CUDF_CUDA_TRY(cudaMemcpyAsync(
chunk.data(), h_ticket.buffer.data(), read_size, cudaMemcpyDefault, stream.value()));
cudf::detail::cuda_memcpy_async<char>(
@pmattione-nvidia (Contributor) commented Oct 23, 2024:
There are a number of places where the template argument (char) is given explicitly ... is the compiler really not able to deduce it from the inputs?

@vuule (Contributor, author) commented:
Hm, maybe it wasn't required here.
In general, the compiler can't combine template type deduction with an implicit conversion. So passing a container that gets implicitly converted to a span requires spelling out the template type for cuda_memcpy_async.

@vuule (Contributor, author) commented Oct 23, 2024

/merge

@rapids-bot rapids-bot bot merged commit deb9af4 into rapidsai:branch-24.12 Oct 23, 2024
122 checks passed