Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] Remove stream sync in concatenate for better pipelining #17172

Closed
tgujar opened this issue Oct 24, 2024 · 1 comment · Fixed by #17584
Closed

[PERF] Remove stream sync in concatenate for better pipelining #17172

tgujar opened this issue Oct 24, 2024 · 1 comment · Fixed by #17584
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@tgujar
Copy link
Contributor

tgujar commented Oct 24, 2024

Concatenate function uses thrust::copy in its implementation but throws away the return value from thrust::copy. Since thrust::copy needs to return an iterator it leads to an unnecessary stream sync.

thrust::copy(

@wence-
Copy link
Contributor

wence- commented Oct 24, 2024

Note that it does not suffice to use the nosync exec policy for such thrust calls, since whenever thrust returns a host-side value, even with the nosync exec policy it must sync.

There are numerous places (I have not audited them) where we make such calls but don't actually need the sync. So we should, with benchmarking evidence, consider either implementing non-syncing implementations in libcudf, or upstreaming a CCCL request for such features so that everyone can get them.

See also the nosync exec policy tracking issue: #12086

@PointKernel PointKernel added libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function Performance Performance related issue labels Oct 24, 2024
rapids-bot bot pushed a commit that referenced this issue Dec 13, 2024
Replacing `thrust::copy` with `cudaMemcpyAsync` improves performance upto 2x in specific cases in `cudf::concatenate`
The `thrust::copy` does a sync for device-to-device copy though it is not necessary.  Using `rmm::exec_policy_nosync` had no effect. Will work with CCCL to determine if this is a bug in `thrust::copy` since computing the return value does not require a sync.

Also moved the benchmark for concatenate from googlebench to nvbench.

Closes #17172

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #17584
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants