-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Migrate to Thrust nosync
stream policy for performance.
#12086
Labels
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Performance
Performance related issue
Comments
bdice
added
feature request
New feature or request
Needs Triage
Need team to review and classify
labels
Nov 7, 2022
bdice
added
libcudf
Affects libcudf (C++/CUDA) code.
and removed
Needs Triage
Need team to review and classify
labels
Nov 7, 2022
3 tasks
3 tasks
3 tasks
3 tasks
rapids-bot bot
pushed a commit
that referenced
this issue
Oct 23, 2024
Closes #17117 Related to #12086 This PR replaces the synchronous execution policy with an asynchronous one to eliminate unnecessary synchronization. Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - David Wendt (https://github.com/davidwendt) - Shruti Shivakumar (https://github.com/shrshi) - Jason Lowe (https://github.com/jlowe) - Nghia Truong (https://github.com/ttnghia) URL: #17146
3 tasks
rapids-bot bot
pushed a commit
that referenced
this issue
Dec 3, 2024
Part of #12086 Replaced `rmm::exec_policy` with `rmm::exec_policy_nosync` in write_json Authors: - Karthikeyan (https://github.com/karthikeyann) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Shruti Shivakumar (https://github.com/shrshi) - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #17445
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Performance
Performance related issue
Motivation & Description
Thrust 1.16 added an execution policy
thrust::cuda::par_nosync
that removes unnecessary internal stream synchronizations, except when required for correctness (e.g. if the algorithm returns a value to the host likethrust::reduce
, a sync is required).In PR #11577, I experimented with a bulk find-replace of
rmm::exec_policy
withrmm::exec_policy_nosync
. This resulted in notable performance improvements, particularly for small data sizes. See: https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70However, a plain find-replace may leave issues with stream safety. For example, an internal detail API could be constructing host memory whose lifetime doesn't guarantee a safe copy to device without a stream sync before returning (thought experiment identified by @jrhemstad). Changing to
nosync
execution policies requires analysis of every use case individually, at both the detail and public API levels.(As a reminder of the current stream policy: libcudf APIs called on the host do not guarantee that the stream is synchronized before returning, but this does not mean we can always use
nosync
safely.)Tasks
I am planning to open PRs to use
nosync
across the libcudf codebase. Below is a list of benchmarks that showed improvements. I will need to analyze each benchmark to identify the primary algorithms being called that should be refactored to usenosync
. This list is loosely sorted by largest impact for small data sizes (or whatever fixed data sizes are in the benchmark), which indicates overhead that we can systematically eliminate. Note that these performance improvements usenosync
everywhere, which may not be stream-safe in all cases, so the real performance gains may be lower if not all executions can usenosync
. Additionally, the performance improvements for a given algorithm may rely on improvements in other algorithms, so the full improvement may not be achieved until all tasks are complete.Other notes:
Further work
After addressing the major tasks above where we have clearly identified speedups resulting from
nosync
policies, I will re-assess the rest of the code base to evaluate a broader replacement to usenosync
policies everywhere in libcudf.The text was updated successfully, but these errors were encountered: