[Experimental] Use nosync policy for Thrust calls. #11577
Conversation
Historically, I believe we decided that libcudf shouldn't do any stream synchronization except what is required for correctness, or when returning data to host memory that typical CPU code could erroneously operate on without synchronization. In general I'm an enthusiastic +1 to the idea of this PR.
Codecov Report

Coverage diff against branch-22.10 (#11577):

| Metric   | Base | This PR |
|----------|------|---------|
| Coverage | ?    | 86.40%  |
| Files    | ?    | 145     |
| Lines    | ?    | 22958   |
| Branches | ?    | 0       |
| Hits     | ?    | 19837   |
| Misses   | ?    | 3121    |
| Partials | ?    | 0       |

☔ View full report at Codecov.
…11600) This PR is derived from changes I made in #11577 while attempting to consolidate stream handling in public APIs. During that refactoring, I noticed three repeated problems across libcudf APIs that I have addressed in this PR. These refactors will make future work on streams much more straightforward, as well as increase consistency and quality in the library.

1. Some APIs were putting too much implementation in a public method. I split these so that the public/detail balance is consistent with the rest of libcudf.
2. A number of public APIs were missing `CUDF_FUNC_RANGE`, making it difficult to recognize those functions in profiles (cc: @GregoryKimball).
3. Stream handling was not consistent, with some functions not using the `stream` they were passed and using `cudf::default_stream_value` instead.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Nghia Truong (https://github.com/ttnghia)

URL: #11600
That's correct, but we've also been extraordinarily lax in our stream sync safety. I'm pretty confident that there are many places where we're copying from host memory asynchronously without synchronizing to ensure the host object is still valid before the copy completes. I think the only things saving us are the facts that we're copying from pageable memory and that Thrust is injecting a bunch of artificial syncs for us. Eliminating all artificial syncs in favor of putting a single one at the end is a reasonable step forward, but getting to a point where every libcudf API is 100% safe for asynchrony with the absolute minimum number of syncs is going to be a lot of work.
I wanted to add some notes from offline discussions, for the record -- and to provide some forward guidance on how this might be resolved. First, a few performance highlights that motivate this work:
Above, I noted the performance of two different modes: "no syncs" and "single final sync." To know whether a final sync (or intermediate syncs) is required for correctness, we must do manual analysis rather than the fully automated batch refactor I took in this PR. That is why this PR is experimental and cannot be merged directly -- it was just a way to find what could be improved by reducing the number of syncs, assuming that stream safety is never an issue (which is an unsafe assumption). We have to do manual analysis to prevent the kind of potential problems that @jrhemstad mentioned:
The consensus among those I've spoken with (@jrhemstad, @davidwendt, and developers on the Spark/Python teams) seems to be that the performance improvement from fewer syncs at small data sizes would make this a worthwhile change. However, it will require manual analysis of stream correctness in each API, and there are no tools that can automate this process to make it faster than the manual task of "thinking about it." Edge cases, such as ensuring host memory is async-copied before the end of its lifetime, are crucial to catch for correctness.
Agreed, it is a large undertaking. My goal for the "manual analysis" is to pick a few APIs that show a substantial performance increase in the benchmarks above, and manually verify which Thrust calls can safely use a `nosync` policy. Next steps:
Fixes some calls that were not passing the stream variable to detail functions. Found these while looking into improvements for #11577.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)
- Mark Harris (https://github.com/harrism)

URL: #11642
Adds calls to `cudf::column::set_null_count()` when the null count is known. Found these while looking into improvements for #11577.

There are several ways to build a `cudf::column` object to be returned. Often the column is created and then filled in by calling `cudf::column::mutable_view()` and using the resulting `mutable_view` object. The `cudf::column::mutable_view()` function has the side effect of invalidating the column's internal null count. This is for efficiency: the null count is only computed when the value is specifically requested through the `cudf::column::null_count()` method. Computing the null count inside `null_count()` requires a kernel launch. However, there are several places where the null count is known before returning the column, and setting the value there means a later call to `cudf::column::null_count()` does not need to compute it.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- https://github.com/nvdbaranec

URL: #11646
This PR uses `rmm::exec_policy_nosync` in libcudf's gather and scatter functions. These changes are motivated by performance improvements seen previously in #11577.

# Checklist
- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)

URL: #12038
I'm closing this PR. The major findings from this experiment and next steps are documented in issue #12086. I believe most of the documentation tasks mentioned above regarding stream safety are now covered in the Developer Guide section "Treat libcudf APIs as if they were asynchronous."
Description
This PR draft is an experiment using `rmm::exec_policy_nosync` to call all Thrust algorithms with the `thrust::cuda::par_nosync` execution policy. This removes many instances of stream synchronization in Thrust, except when required for correctness (e.g. if the algorithm returns a value to the host, a sync is required).

At present, two commits have been benchmarked. 946cf5d directly replaces all instances of the execution policy with `exec_policy_nosync`. This could lead to unsynced streams when the libcudf public API returns for some functions, but shows a clear performance benefit for small data sizes in the benchmarks: https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70

Commit 2552c4c adds a manual stream synchronization at the end of every public API. This is guaranteed to be correct, but the final stream sync may not be necessary for all APIs if the detail API already synced, leading to lower performance for some APIs in the benchmarks: https://gist.github.com/bdice/4ade40a2e66d555fb8edc85f78eec0a2
I don't intend for this PR to be merged (or reviewed as-is) at this point. There are better designs for managing syncs that we could explore, like RAII; I have some internal refactors I'd like to make before engaging in such a large refactor; and it should certainly be done in pieces. This PR is currently just a way to share preliminary data and start a discussion about improved stream handling.
A few notes:

- `nosync` execution policies will require analysis of every use case individually, at both the detail and public API levels.

Checklist