-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] multibyte_split implicitly synchronizes all streams #11929
Comments
Does this actually synchronize? |
You're right that neither of those synchronizes. However, |
Thanks for the details, I'll adopt your suggestions |
This PR ensures that cudf's default stream is properly passed to all kernel launches so that nothing implicitly runs on the CUDA default stream. It adds a small library that is built during the tests and overloads CUDA functions to throw an exception when usage of the default stream is detected. It also fixes all remaining usage of anything other than cudf's default stream (I fixed most of the issues in previous PRs, but I found a few others when finalizing this one). Resolves #11929 Resolves #11942 ### Important notes for reviewers: - **The changeset is deceptively large.** The vast majority of the changes are just a global find-and-replace of `cudf::get_default_stream()` for `cudf::default_stream_value`, as well as a few smaller fixes such as missing `CUDF_TEST_PROGRAM_MAIN` in a couple of tests and usage of `rmm::cuda_stream_default`. The meaningful changes are: - The new default stream getter/setter in `default_stream.[hpp|cpp]` - The addition of `cpp/tests/utilities/identify_stream_usage` - The changes to the base testing fixture in `cpp/include/cudf_test/base_fixture.hpp` to inject the custom stream. - The changes to CI in `ci/gpu/build.sh` to build and use the new library. - This PR is a breaking change because it moves the default stream into the detail namespace. Going forward the default stream may only be accessed using the public accessor `cudf::get_default_stream()`. I have added a corresponding setter, but it is also in the detail namespace since I do not want to publicly support changing the default stream yet, only for the purpose of testing. Reviewers, please leave comments if you disagree with those choices. - I have made getting and setting the default stream thread-safe, but there is still only a single stream. In multi-threaded applications we may want to support a stream per thread so that users could manually achieve PTDS with more fine-tuned control. Is this worthwhile? Even if it is, I'm inclined to wait for a subsequent PR to implement this unless someone feels strongly otherwise. - I'm currently only overloading `cudaLaunchKernel`. I can add overloads for other functions as well, but I didn't want to go through the effort of overloading every possible API. If reviewers have a minimal set that they'd like to see overloaded, let me know. [I've included links to all the relevant pages of the CUDA runtime API in the identify_stream_usage.cu file](https://github.com/rapidsai/cudf/pull/11875/files#diff-0b2762207c27c080acd2114475c7a1c06377a7c18c4e9c3de60ecbdc82a4dc61R99) if someone wants to look through them. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Jason Lowe (https://github.com/jlowe) - Bradley Dice (https://github.com/bdice) - Sevag H (https://github.com/sevagh) - https://github.com/brandon-b-miller - Jake Hemstad (https://github.com/jrhemstad) - David Wendt (https://github.com/davidwendt) URL: #11875
Describe the bug
Currently
multibyte_split
uses a stream pool to divide up work between multiple streams. However,output_builder.inplace_resize
still resizes anrmm::device_uvector
on CUDA's default stream. This function is being called fromoutput_builder.advance_output
, which does not currently accept a stream. This is problematic since the default stream implicitly synchronizes with all streams.Expected behavior
From a quick analysis it is not clear to me whether
advance_output
orinplace_resize
should be running on thescan_stream
or theread_stream
insidemultibyte_split
. Based on whereadvance_output
is called, I suspect that it actually requires an explicit synchronization between these two streams, and perhaps the existing events are already intended to solve that need. Fixing this would require modifying this function to run on the appropriate stream, and perhaps introducing an explicit synchronization or some appropriate event waiting.The text was updated successfully, but these errors were encountered: