Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] multibyte_split implicitly synchronizes all streams #11929

Closed
vyasr opened this issue Oct 14, 2022 · 5 comments · Fixed by #11875
Closed

[BUG] multibyte_split implicitly synchronizes all streams #11929

vyasr opened this issue Oct 14, 2022 · 5 comments · Fixed by #11875
Assignees
Labels
bug Something isn't working

Comments

@vyasr
Copy link
Contributor

vyasr commented Oct 14, 2022

Describe the bug
Currently multibyte_split uses a stream pool to divide up work between multiple streams. However, output_builder.inplace_resize still resizes an rmm::device_uvector on CUDA's default stream. This function is being called from output_builder.advance_output, which does not currently accept a stream. This is problematic since the default stream implicitly synchronizes with all streams.

Expected behavior
From a quick analysis it is not clear to me whether advance_output or inplace_resize should be running on the scan_stream or the read_stream inside multibyte_split. Based on where advance_output is called, I suspect that it actually requires an explicit synchronization between these two streams, and perhaps the existing events are already intended to solve that need. Fixing this would require modifying this function to run on the appropriate stream, and perhaps introducing an explicit synchronization or some appropriate event waiting.

@vyasr vyasr added bug Something isn't working Needs Triage Need team to review and classify labels Oct 14, 2022
@vyasr
Copy link
Contributor Author

vyasr commented Oct 14, 2022

Perhaps @upsj or @cwharris have some insight into the appropriate fix here.

@upsj
Copy link
Contributor

upsj commented Oct 15, 2022

Does this actually synchronize? device_uvector doesn't and device_buffer doesn't either in the shrinking case, as far as I can tell. It only modifies the size member appropriately.
I have a choice between two approaches: Use device_uvector as a fixed-sized buffer and store its actual size separately, or use size and capacity to keep track of how many elements inside the vector are already in use. What do you think - should this be marked/documented more explicitly or store the size separately?
Without the sync-less resize, having a separate capacity and size seems unnecessary to me, and the reserve method no longer has a valid use case.

@upsj upsj self-assigned this Oct 15, 2022
@vyasr
Copy link
Contributor Author

vyasr commented Oct 15, 2022

Does this actually synchronize? device_uvector doesn't and device_buffer doesn't either in the shrinking case, as far as I can tell. It only modifies the size member appropriately.

You're right that neither of those synchronizes. However, device_buffer::resize sets the stream used for deallocating the buffer, which ends up triggering a sync on that stream when the destructor of the uvector is called. Both char_storage and row_offset_storage are function scoped, so they are only destroyed at the end of the function when this function's frame is popped off the stack. join_streams is called well before that and it looks like the input stream is set to wait on both the streams in the pool, so I think all that needs to happen is that the input stream needs to get passed through advance_output to inplace_resize so that the vectors in the two output_builders (char_storage and row_offset_storage) are destroyed on the user-provided input stream, i.e. after all work in multibyte split is done but without using the default stream. Avoiding the default stream saves us triggering a sync at the end of multibyte_split, which will be undesirable once we expose a truly stream-ordered API that allows users to run multiple libcudf APIs (or other external functions) simultaneously on different streams.

@upsj
Copy link
Contributor

upsj commented Oct 16, 2022

Thanks for the details, I'll adopt your suggestions

@vyasr
Copy link
Contributor Author

vyasr commented Oct 17, 2022

@upsj in the interest of making sure everything works together I saved you the time and have gone ahead and implemented my proposed solution in #11875. Tests still pass for me locally, so I think it should be OK, but I'll keep you apprised if something goes wrong!

rapids-bot bot pushed a commit that referenced this issue Oct 21, 2022
This PR ensures that cudf's default stream is properly passed to all kernel launches so that nothing implicitly runs on the CUDA default stream. It adds a small library that is built during the tests and overloads CUDA functions to throw an exception when usage of the default stream is detected. It also fixes all remaining usage of anything other than cudf's default stream (I fixed most of the issues in previous PRs, but I found a few others when finalizing this one).

Resolves #11929
Resolves #11942 

### Important notes for reviewers:
- **The changeset is deceptively large.** The vast majority of the changes are just a global find-and-replace of `cudf::get_default_stream()` for `cudf::default_stream_value`, as well as a few smaller fixes such as missing `CUDF_TEST_PROGRAM_MAIN` in a couple of tests and usage of `rmm::cuda_stream_default`. The meaningful changes are:
    - The new default stream getter/setter in `default_stream.[hpp|cpp]`
    - The addition of `cpp/tests/utilities/identify_stream_usage`
    - The changes to the base testing fixture in `cpp/include/cudf_test/base_fixture.hpp` to inject the custom stream.
    - The changes to CI in `ci/gpu/build.sh` to build and use the new library.
- This PR is a breaking change because it moves the default stream into the detail namespace. Going forward the default stream may only be accessed using the public accessor `cudf::get_default_stream()`. I have added a corresponding setter, but it is also in the detail namespace since I do not want to publicly support changing the default stream yet, only for the purpose of testing. Reviewers, please leave comments if you disagree with those choices.
- I have made getting and setting the default stream thread-safe, but there is still only a single stream. In multi-threaded applications we may want to support a stream per thread so that users could manually achieve PTDS with more fine-tuned control. Is this worthwhile? Even if it is, I'm inclined to wait for a subsequent PR to implement this unless someone feels strongly otherwise.
- I'm currently only overloading `cudaLaunchKernel`. I can add overloads for other functions as well, but I didn't want to go through the effort of overloading every possible API. If reviewers have a minimal set that they'd like to see overloaded, let me know. [I've included links to all the relevant pages of the CUDA runtime API in the identify_stream_usage.cu file](https://github.com/rapidsai/cudf/pull/11875/files#diff-0b2762207c27c080acd2114475c7a1c06377a7c18c4e9c3de60ecbdc82a4dc61R99) if someone wants to look through them.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Bradley Dice (https://github.com/bdice)
  - Sevag H (https://github.com/sevagh)
  - https://github.com/brandon-b-miller
  - Jake Hemstad (https://github.com/jrhemstad)
  - David Wendt (https://github.com/davidwendt)

URL: #11875
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants