-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stream: implement stream_workq #6062
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
hzhou
force-pushed
the
2206_stream_workq
branch
8 times, most recently
from
June 18, 2022 21:25
6a6451a
to
4b8ce89
Compare
hzhou
force-pushed
the
2206_stream_workq
branch
7 times, most recently
from
July 6, 2022 15:54
fc1cc26
to
e574edc
Compare
hzhou
force-pushed
the
2206_stream_workq
branch
9 times, most recently
from
July 13, 2022 22:34
3eea3e5
to
26a299a
Compare
hzhou
force-pushed
the
2206_stream_workq
branch
3 times, most recently
from
July 21, 2022 22:37
a3b7c69
to
b3606df
Compare
raffenet
reviewed
Aug 29, 2022
hzhou
force-pushed
the
2206_stream_workq
branch
from
August 29, 2022 22:39
3c4b3c4
to
e5c8323
Compare
raffenet
approved these changes
Aug 30, 2022
Import the same mechanism of compile cu files from yaksa using the helper script cudalt.sh. We removed the libtool version check because the mechanism of directly invoking $(libtool) is not stable and libtool doesn't really check the exact versions.
This provides micro gpu kernels for triggered operations.
Different backends may employ different mechanism to synchronize between CPU and GPU trigger/wait kernels. Add MPL wrappers for abstractions. For cuda, this is just voalatile int (allocated on a GPU registered host buffer). We can't call cudaFreeHost or cudaHostUnregister while a wait kernel is pending. Thus we maintain a pending wait_kernel_count so we can delay the cudaFreeHost if necessary.
Allow user to progress a single stream. MPIX_STREAM_NULL is allowed. It is similar to MPI_Test but without a specific request. This API allows users to spawn their own progress thread. The srteam semantics applies. That is, users are not allowed to concurrently call MPIX_Stream_progress with another operation on the same stream.
Add functions to launch progress thread that runs MPIX_Stream_progress. Also replace the implementation of MPIR_CVAR_ASYNC_PROGRESS with the new API.
Add support facility for enqueue and progress the stream-enqueue operations, such as MPIX_Send_enqueue. Each workq item have a trigger event and optionally a done event. GPU stream runtime flips the trigger event. CPU progress issues the op once trigger event is set. If there is a done event, the CPU progress will flip the done event once the operation is completed.
Because the workq implementations are at device layer, we need these ADI hooks to implement the workq enqueue functionality within the device layer.
Use this inline function to simplify accessing the local MPIX stream associated with a stream communicator.
It needs to be included after the mpidpre.h.
Expose the two utility functions from stream_enqueue.c -- MPIR_get_local_gpu_stream and MPIR_allocate_enqueue_request. Both functions will be used by implementations of enqueue operations.
Zero the dev portion in case MPIR_Stream_create_impl fail before MPID_Stream_create_hook. This is because MPID_Stream_free_hook will be invoked to clean up a partially allocated stream object.
Store stream_ptr instead of gpu_stream in enqueued request. We may need to access more information pertain to a stream than just the gpu stream, for example, the workq associated with the stream.
Add ch4 implementation of MPID_Send_enqueue, which uses the stream workq facility and micro gpu kernels for CPU/GPU synchronizations. The use of stream workq will require dedicated progress thread. We default to fall back to MPIR implementations. User can enable stream workq using MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ.
pulling in the commits that enables "yaksa_has_wait_kernel" info hint.
Add cvar MPIR_CVAR_GPU_HAS_WAIT_KERNEL to supply yaksa_init with info hint "yaksa_has_wait_kernel". With the hint, yaksa should avoid code that may potentially dead lock with a wait kernel.
Add stream workq tests by setting MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ and command line option -progress-thread.
The cuda wait kernel may deadlock with the progress thread.
hzhou
force-pushed
the
2206_stream_workq
branch
from
August 30, 2022 21:49
e5c8323
to
a4aac5d
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Pull Request Description
Add workq based stream enqueue implementation.
Caveat
The wait kernel will block allocation and free of GPU registered host buffer, resulting in a potential deadlock
It turns out that, at least for CUDA, using unregistered host buffer for staging is fine. I am not sure how
cudaMemcpyAsync
deals with unregistered host buffer, but no errors! Potentially it wasn't run truely as async, but non-optimal is better than not working at all.To avoid registered host buffer, this includes genq or yaksa pools since the pools need allocate slabs, yaksa needs an option to treat unregistered host buffer the same as registered buffer, as well as make the pool to use unregistered buffers.
EDIT: we also need avoid yaksa's lazy stream creation because the stream creation is also locked out by the wait kernel.
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.