
stream: implement stream_workq #6062

Merged · 22 commits · Aug 30, 2022

Conversation


@hzhou hzhou commented Jun 18, 2022

Pull Request Description

Add workq based stream enqueue implementation.

Caveat

The wait kernel will block allocation and freeing of GPU-registered host buffers, resulting in a potential deadlock.

It turns out that, at least for CUDA, using an unregistered host buffer for staging is fine. I am not sure how cudaMemcpyAsync deals with unregistered host buffers, but there were no errors! Potentially the copy was not run truly asynchronously, but non-optimal is better than not working at all.

To avoid registered host buffers, including those used by the genq and yaksa pools (since the pools need to allocate slabs), yaksa needs an option to treat unregistered host buffers the same as registered buffers, and the pools need to be made to use unregistered buffers.

EDIT: we also need to avoid yaksa's lazy stream creation, because stream creation is also locked out by the wait kernel.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou marked this pull request as draft June 18, 2022 15:26
@hzhou hzhou force-pushed the 2206_stream_workq branch 8 times, most recently from 6a6451a to 4b8ce89 Compare June 18, 2022 21:25
@hzhou hzhou force-pushed the 2206_stream_workq branch 7 times, most recently from fc1cc26 to e574edc Compare July 6, 2022 15:54
@hzhou hzhou force-pushed the 2206_stream_workq branch 9 times, most recently from 3eea3e5 to 26a299a Compare July 13, 2022 22:34
@hzhou hzhou marked this pull request as ready for review July 13, 2022 22:39
@hzhou hzhou force-pushed the 2206_stream_workq branch from 26a299a to 7dd59d1 Compare July 14, 2022 00:54
@hzhou hzhou force-pushed the 2206_stream_workq branch 3 times, most recently from a3b7c69 to b3606df Compare July 21, 2022 22:37
src/mpi/init/init_async.c (outdated review thread, resolved)
@hzhou hzhou force-pushed the 2206_stream_workq branch from 3c4b3c4 to e5c8323 Compare August 29, 2022 22:39
@hzhou hzhou requested a review from raffenet August 29, 2022 22:41
hzhou added 22 commits August 30, 2022 16:45
Import the same mechanism for compiling .cu files from yaksa using the
helper script cudalt.sh. We removed the libtool version check because
directly invoking $(libtool) is not stable and libtool
doesn't really check the exact versions.
This provides micro GPU kernels for triggered operations.
Different backends may employ different mechanisms to synchronize between
CPU and GPU trigger/wait kernels. Add MPL wrappers for the abstraction.

For CUDA, this is just a volatile int (allocated in a GPU-registered host
buffer). We can't call cudaFreeHost or cudaHostUnregister while a wait
kernel is pending. Thus we maintain a pending wait_kernel_count so we
can delay the cudaFreeHost if necessary.
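The delayed-free bookkeeping described in this commit can be sketched in plain C. This is an illustrative mock, not MPICH's actual MPL code: `host_buffer_free`, the deferred list, and the counter-update functions are hypothetical names, and plain `free` stands in for `cudaFreeHost`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of the deferred-free bookkeeping: while any wait
 * kernel is pending, freeing a GPU-registered host buffer is deferred. */

static int wait_kernel_count = 0;       /* pending wait kernels */

#define MAX_DEFERRED 16
static void *deferred[MAX_DEFERRED];    /* buffers whose free was delayed */
static int num_deferred = 0;

static void host_buffer_free(void *buf)
{
    if (wait_kernel_count > 0) {
        /* a wait kernel is pending: calling cudaFreeHost now could
         * deadlock, so remember the buffer and free it later */
        deferred[num_deferred++] = buf;
    } else {
        free(buf);                      /* stands in for cudaFreeHost */
    }
}

static void wait_kernel_launched(void)
{
    wait_kernel_count++;
}

static void wait_kernel_completed(void)
{
    if (--wait_kernel_count == 0) {
        /* no wait kernels remain pending: drain the deferred frees */
        while (num_deferred > 0)
            free(deferred[--num_deferred]);
    }
}
```

In the real code the counter updates would also need to be synchronized with the progress thread; the single-threaded mock above only shows the deferral logic.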
Allow users to progress a single stream. MPIX_STREAM_NULL is allowed.
It is similar to MPI_Test but without a specific request.

This API allows users to spawn their own progress thread. The stream
semantics apply. That is, users are not allowed to concurrently call
MPIX_Stream_progress with another operation on the same stream.
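A user-spawned progress thread along these lines might look like the following sketch. `make_progress` and `all_done` are illustrative placeholders: in real code the loop body would call `MPIX_Stream_progress` on the stream, and per the stream semantics above no other thread may operate on that stream while the loop runs.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative sketch only: make_progress() stands in for
 * MPIX_Stream_progress(stream); all_done is a hypothetical
 * shutdown flag owned by the application. */

static volatile bool all_done = false;

static void make_progress(void)
{
    /* real code: MPIX_Stream_progress(stream); */
}

static void *progress_fn(void *arg)
{
    (void) arg;
    /* no other thread may concurrently operate on the same stream */
    while (!all_done)
        make_progress();
    return NULL;
}
```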
Add functions to launch a progress thread that runs MPIX_Stream_progress.
Also replace the implementation of MPIR_CVAR_ASYNC_PROGRESS with the new
API.
Add a support facility to enqueue and progress the stream-enqueue
operations, such as MPIX_Send_enqueue. Each workq item has a trigger
event and optionally a done event. The GPU stream runtime flips the trigger
event, and CPU progress issues the op once the trigger event is set. If there is
a done event, CPU progress will flip the done event once the
operation is completed.
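The trigger/done handshake described in this commit can be illustrated with a CPU-only mock. The struct layout and names below are hypothetical, not the actual stream_workq types; in the real design the trigger event lives in GPU-visible memory and is flipped by the GPU stream runtime, and the done event is flipped only when the issued operation actually completes (here the op is assumed to complete synchronously).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical workq item: names and layout are illustrative only. */
typedef struct workq_item {
    volatile int *trigger;              /* flipped by the GPU stream runtime */
    volatile int *done;                 /* optional; flipped by CPU progress */
    void (*issue_op)(struct workq_item *item);
    bool issued;
} workq_item;

static void noop_op(struct workq_item *item)
{
    (void) item;                        /* stands in for issuing, e.g., a send */
}

/* one CPU progress pass over a single item */
static void workq_progress(workq_item *item)
{
    if (!item->issued && *item->trigger) {
        item->issue_op(item);           /* issue the op once triggered */
        item->issued = true;
        /* simplification: assume the op completed synchronously, so the
         * done event can be flipped right away */
        if (item->done)
            *item->done = 1;
    }
}
```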
Because the workq implementations are at the device layer, we need these ADI
hooks to implement the workq enqueue functionality within the device
layer.
Use this inline function to simplify accessing the local MPIX stream
associated with a stream communicator.
It needs to be included after mpidpre.h.
Expose the two utility functions from stream_enqueue.c --
MPIR_get_local_gpu_stream and MPIR_allocate_enqueue_request. Both
functions will be used by implementations of enqueue operations.
Zero the dev portion in case MPIR_Stream_create_impl fails before
MPID_Stream_create_hook. This is because MPID_Stream_free_hook will
be invoked to clean up a partially allocated stream object.
Store stream_ptr instead of gpu_stream in the enqueued request. We may need
to access more information pertaining to a stream than just the gpu
stream, for example, the workq associated with the stream.
Add the ch4 implementation of MPID_Send_enqueue, which uses the stream workq
facility and micro GPU kernels for CPU/GPU synchronization.

Using the stream workq requires a dedicated progress thread, so by
default we fall back to the MPIR implementations. Users can enable the stream
workq using MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ.
Pull in the commits that enable the "yaksa_has_wait_kernel" info hint.
Add the cvar MPIR_CVAR_GPU_HAS_WAIT_KERNEL to supply yaksa_init with the info
hint "yaksa_has_wait_kernel". With the hint, yaksa should avoid code
that may potentially deadlock with a wait kernel.
Add stream workq tests by setting MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ and
the command line option -progress-thread.
The CUDA wait kernel may deadlock with the progress thread.
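For reference, such a test run might be invoked as sketched below. The mpiexec launcher and test binary name are placeholders; only the CVAR and the -progress-thread option come from this PR.

```shell
# Placeholder launcher and binary names; the CVAR and the
# -progress-thread option are the ones added in this PR.
MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ=1 \
    mpiexec -n 2 ./stream_enqueue_test -progress-thread
```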
@hzhou hzhou force-pushed the 2206_stream_workq branch from e5c8323 to a4aac5d Compare August 30, 2022 21:49
@hzhou hzhou merged commit b603e80 into pmodels:main Aug 30, 2022
@hzhou hzhou deleted the 2206_stream_workq branch August 30, 2022 22:32
@hzhou hzhou mentioned this pull request May 15, 2023