
stream: implement stream_workq #6062

Merged · 22 commits · Aug 30, 2022

Conversation


@hzhou hzhou commented Jun 18, 2022

Pull Request Description

Add workq based stream enqueue implementation.

Caveat

The wait kernel will block allocation and freeing of GPU-registered host buffers, resulting in a potential deadlock.

It turns out that, at least for CUDA, using an unregistered host buffer for staging is fine. I am not sure how cudaMemcpyAsync deals with unregistered host buffers, but there were no errors! Potentially the copy was not run truly asynchronously, but non-optimal is better than not working at all.

To avoid registered host buffers, including those used by the genq and yaksa pools (since the pools need to allocate slabs), yaksa needs an option to treat unregistered host buffers the same as registered buffers, and the pools need to be made to use unregistered buffers.

EDIT: we also need to avoid yaksa's lazy stream creation, because stream creation is also locked out by the wait kernel.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou marked this pull request as draft June 18, 2022 15:26
@hzhou hzhou force-pushed the 2206_stream_workq branch 8 times, most recently from 6a6451a to 4b8ce89 Compare June 18, 2022 21:25
@hzhou hzhou force-pushed the 2206_stream_workq branch 7 times, most recently from fc1cc26 to e574edc Compare July 6, 2022 15:54
@hzhou hzhou force-pushed the 2206_stream_workq branch 9 times, most recently from 3eea3e5 to 26a299a Compare July 13, 2022 22:34
@hzhou hzhou marked this pull request as ready for review July 13, 2022 22:39
@hzhou hzhou force-pushed the 2206_stream_workq branch from 26a299a to 7dd59d1 Compare July 14, 2022 00:54
@hzhou hzhou force-pushed the 2206_stream_workq branch 3 times, most recently from a3b7c69 to b3606df Compare July 21, 2022 22:37
src/mpi/init/init_async.c (outdated review thread, resolved)
@hzhou hzhou force-pushed the 2206_stream_workq branch from 3c4b3c4 to e5c8323 Compare August 29, 2022 22:39
@hzhou hzhou requested a review from raffenet August 29, 2022 22:41
hzhou added 22 commits August 30, 2022 16:45
Import the same mechanism for compiling .cu files from yaksa using the
helper script cudalt.sh. We removed the libtool version check because
directly invoking $(libtool) is not stable and libtool
doesn't really check the exact versions.
This provides micro GPU kernels for triggered operations.
Different backends may employ different mechanisms to synchronize between
CPU and GPU trigger/wait kernels. Add MPL wrappers for the abstraction.

For CUDA, this is just a volatile int (allocated in a GPU-registered host
buffer). We can't call cudaFreeHost or cudaHostUnregister while a wait
kernel is pending. Thus we maintain a pending wait_kernel_count so we
can delay the cudaFreeHost if necessary.
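The delayed-free bookkeeping described in this commit can be sketched in plain C. This is an illustrative mock, not MPICH's actual MPL code: `host_buffer_free`, the deferred list, and the counter-update functions are hypothetical names, and plain `free` stands in for `cudaFreeHost`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of the deferred-free bookkeeping: while any wait
 * kernel is pending, freeing a GPU-registered host buffer is deferred. */

static int wait_kernel_count = 0;       /* pending wait kernels */

#define MAX_DEFERRED 16
static void *deferred[MAX_DEFERRED];    /* buffers whose free was delayed */
static int num_deferred = 0;

static void host_buffer_free(void *buf)
{
    if (wait_kernel_count > 0) {
        /* a wait kernel is pending: calling cudaFreeHost now could
         * deadlock, so remember the buffer and free it later */
        deferred[num_deferred++] = buf;
    } else {
        free(buf);                      /* stands in for cudaFreeHost */
    }
}

static void wait_kernel_launched(void)
{
    wait_kernel_count++;
}

static void wait_kernel_completed(void)
{
    if (--wait_kernel_count == 0) {
        /* no wait kernels remain pending: drain the deferred frees */
        while (num_deferred > 0)
            free(deferred[--num_deferred]);
    }
}
```

In the real code the counter updates would also need to be synchronized with the progress thread; the single-threaded mock above only shows the deferral logic.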
Allow users to progress a single stream. MPIX_STREAM_NULL is allowed.
It is similar to MPI_Test but without a specific request.

This API allows users to spawn their own progress thread. The stream
semantics apply. That is, users are not allowed to concurrently call
MPIX_Stream_progress with another operation on the same stream.
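A user-spawned progress thread along these lines might look like the following sketch. `make_progress` and `all_done` are illustrative placeholders: in real code the loop body would call `MPIX_Stream_progress` on the stream, and per the stream semantics above no other thread may operate on that stream while the loop runs.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative sketch only: make_progress() stands in for
 * MPIX_Stream_progress(stream); all_done is a hypothetical
 * shutdown flag owned by the application. */

static volatile bool all_done = false;

static void make_progress(void)
{
    /* real code: MPIX_Stream_progress(stream); */
}

static void *progress_fn(void *arg)
{
    (void) arg;
    /* no other thread may concurrently operate on the same stream */
    while (!all_done)
        make_progress();
    return NULL;
}
```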
Add functions to launch a progress thread that runs MPIX_Stream_progress.
Also replace the implementation of MPIR_CVAR_ASYNC_PROGRESS with the new
API.
Add a support facility to enqueue and progress the stream-enqueue
operations, such as MPIX_Send_enqueue. Each workq item has a trigger
event and optionally a done event. The GPU stream runtime flips the trigger
event, and CPU progress issues the op once the trigger event is set. If there is
a done event, CPU progress will flip the done event once the
operation is completed.
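The trigger/done handshake described in this commit can be illustrated with a CPU-only mock. The struct layout and names below are hypothetical, not the actual stream_workq types; in the real design the trigger event lives in GPU-visible memory and is flipped by the GPU stream runtime, and the done event is flipped only when the issued operation actually completes (here the op is assumed to complete synchronously).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical workq item: names and layout are illustrative only. */
typedef struct workq_item {
    volatile int *trigger;              /* flipped by the GPU stream runtime */
    volatile int *done;                 /* optional; flipped by CPU progress */
    void (*issue_op)(struct workq_item *item);
    bool issued;
} workq_item;

static void noop_op(struct workq_item *item)
{
    (void) item;                        /* stands in for issuing, e.g., a send */
}

/* one CPU progress pass over a single item */
static void workq_progress(workq_item *item)
{
    if (!item->issued && *item->trigger) {
        item->issue_op(item);           /* issue the op once triggered */
        item->issued = true;
        /* simplification: assume the op completed synchronously, so the
         * done event can be flipped right away */
        if (item->done)
            *item->done = 1;
    }
}
```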
Because the workq implementations are at the device layer, we need these ADI
hooks to implement the workq enqueue functionality within the device
layer.
Use this inline function to simplify accessing the local MPIX stream
associated with a stream communicator.
It needs to be included after mpidpre.h.
Expose the two utility functions from stream_enqueue.c --
MPIR_get_local_gpu_stream and MPIR_allocate_enqueue_request. Both
functions will be used by implementations of enqueue operations.
Zero the dev portion in case MPIR_Stream_create_impl fails before
MPID_Stream_create_hook. This is because MPID_Stream_free_hook will
be invoked to clean up a partially allocated stream object.
Store stream_ptr instead of gpu_stream in the enqueued request. We may need
to access more information pertaining to a stream than just the gpu
stream, for example, the workq associated with the stream.
Add the ch4 implementation of MPID_Send_enqueue, which uses the stream workq
facility and micro GPU kernels for CPU/GPU synchronization.

Using the stream workq requires a dedicated progress thread, so by
default we fall back to the MPIR implementations. Users can enable the stream
workq using MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ.
Pull in the commits that enable the "yaksa_has_wait_kernel" info hint.
Add the cvar MPIR_CVAR_GPU_HAS_WAIT_KERNEL to supply yaksa_init with the info
hint "yaksa_has_wait_kernel". With the hint, yaksa should avoid code
that may potentially deadlock with a wait kernel.
Add stream workq tests by setting MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ and
the command line option -progress-thread.
The CUDA wait kernel may deadlock with the progress thread.
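For reference, such a test run might be invoked as sketched below. The mpiexec launcher and test binary name are placeholders; only the CVAR and the -progress-thread option come from this PR.

```shell
# Placeholder launcher and binary names; the CVAR and the
# -progress-thread option are the ones added in this PR.
MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ=1 \
    mpiexec -n 2 ./stream_enqueue_test -progress-thread
```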
@hzhou hzhou force-pushed the 2206_stream_workq branch from e5c8323 to a4aac5d Compare August 30, 2022 21:49
@hzhou hzhou merged commit b603e80 into pmodels:main Aug 30, 2022
@hzhou hzhou deleted the 2206_stream_workq branch August 30, 2022 22:32
@hzhou hzhou mentioned this pull request May 15, 2023