-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to task-focused synchronization model #374
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
jpsamaroo
added
enhancement
New feature or request
performance
needs tests
needs docs
labels
Feb 5, 2023
vchuravy
reviewed
Feb 5, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it.
That's awesome @jpsamaroo ! Thanks, we will look into this with @omlins and @utkinis to finalise AMDGPU support in ImplicitGlobalGrid.jl and ParallelStencil.jl. |
jpsamaroo
force-pushed
the
jps/tls-queue
branch
2 times, most recently
from
February 21, 2023 02:20
5563c12
to
ec06fe4
Compare
jpsamaroo
force-pushed
the
jps/tls-queue
branch
from
February 23, 2023 20:30
ec06fe4
to
290abc8
Compare
This was
linked to
issues
Feb 23, 2023
jpsamaroo
force-pushed
the
jps/tls-queue
branch
3 times, most recently
from
March 8, 2023 20:22
cbc005a
to
c43c447
Compare
Closed
Add HIPDevice and HIPStream abstractions Add `synchronize(::HIPStream)`
Stores a task-local device, queue, context, and stream via `task_local_storage`. Task state can be accessed with `task_local_state`, and stored with `task_local_state!`. The `task_local_state!` function has a functional form for locally-scoped task state changes. `device()`, `queue()`, `context()`, and `stream()` are available to query TLS for their appropriate values, and `device!()`, `queue!()`, and `stream!()` can be used to set a new value in TLS. Additionally, queue and stream priorities are accessible with `priority()` and settable with `priority!()`. `default_queue()` now forwards to `queue()`, and is deprecated. `default_device()` and friends still work globally, but users should prefer to use `device()` and `device!()` instead. There is no longer a single default queue per device; instead, queues are pooled per-device and are assigned to tasks in a round-robin fashion. Separate pool sizes are configurable on a per-queue-priority basis. Active kernel tracking is now handled by a per-queue background task which waits on and cleans up the active kernel list (which is now moved into the `ROCQueue` object). A new function `synchronize()` is available to wait on all currently-active kernels on the current queue and/or stream to complete.
Switch library handles to be allocated per-task Copy HandleCache mechanism from CUDA.jl Always include library wrapper code
Also use `queue` instead of `default_queue` to avoid triggering depwarn.
- Use current default device when creating HIPStream from raw hipStream_t. Since there appears to be no way to get the device from just hipStream_t. - Use passed device in TLS for HIPStream creation.
- Construct HIPStream using device from current HIPContext. - Add docs clarifying distinction between `get_default_device()` & `device()`.
This linked list is append-only and singly-linked, and allows multiple threads to concurrently read from it, while one task may advance the list head serially.
Uses the new `LinkedList` in place of `Vector` for active kernel tracking, which allows mostly-lockless concurrent access to the active kernel set, often with no or minimal allocations. Propagates errors from the current or prior kernels in `synchronize` by default, and adds an `errors::Bool` kwarg to `synchronize` to allow skipping error checking if desired. Disables all kernel resource cleanup during `wait`, and instead moves this fully into the kernel monitor. Also fixes a bug in the kernel monitor where a dead queue may not have all kernel resources cleaned up. Removes auto-reset of the queue in TLS, and instead provides `reset_dead_queue!()` to reset the queue if it's dead. Properly catches and ignores kernel errors in the queue monitor. Makes the `queue` field in `ROCQueue` no longer atomic, as we do not change it.
pxl-th
reviewed
Mar 31, 2023
pxl-th
force-pushed
the
jps/tls-queue
branch
3 times, most recently
from
April 3, 2023 17:59
05c41ed
to
6f82440
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR switches AMDGPU to a task-focused synchronization model, similar to CUDA.jl. In such a model, tasks have their own device and queue assigned, and all GPU operations on that task are launched in serial. That is to say, kernels don't start until any previously-launched kernels have completed. It is thus no longer necessary to
wait
on the result of@roc
, and instead, one can callAMDGPU.synchronize()
on the host to wait for all currently-executing kernels (launched from this task) to complete.There are many reasons why we'd want to switch to this model:
The existing
wait
-focused model is still fully available, so for the most part, behavior remains the same (assuming users were previously only using one task to launch GPU kernels); the previous single-queue behavior can be enabled by callingAMDGPU.Runtime.set_alloc_queue_pool_max!([1,0,0])
, which shares a single queue per device, and disables the queue pooling logic for low- and high-priority queues.Todo:
ROCKernelSignal
synchronize()
by default, make sure that errors do print in background