Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to task-focused synchronization model #374

Merged
merged 31 commits into from
Apr 3, 2023
Merged

Switch to task-focused synchronization model #374

merged 31 commits into from
Apr 3, 2023

Conversation

jpsamaroo
Copy link
Member

@jpsamaroo jpsamaroo commented Feb 5, 2023

This PR switches AMDGPU to a task-focused synchronization model, similar to CUDA.jl. In such a model, tasks have their own device and queue assigned, and all GPU operations on that task are launched in serial. That is to say, kernels don't start until any previously-launched kernels have completed. It is thus no longer necessary to wait on the result of @roc, and instead, one can call AMDGPU.synchronize() on the host to wait for all currently-executing kernels (launched from this task) to complete.

There are many reasons why we'd want to switch to this model:

  • The rest of the ecosystem generally expects this, as CUDA.jl pioneered this approach
  • HSA queues are actually in-order, so there's no reason to pretend that they can execute work concurrently
  • It simplifies most common forms of GPU programming, as users are used to thinking about serial computations
  • The programming model approximately matches how Julia tasks execute

The existing wait-focused model is still fully available, so for the most part, behavior remains the same (assuming users were previously only using one task to launch GPU kernels); the previous single-queue behavior can be enabled by calling AMDGPU.Runtime.set_alloc_queue_pool_max!([1,0,0]), which shares a single queue per device, and disables the queue pooling logic for low- and high-priority queues.

Todo:

  • Do the same pooling for HIP streams and track those in TLS
  • Cache task-local ROCm library handles in TLS
  • Track and reuse previously-allocated queue/stream when switching device
  • Skip cleanup by default when waiting on ROCKernelSignal
  • Replace task-local queue if inactive
  • Throw better error in launch when queue is inactive
  • Don't propagate errors through synchronize() by default, make sure that errors do print in background
  • Prefix ROCm lib API calls with context enablement
  • Add tests
  • Add docs

Copy link
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it.

src/highlevel.jl Outdated Show resolved Hide resolved
src/tls.jl Outdated Show resolved Hide resolved
@luraess
Copy link
Contributor

luraess commented Feb 5, 2023

That's awesome @jpsamaroo ! Thanks, we will look into this with @omlins and @utkinis to finalise AMDGPU support in ImplicitGlobalGrid.jl and ParallelStencil.jl.

jpsamaroo and others added 10 commits March 29, 2023 23:05
Add HIPDevice and HIPStream abstractions
Add `synchronize(::HIPStream)`
Stores a task-local device, queue, context, and stream via
`task_local_storage`. Task state can be accessed with
`task_local_state`, and stored with `task_local_state!`. The
`task_local_state!` function has a functional form for locally-scoped
task state changes.

`device()`, `queue()`, `context()`, and `stream()` are available to
query TLS for their appropriate values, and `device!()`, `queue!()`,
and `stream!()` can be used to set a new value in TLS. Additionally,
queue and stream priorities are accessible with `priority()` and
settable with `priority!()`.

`default_queue()` now forwards to `queue()`, and is deprecated.
`default_device()` and friends still work globally, but users should
prefer to use `device()` and `device!()` instead.

There is no longer a single default queue per device; instead, queues
are pooled per-device and are assigned to tasks in a round-robin
fashion. Separate pool sizes are configurable on a per-queue-priority
basis.

Active kernel tracking is now handled by a per-queue background task
which waits on and cleans up the active kernel list (which is now moved
into the `ROCQueue` object).

A new function `synchronize()` is available to wait on all
currently-active kernels on the current queue and/or stream to complete.
Switch library handles to be allocated per-task
Copy HandleCache mechanism from CUDA.jl
Always include library wrapper code
Also use `queue` instead of `default_queue` to avoid triggering depwarn.
pxl-th added 2 commits March 30, 2023 00:59
- Use current default device when creating HIPStream from raw
  hipStream_t. Since there appears to be no way to get the device from
  just hipStream_t.
- Use passed device in TLS for HIPStream creation.
pxl-th added 2 commits March 31, 2023 15:37
- Construct HIPStream using device from current HIPContext.
- Add docs clarifying distinction between `get_default_device()` &
  `device()`.
pxl-th and others added 5 commits March 31, 2023 22:29
This linked list is append-only and singly-linked, and allows multiple
threads to concurrently read from it, while one task may advance the
list head serially.
Uses the new `LinkedList` in place of `Vector` for active kernel
tracking, which allows mostly-lockless concurrent access to the active
kernel set, often with no or minimal allocations.

Propagates errors from the current or prior kernels in `synchronize` by
default, and adds an `errors::Bool` kwarg to `synchronize` to allow
skipping error checking if desired.

Disables all kernel resource cleanup during `wait`, and instead moves
this fully into the kernel monitor. Also fixes a bug in the kernel
monitor where a dead queue may not have all kernel resources cleaned up.

Removes auto-reset of the queue in TLS, and instead provides
`reset_dead_queue!()` to reset the queue if it's dead.

Properly catches and ignores kernel errors in the queue monitor.

Makes the `queue` field in `ROCQueue` no longer atomic, as we do not
change it.
src/runtime/queue.jl Outdated Show resolved Hide resolved
@jpsamaroo jpsamaroo marked this pull request as ready for review April 3, 2023 15:15
@pxl-th pxl-th force-pushed the jps/tls-queue branch 3 times, most recently from 05c41ed to 6f82440 Compare April 3, 2023 17:59
@pxl-th pxl-th merged commit ed61beb into master Apr 3, 2023
@jpsamaroo jpsamaroo deleted the jps/tls-queue branch April 3, 2023 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rocBLAS: Remove old hand-wrapped code State of queues and streams
4 participants