Use HIP as kernel backend instead of HSA #423

Merged: 96 commits merged into master, Jul 6, 2023
Conversation

@pxl-th (Member) commented May 19, 2023

  • Bump minimum required ROCm version to 5.3 because of stream-ordered allocator.

  • Use HIP modules instead of HSA.

  • Remove SyncState. Since we now use only HIPStreams, there is no need to keep track of different kernel launches.

  • Fix failing tests for findmax.
    Fixes: Test failures locally on 1.9.0-beta4 -- Radeon 6800XT #400 (comment)

  • Do not specialize on shared memory size in reduce kernels.

  • Do not cache binary dependency config; do runtime discovery of it instead.
    Fixes: First install with JULIA_AMDGPU_DISABLE_ARTIFACTS leads to broken config #424

  • Report exception frames (when launching with -g2).
    Additionally, exception reporting no longer relies on hostcalls, so there is no performance penalty.
    And with an exec-once gate macro we prevent duplication of exceptions from multiple threads and multiple grid elements, as can be seen in the example below.
    This also allows us to report the precise location of the exception.

julia> using AMDGPU
[ Info: Precompiling AMDGPU [21141c5a-9bdb-4563-92ae-f87d6854732e]

julia> function f(x)
           x[2] = 0
           return
       end
f (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1);

julia> @roc gridsize=64 groupsize=128 f(x);

julia> AMDGPU.synchronize()
ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] #throw_boundserror
   @ /home/pxl-th/.julia/dev/AMDGPU/src/device/quirks.jl:6
 [2] checkbounds
   @ ./abstractarray.jl:709
 [3] #setindex!
   @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/array.jl:88
 [4] f
   @ ./REPL[2]:2

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] synchronize(s::HIPStream)
   @ AMDGPU ~/.julia/dev/AMDGPU/src/highlevel.jl:155
 [3] synchronize()
   @ AMDGPU ~/.julia/dev/AMDGPU/src/highlevel.jl:154
 [4] top-level scope
   @ REPL[4]:1

@vchuravy (Member)

Pass LLVM bitcode to HIP for kernel compilation.

That is not ideal. Is there a way for HIP to accept .so files? The issue here is that the primary source of incompatibilities between the ROCm stack and Julia is LLVM version mismatch. I think we can fix parts of it, but we need a "neutral" interchange format like an ELF file.

I wonder if we could reduce the overhead coming the other way: HIP uses HSA internally, and maybe they could expose an interface to access the HSA events behind a stream?

Generally in favor of using KernelState for exception handling.

@pxl-th (Member, Author) commented May 23, 2023

Is there a way for HIP to accept .so files?

Yes, there is. Just confirmed that the previous way we did compilation works.
My thinking was that we (potentially) could apply missing target-specific optimizations if we pass LLVM bitcode to HIP.


This PR in particular tries to solve the following issue (besides making things faster):

  1. Skip host synchronization when it is safe to do so #419 removed waits on kernel launches and performed synchronization only at hsa | hip boundaries.
    So hsa | hsa or hip | hip relies on on-device serialization.
    But this surfaced an issue where GC would kick in too fast, freeing arrays that were passed to HIP kernels (a minimal sketch of this hazard follows the list).
  2. To fix this, I've moved all array allocations to HIP, thus making them async and stream-ordered: Migrate to stream-ordered allocator for arrays #422.
    However, HIP uses memory pools and you can limit their size only starting from ROCm 5.5: Support for setting hipLimitMallocHeapSize on ROCm 5.4? ROCm/hipamd#70
    Otherwise, the memory pool grows too big and steals all resources from the HSA queue for kernel dispatch.
    Trying to limit the pool size on the Julia side seems unreliable (or costly).
  3. Finally, moving kernels to HIP should hopefully fix this and remove the need to synchronize at hsa | hip boundaries.
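
A minimal, hypothetical sketch of the hazard from items 1-2 (double! and add_kernel! are illustrative names, not AMDGPU.jl API):

using AMDGPU

# Hypothetical element-wise kernel: y .+= tmp.
function add_kernel!(y, tmp)
    i = (workgroupIdx().x - UInt32(1)) * workgroupDim().x + workitemIdx().x
    if i <= length(y)
        @inbounds y[i] += tmp[i]
    end
    return
end

# With fully asynchronous launches the host returns immediately, so `tmp`
# becomes unreachable as soon as this function returns and may be GC-freed
# before the kernel has actually run. Stream-ordered allocation/free (item 2)
# defers the real free until the stream reaches that point in its queue.
function double!(y::ROCArray{Float32})
    tmp = copy(y)  # device temporary used only by the kernel
    @roc groupsize=256 gridsize=cld(length(y), 256) add_kernel!(y, tmp)
    return y
end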

@pxl-th (Member, Author) commented May 23, 2023

However, there seems to be no way to define extinit global variables like there is with HSA:
HSA.executable_agent_global_variable_define

So the way we did exceptions, printing, malloc, etc. via hostcalls does not work and requires changes...

@vchuravy (Member)

My thinking was that we (potentially) could apply missing target-specific optimizations if we pass LLVM bitcode to HIP

Maybe, but I am more worried about the LLVM inconsistencies. We are fixing a little bit of this on the Julia side come Julia v1.10, but removing the requirement for matched LLVM versions is the more important thing to fix.

If we can give HIP the .so, I am all for this. Death to HSA, long live HIP 😢

@vchuravy (Member)

However, there seems to be no way to define extinit global variables like there is with HSA:

Hm, I thought comgr handled that. Maybe we need to take another look at how hipcc deals with these.

@pxl-th (Member, Author) commented May 25, 2023

There is a strange bug with reporting exception frames though (see the PR description about the frame-reporting addition).
It works when the module is precompiled at using AMDGPU (e.g. after you changed the source code).
Then it works all the time without issues in that Julia session.

But the next time you do using AMDGPU (without any changes to the source code, so there is no precompilation) and execute a kernel that throws an exception, it segfaults:

[120672] signal (11.1): Segmentation fault
in expression starting at none:0
jl_valid_type_param at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:1282 [inlined]
jl_f_apply_type at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:1328
unsafe_load at /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/output.jl:140
unknown function (ip: 0x7f18540ba622)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
#33 at /home/pxl-th/.julia/dev/AMDGPU/src/compiler/output_context.jl:38
unknown function (ip: 0x7f18540b9d05)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:730
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/hostcall.jl:305 [inlined]
#39 at ./threadingconstructs.jl:373
unknown function (ip: 0x7f18540b5e1f)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
start_task at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 15884493 (Pool: 15871716; Big: 12777); GC: 23
Segmentation fault (core dumped)

The line in question is:

T = unsafe_pointer_to_objref(T_ptr)

There it tries to recover the type from a pointer to it.
The pointer itself was obtained from:

@generated function _pointer_from_type(::Type{T}) where T
    UInt64(pointer_from_objref(T))
end

And written to the hostcall buffer on the device.

Maybe when there is no precompilation, things go wrong (maybe something goes to cache...) :/
@jpsamaroo or @vchuravy maybe you have some suggestions...

This happens for the Base.Cstring type.

@pxl-th (Member, Author) commented May 25, 2023

In the first session with precompilation we get matching pointers to Cstring:

T_ptr                         0x00007f5e3f1c2060
pointer_from_objref(Cstring): 0x00007f5e3f1c2060

In the second session, without precompilation, we get pointers that differ:

T_ptr                         0x00007f5e3f1c2060
pointer_from_objref(Cstring): 0x00007f8ae8bc2060

And T_ptr matches the one from the first session.
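
For context, a type object's address is only meaningful within the current process. A hypothetical round-trip (not AMDGPU code) showing why a cached pointer goes stale across sessions:

# Valid only while `ptr` comes from the *current* session:
ptr = UInt64(Base.pointer_from_objref(Cstring))
@assert Base.unsafe_pointer_to_objref(Ptr{Cvoid}(ptr)) === Cstring
# If that UInt64 is baked into generated code and then cached by precompilation,
# the next session dereferences an address from a previous process, which is
# exactly the segfault above.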

@vchuravy (Member)

Yeah, that seems like an illegal generated function, and if it happens to be cached, things will go badly.

@vchuravy (Member)

We will have to come up with a better solution when we want to support caching anyway... I would try just making it pure? Instead of generated?

@pxl-th (Member, Author) commented May 25, 2023

I would try just making it pure? Instead of generated?

Same result. BTW, this happens only for exception reporting. Regular @rocprintf calls from kernels work and they all use that method.

@pxl-th (Member, Author) commented May 25, 2023

Placing the code of that function where it was called also gives the same result.
So the problem may be somewhere deeper.

@pxl-th (Member, Author) commented May 30, 2023

Nerf.jl benchmark, 1000 training steps, 512 batch size:

  • AMDGPU#master:
    54.938175 seconds (8.76 M allocations: 418.460 MiB, 2.47% gc time, 0.02% compilation time)

  • This PR:
    41.944204 seconds (53.07 M allocations: 947.432 MiB, 0.41% gc time, 0.01% compilation time)

Need to debug what's causing so many allocations.

@pxl-th (Member, Author) commented May 30, 2023

The downside is that we now have even more memory pressure, which leads to the same situation where the HSA queue is unable to allocate the resources it needs for kernel dispatch.

:0:rocdevice.cpp            :2672: 10404529258 us: 80780: [tid:0x7fe7263ff640]
    Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES:
        The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008

[80780] signal (6.-6): Aborted

This is likely because we are now dispatching on the host too fast and not freeing allocations in time.

I had to resort to more manual memory freeing, like adding an AMDGPU.Mem.definitely_free function (the naming is temporary and the function itself should be a macro).

It captures all device allocations made during the user's function execution and frees them immediately after.
An example use case is Flux & Zygote, where you can't call (or add) unsafe_free! in the model's forward pass.

model = # some Flux model, for example
AMDGPU.Mem.definitely_free() do
    # All device allocations made inside this block are freed right after it.
    gradients = Zygote.gradient(model) do m
        y = m(x)  # differentiate w.r.t. the closure argument `m`
    end
    apply!(optimizer, gradients, model)
end

This may be fixed by ROCm 5.5, where you can specify a hard limit on the memory pool size.
Although it is weird that the HSA queue suffers from this.
Instead of failing on kernel dispatches, it should fail on user allocations and keep enough reserved memory for the queue to operate... :/

@luraess (Collaborator) commented May 30, 2023

Testing the following memcpy code, trying to saturate the memory bandwidth on MI250x, results in the following scaling:

  • using AMDGPU#master: peak ~1.38 TB/s (plot "memcopy_master")

  • using this PR: peak ~1.27 TB/s (plot "memcopy")

Code used:
⚠️ the nblocks definition needs to be adapted depending on which branch is tested (on #master it equals N, i.e. the whole grid)

using AMDGPU
using Printf
using CairoMakie
using BenchmarkTools

function mycopy!(dst,src)
    I = (AMDGPU.workgroupIdx().x-1)*AMDGPU.workgroupDim().x + AMDGPU.workitemIdx().x
    if I <= length(dst)
        @inbounds dst[I] = src[I]
    end
    return
end

function main(; N=256)
    dst = ROCArray{Float64}(undef,N)
    src = ROCArray{Float64}(undef,N)
    nthreads = 256
    # nblocks  = N # master
    nblocks  = cld(N,nthreads) # pxl-th/hiprtc
    GC.gc()
    GC.enable(false)
    # profile
    tm = @belapsed begin
        @roc groupsize=$nthreads gridsize=$nblocks mycopy!($dst,$src)
        AMDGPU.synchronize()
    end# evals=10 samples=100
    mtp = 2*N*sizeof(eltype(dst))/tm * 1e-9 # GB/s
    # @printf "mem_rate = %1.2e \n" mtp
    GC.enable(true)
    return mtp
end

function run_benchmark(N_rng)
    println("memcopy AMDGPU"); B_memcopy = [main(N=N) for N in N_rng]
    fig = Figure(fontsize=24)
    ax  = Axis(fig[1,1];xscale=log2,xlabel="n",ylabel="GB/s")
    scatterlines!(ax,N_rng,B_memcopy;label="memcopy AMDGPU")
    axislegend(ax; position=:lt)
    return fig
end

save("memcopy.png",run_benchmark( (32*2 .^ (1:10)).^2 ))

@pxl-th (Member, Author) commented May 31, 2023

So the drop in performance happens at 2048^2 - 4096^2 (reproducible on Navi 2 as well).

Looking at a simplified version of the code in the profiler (note that here we synchronize on each iteration):

function main(; N)
    dst = ROCArray{Float64}(undef, N)
    src = ROCArray{Float64}(undef, N)
    nthreads = 256
    nblocks = cld(N, nthreads)
    for i in 1:20
        @roc groupsize=nthreads gridsize=nblocks mycopy!(dst, src)
        AMDGPU.synchronize()
    end
    return
end
(profiler screenshots for 2048^2 and 4096^2)

We can see a significant drop in performance.

However, we can also see that for 4096^2 it varies in the 1-10 ms range.

(profiler screenshots: a 1 ms sample and a 10 ms sample)

Now, let's do the same, but synchronize only at the end of the loop:

function main(; N)
    dst = ROCArray{Float64}(undef, N)
    src = ROCArray{Float64}(undef, N)
    nthreads = 256
    nblocks = cld(N, nthreads)
    for i in 1:20
        @roc groupsize=nthreads gridsize=nblocks mycopy!(dst, src)
    end
    AMDGPU.synchronize()
    return
end

We now see that the kernel launches are adjacent and that for 4096^2 the time went down significantly (the first kernel dispatch probably hit low frequencies or a cold cache).

(profiler screenshots for 2048^2 and 4096^2)

It could be that synchronization causes the performance drop (dropping frequencies or hitting a cold cache every time). But I'm not sure why AMDGPU#master, which uses the HSA backend, does not show this.

@pxl-th (Member, Author) commented May 31, 2023

On Julia nightly with LLVM 15 only one test fails (non-contiguous softmax):

Contiguous & non-contiguous dims: Error During Test at /home/pxl-th/.julia/dev/AMDGPU/test/dnn/softmax.jl:1
 Got exception outside of a @test
 LLVM error: Cannot select: 0x1bd1bfe8: v16f16 = X86ISD::FMAX nnan ninf nsz arcp contract afn reassoc 0x1ecadbe0, 0x1bf41cc0, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]
   0x1ecadbe0: v16f16,ch = CopyFromReg 0x1dcb7728, Register:v16f16 %9, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]
     0x1d4c7290: v16f16 = Register %9
   0x1bf41cc0: v16f16,ch = CopyFromReg 0x1dcb7728, Register:v16f16 %10, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]
     0x1bf41778: v16f16 = Register %10
 In function: julia_mapfoldl_impl_8505
 Stacktrace:
   [1] handle_error(reason::Cstring)
     @ LLVM ~/.julia/packages/LLVM/bsdku/src/core/context.jl:118
   [2] _mapreduce_dim(f::Function, op::Function, nt::Float16, A::Vector{Float16}, ::Colon)
     @ Base ./reducedim.jl:362
   [3] mapreduce(f::typeof(identity), op::Function, A::Vector{Float16}; dims::Colon, init::Float16)
     @ Base ./reducedim.jl:357 [inlined]
   [4] reduce(op::Function, A::Vector{Float16}; kw::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, @NamedTuple{dims::Colon, init::Float16}})
     @ Base ./reducedim.jl:406 [inlined]
   [5] fast_maximum(x::Vector{Float16}; dims::Function)
     @ NNlib ~/.julia/packages/NNlib/Jmwx0/src/softmax.jl:91
   [6] softmax!(out::Vector{Float16}, x::Vector{Float16}; dims::Function)
     @ NNlib ~/.julia/packages/NNlib/Jmwx0/src/softmax.jl:61
   [7] softmax(x::Vector{Float16}; dims::Colon)
     @ NNlib ~/.julia/packages/NNlib/Jmwx0/src/softmax.jl:56 [inlined]

@vchuravy @jpsamaroo

Everything else passes.

UPD: Ooh, this is actually the CPU softmax from NNlib (we compare against NNlib in tests), unrelated to AMDGPU.

@pxl-th (Member, Author) commented May 31, 2023

ROCm 5.5 still fails:

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-DEV.1406 (2023-05-31)
 _/ |\__'_|_|_|\__'_|  |  Commit ca3270b06f4 (0 days old master)
|__/                   |

julia> using AMDGPU

julia> AMDGPU.versioninfo()
Using ROCm provided by: System
HSA Runtime (ready)
- Path: /opt/rocm-5.5.0/lib/libhsa-runtime64.so
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
HIP Runtime (ready)
- Path: /opt/rocm-5.5.0/lib/libamdhip64.so
rocBLAS (ready)
- Path: /opt/rocm-5.5.0/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm-5.5.0/lib/librocsolver.so
rocALUTION (ready)
- Path: /opt/rocm-5.5.0/lib/librocalution.so
rocSPARSE (ready)
- Path: /opt/rocm-5.5.0/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm-5.5.0/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm-5.5.0/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm-5.5.0/lib/libMIOpen.so
HSA Agents (2):
- CPU-XX [AMD Ryzen 7 5800X 8-Core Processor]
- GPU-XX [AMD Radeon RX 6700 XT (gfx1030)]

julia> x = AMDGPU.rand(Float32, 1024);

julia> sin.(x)
error: Opaque pointers are only supported in -opaque-pointers mode (Producer: 'LLVM16.0.0git' Reader: 'LLVM 15.0.7jl')

@vchuravy (Member)

Yeah, for ROCm 5.5 we need to build the device library with LLVM 14/LLVM 15.

Can you post the LLVM 15 issue standalone to JuliaLang, e.g. a simple MWE that I can look at?

@pxl-th (Member, Author) commented May 31, 2023

Yeah for ROCM 5.5 we need to build the device library with LLVM 14/LLVM 15.

But what is needed to be able to run with LLVM 16? Eventually we'd need that, I guess.

@luraess (Collaborator) commented May 31, 2023

What bothers me here #423 (comment) is that I don't get why sync time should be sensitive to vector size.

Updating the benchmark results from #423 (comment), with the timing now done as:

tm = 0.0
for it in 1:1000
    if (it == 100) tm = time_ns() end
    @roc groupsize=nthreads gridsize=nblocks mycopy!(dst,src)
end
AMDGPU.synchronize()
tm = (time_ns() - tm) * 1e-9
mtp = 2*N*sizeof(eltype(dst))*900/tm * 1e-9 # GB/s

results in the following perf (plot "memcopy_3").

It's still interesting that one can "only" reach ~76% of the announced peak memory throughput (1.6 TB/s on MI250x).

@vchuravy (Member)

But what is needed to be able to run with LLVM 16? Eventually we'd need that I guess

My point is that there are two different issues.

  1. Julia reads the device bitcode. That means the bitcode needs to be compatible with Julia's LLVM version.
  2. HIP & Co use a different version of LLVM; that's fine by us since we use ELF binaries as an interface.

So as long as we can generate the device bitcode file with compatible settings we can use newer versions of LLVM for HIP.

@pxl-th (Member, Author) commented May 31, 2023

It's still interesting one can "only" reach ~76% of announced peak memory throughput (1.6TB/s on MI250x).

What results are you getting if we remove all the unnecessary stuff, like with the code below?

using AMDGPU
using CairoMakie

function mycopy!(dst, src)
    i = (workgroupIdx().x - UInt32(1)) * workgroupDim().x + workitemIdx().x
    @inbounds dst[i] = src[i]
    return
end

function main(; N)
    dst = ROCArray{Float64}(undef, N)
    src = ROCArray{Float64}(undef, N)

    nthreads = 256
    nblocks = cld(N, nthreads)
    @assert N % nthreads == 0

    GC.gc()
    GC.enable(false)

    iters, warmup = 1000, 100
    stream = AMDGPU.stream()
    kernel = @roc launch=false mycopy!(dst, src)

    for i in 1:warmup
        kernel(dst, src; stream, gridsize=nblocks, groupsize=nthreads)
    end
    AMDGPU.synchronize(stream) # Otherwise we'll be timing warmup kernels as well.

    t_start = time_ns()
    for i in 1:(iters - warmup)
        kernel(dst, src; stream, gridsize=nblocks, groupsize=nthreads)
    end
    AMDGPU.synchronize(stream)
    t_end = time_ns()

    tm = (t_end - t_start) * 1e-9
    mtp = 2 * N * sizeof(eltype(dst)) * (iters - warmup) / tm * 1e-9 # GB/s

    AMDGPU.unsafe_free!(dst)
    AMDGPU.unsafe_free!(src)
    GC.enable(true)
    return mtp
end

function run_benchmark(N_rng)
    B_memcopy = [main(; N) for N in N_rng]
    fig = Figure(fontsize=24)
    ax  = Axis(fig[1, 1]; xscale=log2, xlabel="n", ylabel="GB/s")
    scatterlines!(ax, N_rng, B_memcopy; label="memcopy AMDGPU")
    axislegend(ax; position=:lt)
    return fig
end

save("memcopy.png", run_benchmark((32 * 2 .^ (1:10)) .^ 2))

@pxl-th (Member, Author) commented May 31, 2023

I'll also add timing using HIP events: https://docs.amd.com/bundle/HIP-API-Guide-v5.4/page/a00185.html#gad4128b815cb475c8e13c7e66ff6250b7

That way we'll be able to measure more precisely.
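
A rough sketch (hypothetical, not the API this PR ends up exposing) of what HIP-event timing looks like via direct ccall into libamdhip64; error codes are ignored for brevity and the default (null) stream is used:

const libhip = "libamdhip64"

function hip_time_ms(f)
    start_e = Ref{Ptr{Cvoid}}(C_NULL); stop_e = Ref{Ptr{Cvoid}}(C_NULL)
    ccall((:hipEventCreate, libhip), Cint, (Ptr{Ptr{Cvoid}},), start_e)
    ccall((:hipEventCreate, libhip), Cint, (Ptr{Ptr{Cvoid}},), stop_e)

    ccall((:hipEventRecord, libhip), Cint, (Ptr{Cvoid}, Ptr{Cvoid}), start_e[], C_NULL)
    f()  # enqueue the GPU work to be timed
    ccall((:hipEventRecord, libhip), Cint, (Ptr{Cvoid}, Ptr{Cvoid}), stop_e[], C_NULL)
    ccall((:hipEventSynchronize, libhip), Cint, (Ptr{Cvoid},), stop_e[])

    elapsed = Ref{Cfloat}(0f0)
    ccall((:hipEventElapsedTime, libhip), Cint, (Ptr{Cfloat}, Ptr{Cvoid}, Ptr{Cvoid}),
          elapsed, start_e[], stop_e[])
    ccall((:hipEventDestroy, libhip), Cint, (Ptr{Cvoid},), start_e[])
    ccall((:hipEventDestroy, libhip), Cint, (Ptr{Cvoid},), stop_e[])
    return elapsed[]  # milliseconds, measured on the GPU timeline
end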

@luraess (Collaborator) commented Jun 1, 2023

It's still interesting one can "only" reach ~76% of announced peak memory throughput (1.6TB/s on MI250x).

What results are you getting if we remove all the unnecessary stuff, like with the code below?

Running the new code you suggest improves perf by about 10%. The peak is now at 1.38 TB/s (compared to 1.26 TB/s with the previous version); see plot "memcopy_new".

@luraess (Collaborator) commented Jun 1, 2023

Note that Julia still hangs on exit() using this branch:

julia> exit()
^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
unknown function (ip: 0x1524ef71d9ab)
unknown function (ip: 0x1ffe)
unknown function (ip: 0x36f81ff)

@pxl-th (Member, Author) commented Jun 1, 2023

The latest commit no longer caches the configuration of binary dependencies and instead does runtime discovery.
This should fix: #424

@pxl-th (Member, Author) commented Jun 1, 2023

Note that Julia still hangs on exit() using this branch:

julia> exit()
^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
unknown function (ip: 0x1524ef71d9ab)
unknown function (ip: 0x1ffe)
unknown function (ip: 0x36f81ff)

I'm not able to reproduce this easily, but it happens during HIP stream finalization.
Now it should print something like "[!] Error in HIPStream finalizer: code - description".
If you see it, please mention it here.

@pxl-th merged commit 3171d2b into master, Jul 6, 2023
@pxl-th deleted the pxl-th/hiprtc branch, July 6, 2023 05:47
@vchuravy (Member) commented Jul 6, 2023

Congratulations! That was quite the PR :)

@pxl-th (Member, Author) commented Jul 6, 2023

Thank you :) It was well worth it!

@Krastanov

Hey, folks! I am an AMDGPU user, so I have been following the repository, but I am not experienced enough to understand the reasons behind this (very impressive looking) effort. In the last couple of months I have also started trying to curate efforts like this and make them more visible and understandable to laymen in the new Julia newsletter.

This seems like one such effort, but the problem is that I am one of the laymen in this situation. Could you consider writing a short paragraph about why this is important and what the difference is between HIP and HSA, so that I can include it in the newsletter?

@pxl-th (Member, Author) commented Jul 6, 2023

@Krastanov thanks!

This PR moved everything onto HIP streams.
Before, we'd launch Julia-generated kernels on HSA queues, while things like matmul (rocBLAS), convolutions (MIOpen), etc. ran on HIP streams.

But because an HSA queue does not know anything about a HIP stream (and vice versa), you needed to synchronize on the host to prevent races.
Consider (x .* y) * z:

  • x .* y was dispatched on HSA queue
  • we waited on host for it to complete
  • only then we'd launch ... * z on HIP stream

This slowed things down quite a lot.

Moving everything to HIP allows us to skip synchronization on the host for x .* y and launch matmul immediately after, because everything is stream-ordered.
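
Illustrative only (array shapes made up for the example): both operations below are enqueued on the same HIP stream and ordered on the device, so the host only waits at the explicit synchronize.

using AMDGPU

x = AMDGPU.rand(Float32, 1024, 1024)
y = AMDGPU.rand(Float32, 1024, 1024)
z = AMDGPU.rand(Float32, 1024, 1024)

out = (x .* y) * z   # broadcast kernel, then rocBLAS matmul, no host-side wait in between
AMDGPU.synchronize() # single wait once the result is actually needed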

Besides that, memory operations are now asynchronous and stream-ordered as well.
This means we no longer need to "preserve" arguments for the duration of the kernel to avoid accidental freeing of the resources.

Besides the HIP-related stuff, there are some improvements to global hostcalls, which are now launched and paused automatically when they are used.
This also helps performance, as each runs in its own thread, constantly polling the signals used for GPU <-> CPU communication.

And many other small changes.

As for some numbers, all this improved the performance of (yet unreleased) StableDiffusion from 40 seconds to 7-8 seconds on RX 6700 XT (20 steps of diffusion).
