Use HIP as kernel backend instead of HSA #423

Merged: 96 commits merged into master, Jul 6, 2023
Conversation

@pxl-th (Member) commented May 19, 2023

  • Bump minimum required ROCm version to 5.3 because of stream-ordered allocator.

  • Use HIP modules instead of HSA.

  • Remove SyncState. Since we now use only HIPStreams, there is no need to keep track of different kernel launches.

  • Fix failing tests for findmax.
    Fixes: Test failures locally on 1.9.0-beta4 -- Radeon 6800XT #400 (comment)

  • Do not specialize on shared memory size in reduce kernels.

  • Do not cache binary dependency config; do runtime discovery of it instead.
    Fixes: First install with JULIA_AMDGPU_DISABLE_ARTIFACTS leads to broken config #424

  • Report exception frames (when launching with -g2).
    Additionally, exception reporting no longer relies on hostcalls, so there is no performance penalty.
    And with an exec-once gate macro we prevent duplication of exceptions from multiple threads and multiple grid elements, as can be seen in the example below.
    This also allows us to report the precise location of the exception.

julia> using AMDGPU
[ Info: Precompiling AMDGPU [21141c5a-9bdb-4563-92ae-f87d6854732e]

julia> function f(x)
           x[2] = 0
           return
       end
f (generic function with 1 method)

julia> x = ROCArray{Int}(undef, 1);

julia> @roc gridsize=64 groupsize=128 f(x);

julia> AMDGPU.synchronize()
ERROR: Out-of-bounds array access.
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] #throw_boundserror
   @ /home/pxl-th/.julia/dev/AMDGPU/src/device/quirks.jl:6
 [2] checkbounds
   @ ./abstractarray.jl:709
 [3] #setindex!
   @ /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/array.jl:88
 [4] f
   @ ./REPL[2]:2

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] synchronize(s::HIPStream)
   @ AMDGPU ~/.julia/dev/AMDGPU/src/highlevel.jl:155
 [3] synchronize()
   @ AMDGPU ~/.julia/dev/AMDGPU/src/highlevel.jl:154
 [4] top-level scope
   @ REPL[4]:1

@vchuravy (Member)

Pass LLVM bitcode to HIP for kernel compilation.

That is not ideal. Is there a way for HIP to accept .so files? The issue here is that the primary source of incompatibilities between the ROCm stack and Julia is LLVM version mismatch. I think we can fix parts of it, but we need a "neutral" interchange format like an ELF file.

I wonder if we could reduce the overhead coming the other way: HIP uses HSA internally, and maybe they could expose an interface to access the HSA events behind a stream?

Generally in favor of using KernelState for exception handling.

@pxl-th (Member, Author) commented May 23, 2023

Is there a way for HIP to accept .so files?

Yes, there is. Just confirmed that the previous way we did compilation works.
My thinking was that we (potentially) could apply missing target-specific optimizations if we pass LLVM bitcode to HIP.


This PR in particular tries to solve the following issue (besides making things faster):

  1. Skip host synchronization when it is safe to do so #419 removed waits on kernel launches and performed synchronization only at hsa | hip boundaries.
    So hsa | hsa or hip | hip relies on on-device serialization.
    But this surfaced an issue where GC would kick in too fast, freeing arrays that were passed to HIP kernels (a minimal sketch of this hazard follows the list).
  2. To fix this, I've moved all array allocations to HIP, thus making them async and stream-ordered: Migrate to stream-ordered allocator for arrays #422.
    However, HIP uses memory pools and you can limit their size only starting from ROCm 5.5: Support for setting hipLimitMallocHeapSize on ROCm 5.4? ROCm/hipamd#70
    Otherwise, the memory pool grows too big and steals all resources from the HSA queue for kernel dispatch.
    Trying to limit the pool size on the Julia side seems unreliable (or costly).
  3. Finally, moving kernels to HIP should hopefully fix this and remove the need to synchronize at hsa | hip boundaries.
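
A minimal, hypothetical sketch of the hazard from items 1-2 (double! and add_kernel! are illustrative names, not AMDGPU.jl API):

using AMDGPU

# Hypothetical element-wise kernel: y .+= tmp.
function add_kernel!(y, tmp)
    i = (workgroupIdx().x - UInt32(1)) * workgroupDim().x + workitemIdx().x
    if i <= length(y)
        @inbounds y[i] += tmp[i]
    end
    return
end

# With fully asynchronous launches the host returns immediately, so `tmp`
# becomes unreachable as soon as this function returns and may be GC-freed
# before the kernel has actually run. Stream-ordered allocation/free (item 2)
# defers the real free until the stream reaches that point in its queue.
function double!(y::ROCArray{Float32})
    tmp = copy(y)  # device temporary used only by the kernel
    @roc groupsize=256 gridsize=cld(length(y), 256) add_kernel!(y, tmp)
    return y
end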

@pxl-th (Member, Author) commented May 23, 2023

However, there seems to be no way to define extinit global variables like there is with HSA:
HSA.executable_agent_global_variable_define

So the way we did exceptions, printing, malloc, etc. via hostcalls does not work and requires changes...

@vchuravy (Member)

My thinking was that we (potentially) could apply missing target-specific optimizations if we pass LLVM bitcode to HIP

Maybe, but I am more worried about the LLVM inconsistencies. We are fixing a little bit of this on the Julia side come Julia v1.10, but removing the requirement for matched LLVM versions is the more important thing to fix.

If we can give HIP the .so, I am all for this. Death to HSA, long live HIP 😢

@vchuravy (Member)

However, there seems to be no way to define extinit global variables like there is with HSA:

Hm, I thought comgr handled that. Maybe we need to take another look at how hipcc deals with these.

@pxl-th (Member, Author) commented May 25, 2023

There is a strange bug with reporting exception frames though (see the PR description about the frame-reporting addition).
It works when the module is precompiled at using AMDGPU (e.g. after you changed the source code).
Then it works all the time without issues in that Julia session.

But the next time you do using AMDGPU (without any changes to the source code, so there is no precompilation) and execute a kernel that throws an exception, it segfaults:

[120672] signal (11.1): Segmentation fault
in expression starting at none:0
jl_valid_type_param at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:1282 [inlined]
jl_f_apply_type at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:1328
unsafe_load at /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/output.jl:140
unknown function (ip: 0x7f18540ba622)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
#33 at /home/pxl-th/.julia/dev/AMDGPU/src/compiler/output_context.jl:38
unknown function (ip: 0x7f18540b9d05)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:730
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/hostcall.jl:305 [inlined]
#39 at ./threadingconstructs.jl:373
unknown function (ip: 0x7f18540b5e1f)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
start_task at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 15884493 (Pool: 15871716; Big: 12777); GC: 23
Segmentation fault (core dumped)

The line in question is:

T = unsafe_pointer_to_objref(T_ptr)

There it tries to recover the type from a pointer to it.
The pointer itself was obtained from:

@generated function _pointer_from_type(::Type{T}) where T
    UInt64(pointer_from_objref(T))
end

And written to the hostcall buffer on the device.

Maybe when there is no precompilation, things go wrong (maybe something goes to cache...) :/
@jpsamaroo or @vchuravy maybe you have some suggestions...

This happens for the Base.Cstring type.

@pxl-th (Member, Author) commented May 25, 2023

In the first session with precompilation we get matching pointers to Cstring:

T_ptr                         0x00007f5e3f1c2060
pointer_from_objref(Cstring): 0x00007f5e3f1c2060

In the second session, without precompilation, we get pointers that differ:

T_ptr                         0x00007f5e3f1c2060
pointer_from_objref(Cstring): 0x00007f8ae8bc2060

And T_ptr matches the one from the first session.
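
For context, a type object's address is only meaningful within the current process. A hypothetical round-trip (not AMDGPU code) showing why a cached pointer goes stale across sessions:

# Valid only while `ptr` comes from the *current* session:
ptr = UInt64(Base.pointer_from_objref(Cstring))
@assert Base.unsafe_pointer_to_objref(Ptr{Cvoid}(ptr)) === Cstring
# If that UInt64 is baked into generated code and then cached by precompilation,
# the next session dereferences an address from a previous process, which is
# exactly the segfault above.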

@vchuravy (Member)

Yeah, that seems like an illegal generated function, and if it happens to be cached, things will go badly.

@vchuravy (Member)

We will have to come up with a better solution when we want to support caching anyway... I would try just making it pure? Instead of generated?

@pxl-th (Member, Author) commented May 25, 2023

I would try just making it pure? Instead of generated?

Same result. BTW, this happens only for exception reporting. Regular @rocprintf calls from kernels work and they all use that method.

@pxl-th (Member, Author) commented May 25, 2023

Placing the code of that function where it was called also gives the same result.
So the problem may be somewhere deeper.

@pxl-th (Member, Author) commented May 30, 2023

Nerf.jl benchmark, 1000 training steps, 512 batch size:

  • AMDGPU#master:
    54.938175 seconds (8.76 M allocations: 418.460 MiB, 2.47% gc time, 0.02% compilation time)

  • This PR:
    41.944204 seconds (53.07 M allocations: 947.432 MiB, 0.41% gc time, 0.01% compilation time)

Need to debug what's causing so many allocations.

@pxl-th (Member, Author) commented May 30, 2023

The downside is that we now have even more memory pressure, which leads to the same situation where the HSA queue is unable to allocate the resources it needs for kernel dispatch.

:0:rocdevice.cpp            :2672: 10404529258 us: 80780: [tid:0x7fe7263ff640]
    Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES:
        The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008

[80780] signal (6.-6): Aborted

This is likely because we are now dispatching on the host too fast and not freeing allocations in time.

I had to resort to more manual memory freeing, like adding an AMDGPU.Mem.definitely_free function (the naming is temporary and the function itself should be a macro).

It captures all device allocations made during the user's function execution and frees them immediately after.
An example use case is Flux & Zygote, where you can't call (or add) unsafe_free! in the model's forward pass.

model = # some Flux model, for example
AMDGPU.Mem.definitely_free() do
    # All device allocations made inside this block are freed right after it.
    gradients = Zygote.gradient(model) do m
        y = m(x)  # differentiate w.r.t. the closure argument `m`
    end
    apply!(optimizer, gradients, model)
end

This may be fixed by ROCm 5.5, where you can specify a hard limit on the memory pool size.
Although it is weird that the HSA queue suffers from this.
Instead of failing on kernel dispatches, it should fail on user allocations and keep enough reserved memory for the queue to operate... :/

@luraess (Collaborator) commented May 30, 2023

Testing the following memcpy code, trying to saturate the memory bandwidth on MI250x, results in the following scaling:

  • using AMDGPU#master: peak ~1.38 TB/s (plot "memcopy_master")

  • using this PR: peak ~1.27 TB/s (plot "memcopy")

Code used:
⚠️ the nblocks definition needs to be adapted depending on which branch is tested (on #master it equals N, i.e. the whole grid)

using AMDGPU
using Printf
using CairoMakie
using BenchmarkTools

function mycopy!(dst,src)
    I = (AMDGPU.workgroupIdx().x-1)*AMDGPU.workgroupDim().x + AMDGPU.workitemIdx().x
    if I <= length(dst)
        @inbounds dst[I] = src[I]
    end
    return
end

function main(; N=256)
    dst = ROCArray{Float64}(undef,N)
    src = ROCArray{Float64}(undef,N)
    nthreads = 256
    # nblocks  = N # master
    nblocks  = cld(N,nthreads) # pxl-th/hiprtc
    GC.gc()
    GC.enable(false)
    # profile
    tm = @belapsed begin
        @roc groupsize=$nthreads gridsize=$nblocks mycopy!($dst,$src)
        AMDGPU.synchronize()
    end# evals=10 samples=100
    mtp = 2*N*sizeof(eltype(dst))/tm * 1e-9 # GB/s
    # @printf "mem_rate = %1.2e \n" mtp
    GC.enable(true)
    return mtp
end

function run_benchmark(N_rng)
    println("memcopy AMDGPU"); B_memcopy = [main(N=N) for N in N_rng]
    fig = Figure(fontsize=24)
    ax  = Axis(fig[1,1];xscale=log2,xlabel="n",ylabel="GB/s")
    scatterlines!(ax,N_rng,B_memcopy;label="memcopy AMDGPU")
    axislegend(ax; position=:lt)
    return fig
end

save("memcopy.png",run_benchmark( (32*2 .^ (1:10)).^2 ))

@pxl-th (Member, Author) commented May 31, 2023

So the drop in performance happens at 2048^2 - 4096^2 (reproducible on Navi 2 as well).

Looking at a simplified version of the code in the profiler (note that here we synchronize on each iteration):

function main(; N)
    dst = ROCArray{Float64}(undef, N)
    src = ROCArray{Float64}(undef, N)
    nthreads = 256
    nblocks = cld(N, nthreads)
    for i in 1:20
        @roc groupsize=nthreads gridsize=nblocks mycopy!(dst, src)
        AMDGPU.synchronize()
    end
    return
end
(profiler screenshots for 2048^2 and 4096^2)

We can see a significant drop in performance.

However, we can also see that for 4096^2 it varies in the 1-10 ms range.

(profiler screenshots: a 1 ms sample and a 10 ms sample)

Now, let's do the same, but synchronize only at the end of the loop:

function main(; N)
    dst = ROCArray{Float64}(undef, N)
    src = ROCArray{Float64}(undef, N)
    nthreads = 256
    nblocks = cld(N, nthreads)
    for i in 1:20
        @roc groupsize=nthreads gridsize=nblocks mycopy!(dst, src)
    end
    AMDGPU.synchronize()
    return
end

We now see that the kernel launches are adjacent and that for 4096^2 the time went down significantly (the first kernel dispatch probably hit low frequencies or a cold cache).

(profiler screenshots for 2048^2 and 4096^2)

It could be that synchronization causes the performance drop (dropping frequencies or hitting a cold cache every time). But I'm not sure why AMDGPU#master, which uses the HSA backend, does not show this.

@pxl-th (Member, Author) commented May 31, 2023

On Julia nightly with LLVM 15 only one test fails (non-contiguous softmax):

Contiguous & non-contiguous dims: Error During Test at /home/pxl-th/.julia/dev/AMDGPU/test/dnn/softmax.jl:1
 Got exception outside of a @test
 LLVM error: Cannot select: 0x1bd1bfe8: v16f16 = X86ISD::FMAX nnan ninf nsz arcp contract afn reassoc 0x1ecadbe0, 0x1bf41cc0, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]
   0x1ecadbe0: v16f16,ch = CopyFromReg 0x1dcb7728, Register:v16f16 %9, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]
     0x1d4c7290: v16f16 = Register %9
   0x1bf41cc0: v16f16,ch = CopyFromReg 0x1dcb7728, Register:v16f16 %10, array.jl:938 @[ reduce.jl:60 @[ reduce.jl:48 @[ reduce.jl:44 ] ] ]
     0x1bf41778: v16f16 = Register %10
 In function: julia_mapfoldl_impl_8505
 Stacktrace:
   [1] handle_error(reason::Cstring)
     @ LLVM ~/.julia/packages/LLVM/bsdku/src/core/context.jl:118
   [2] _mapreduce_dim(f::Function, op::Function, nt::Float16, A::Vector{Float16}, ::Colon)
     @ Base ./reducedim.jl:362
   [3] mapreduce(f::typeof(identity), op::Function, A::Vector{Float16}; dims::Colon, init::Float16)
     @ Base ./reducedim.jl:357 [inlined]
   [4] reduce(op::Function, A::Vector{Float16}; kw::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, @NamedTuple{dims::Colon, init::Float16}})
     @ Base ./reducedim.jl:406 [inlined]
   [5] fast_maximum(x::Vector{Float16}; dims::Function)
     @ NNlib ~/.julia/packages/NNlib/Jmwx0/src/softmax.jl:91
   [6] softmax!(out::Vector{Float16}, x::Vector{Float16}; dims::Function)
     @ NNlib ~/.julia/packages/NNlib/Jmwx0/src/softmax.jl:61
   [7] softmax(x::Vector{Float16}; dims::Colon)
     @ NNlib ~/.julia/packages/NNlib/Jmwx0/src/softmax.jl:56 [inlined]

@vchuravy @jpsamaroo

Everything else passes.

UPD: Ooh, this is actually the CPU softmax from NNlib (we compare against NNlib in tests), unrelated to AMDGPU.

@pxl-th (Member, Author) commented May 31, 2023

ROCm 5.5 still fails:

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-DEV.1406 (2023-05-31)
 _/ |\__'_|_|_|\__'_|  |  Commit ca3270b06f4 (0 days old master)
|__/                   |

julia> using AMDGPU

julia> AMDGPU.versioninfo()
Using ROCm provided by: System
HSA Runtime (ready)
- Path: /opt/rocm-5.5.0/lib/libhsa-runtime64.so
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
HIP Runtime (ready)
- Path: /opt/rocm-5.5.0/lib/libamdhip64.so
rocBLAS (ready)
- Path: /opt/rocm-5.5.0/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm-5.5.0/lib/librocsolver.so
rocALUTION (ready)
- Path: /opt/rocm-5.5.0/lib/librocalution.so
rocSPARSE (ready)
- Path: /opt/rocm-5.5.0/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm-5.5.0/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm-5.5.0/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm-5.5.0/lib/libMIOpen.so
HSA Agents (2):
- CPU-XX [AMD Ryzen 7 5800X 8-Core Processor]
- GPU-XX [AMD Radeon RX 6700 XT (gfx1030)]

julia> x = AMDGPU.rand(Float32, 1024);

julia> sin.(x)
error: Opaque pointers are only supported in -opaque-pointers mode (Producer: 'LLVM16.0.0git' Reader: 'LLVM 15.0.7jl')

@vchuravy (Member)

Yeah, for ROCm 5.5 we need to build the device library with LLVM 14/LLVM 15.

Can you post the LLVM 15 issue standalone to JuliaLang, e.g. a simple MWE that I can look at?

@pxl-th (Member, Author) commented May 31, 2023

Yeah for ROCM 5.5 we need to build the device library with LLVM 14/LLVM 15.

But what is needed to be able to run with LLVM 16? Eventually we'd need that, I guess.

@luraess (Collaborator) commented May 31, 2023

What bothers me here #423 (comment) is that I don't get why sync time should be sensitive to vector size.

Updating the benchmark results from #423 (comment), with the timing now done as:

tm = 0.0
for it in 1:1000
    if (it == 100) tm = time_ns() end
    @roc groupsize=nthreads gridsize=nblocks mycopy!(dst,src)
end
AMDGPU.synchronize()
tm = (time_ns() - tm) * 1e-9
mtp = 2*N*sizeof(eltype(dst))*900/tm * 1e-9 # GB/s

results in the following perf (plot "memcopy_3").

It's still interesting that one can "only" reach ~76% of the announced peak memory throughput (1.6 TB/s on MI250x).

@vchuravy (Member)

But what is needed to be able to run with LLVM 16? Eventually we'd need that I guess

My point is that there are two different issues.

  1. Julia reads the device bitcode. That means the bitcode needs to be compatible with Julia's LLVM version.
  2. HIP & Co use a different version of LLVM; that's fine by us since we use ELF binaries as an interface.

So as long as we can generate the device bitcode file with compatible settings we can use newer versions of LLVM for HIP.

@pxl-th (Member, Author) commented May 31, 2023

It's still interesting one can "only" reach ~76% of announced peak memory throughput (1.6TB/s on MI250x).

What results are you getting if we remove all the unnecessary stuff, like with the code below?

using AMDGPU
using CairoMakie

function mycopy!(dst, src)
    i = (workgroupIdx().x - UInt32(1)) * workgroupDim().x + workitemIdx().x
    @inbounds dst[i] = src[i]
    return
end

function main(; N)
    dst = ROCArray{Float64}(undef, N)
    src = ROCArray{Float64}(undef, N)

    nthreads = 256
    nblocks = cld(N, nthreads)
    @assert N % nthreads == 0

    GC.gc()
    GC.enable(false)

    iters, warmup = 1000, 100
    stream = AMDGPU.stream()
    kernel = @roc launch=false mycopy!(dst, src)

    for i in 1:warmup
        kernel(dst, src; stream, gridsize=nblocks, groupsize=nthreads)
    end
    AMDGPU.synchronize(stream) # Otherwise we'll be timing warmup kernels as well.

    t_start = time_ns()
    for i in 1:(iters - warmup)
        kernel(dst, src; stream, gridsize=nblocks, groupsize=nthreads)
    end
    AMDGPU.synchronize(stream)
    t_end = time_ns()

    tm = (t_end - t_start) * 1e-9
    mtp = 2 * N * sizeof(eltype(dst)) * (iters - warmup) / tm * 1e-9 # GB/s

    AMDGPU.unsafe_free!(dst)
    AMDGPU.unsafe_free!(src)
    GC.enable(true)
    return mtp
end

function run_benchmark(N_rng)
    B_memcopy = [main(; N) for N in N_rng]
    fig = Figure(fontsize=24)
    ax  = Axis(fig[1, 1]; xscale=log2, xlabel="n", ylabel="GB/s")
    scatterlines!(ax, N_rng, B_memcopy; label="memcopy AMDGPU")
    axislegend(ax; position=:lt)
    return fig
end

save("memcopy.png", run_benchmark((32 * 2 .^ (1:10)) .^ 2))

@pxl-th (Member, Author) commented May 31, 2023

I'll also add timing using HIP events: https://docs.amd.com/bundle/HIP-API-Guide-v5.4/page/a00185.html#gad4128b815cb475c8e13c7e66ff6250b7

That way we'll be able to measure more precisely.
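
A rough sketch (hypothetical, not the API this PR ends up exposing) of what HIP-event timing looks like via direct ccall into libamdhip64; error codes are ignored for brevity and the default (null) stream is used:

const libhip = "libamdhip64"

function hip_time_ms(f)
    start_e = Ref{Ptr{Cvoid}}(C_NULL); stop_e = Ref{Ptr{Cvoid}}(C_NULL)
    ccall((:hipEventCreate, libhip), Cint, (Ptr{Ptr{Cvoid}},), start_e)
    ccall((:hipEventCreate, libhip), Cint, (Ptr{Ptr{Cvoid}},), stop_e)

    ccall((:hipEventRecord, libhip), Cint, (Ptr{Cvoid}, Ptr{Cvoid}), start_e[], C_NULL)
    f()  # enqueue the GPU work to be timed
    ccall((:hipEventRecord, libhip), Cint, (Ptr{Cvoid}, Ptr{Cvoid}), stop_e[], C_NULL)
    ccall((:hipEventSynchronize, libhip), Cint, (Ptr{Cvoid},), stop_e[])

    elapsed = Ref{Cfloat}(0f0)
    ccall((:hipEventElapsedTime, libhip), Cint, (Ptr{Cfloat}, Ptr{Cvoid}, Ptr{Cvoid}),
          elapsed, start_e[], stop_e[])
    ccall((:hipEventDestroy, libhip), Cint, (Ptr{Cvoid},), start_e[])
    ccall((:hipEventDestroy, libhip), Cint, (Ptr{Cvoid},), stop_e[])
    return elapsed[]  # milliseconds, measured on the GPU timeline
end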

@luraess (Collaborator) commented Jun 1, 2023

It's still interesting one can "only" reach ~76% of announced peak memory throughput (1.6TB/s on MI250x).

What results are you getting if we remove all the unnecessary stuff, like with the code below?

Running the new code you suggest improves perf by about 10%. The peak is now at 1.38 TB/s (compared to 1.26 TB/s with the previous version); see plot "memcopy_new".

@luraess (Collaborator) commented Jun 1, 2023

Note that Julia still hangs on exit() using this branch:

julia> exit()
^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
unknown function (ip: 0x1524ef71d9ab)
unknown function (ip: 0x1ffe)
unknown function (ip: 0x36f81ff)

@pxl-th (Member, Author) commented Jun 1, 2023

The latest commit no longer caches the configuration of binary dependencies and instead does runtime discovery.
This should fix: #424

@pxl-th (Member, Author) commented Jun 1, 2023

Note that Julia still hangs on exit() using this branch:

julia> exit()
^C^C^C^C^CWARNING: Force throwing a SIGINT
error in running finalizer: InterruptException()
unknown function (ip: 0x1524ef71d9ab)
unknown function (ip: 0x1ffe)
unknown function (ip: 0x36f81ff)

I'm not able to reproduce this easily, but it happens during HIP stream finalization.
Now it should print something like "[!] Error in HIPStream finalizer: code - description".
If you see it, please mention it here.

@pxl-th merged commit 3171d2b into master, Jul 6, 2023
@pxl-th deleted the pxl-th/hiprtc branch, July 6, 2023 05:47
@vchuravy (Member) commented Jul 6, 2023

Congratulations! That was quite the PR :)

@pxl-th (Member, Author) commented Jul 6, 2023

Thank you :) It was well worth it!

@Krastanov

Hey, folks! I am an AMDGPU user, so I have been following the repository, but I am not experienced enough to understand the reasons behind this (very impressive looking) effort. In the last couple of months I have also started trying to curate efforts like this and make them more visible and understandable to laymen in the new Julia newsletter.

This seems like one such effort, but the problem is that I am one of the laymen in this situation. Could you consider writing a short paragraph about why this is important and what the difference is between HIP and HSA, so that I can include it in the newsletter?

@pxl-th (Member, Author) commented Jul 6, 2023

@Krastanov thanks!

This PR moved everything onto HIP streams.
Before, we'd launch Julia-generated kernels on HSA queues, while things like matmul (rocBLAS), convolutions (MIOpen), etc. ran on HIP streams.

But because an HSA queue does not know anything about a HIP stream (and vice versa), you needed to synchronize on the host to prevent races.
Consider (x .* y) * z:

  • x .* y was dispatched on HSA queue
  • we waited on host for it to complete
  • only then we'd launch ... * z on HIP stream

This slowed things down quite a lot.

Moving everything to HIP allows us to skip synchronization on the host for x .* y and launch matmul immediately after, because everything is stream-ordered.
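
Illustrative only (array shapes made up for the example): both operations below are enqueued on the same HIP stream and ordered on the device, so the host only waits at the explicit synchronize.

using AMDGPU

x = AMDGPU.rand(Float32, 1024, 1024)
y = AMDGPU.rand(Float32, 1024, 1024)
z = AMDGPU.rand(Float32, 1024, 1024)

out = (x .* y) * z   # broadcast kernel, then rocBLAS matmul, no host-side wait in between
AMDGPU.synchronize() # single wait once the result is actually needed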

Besides that, memory operations are now asynchronous and stream-ordered as well.
This means we no longer need to "preserve" arguments for the duration of the kernel to avoid accidental freeing of the resources.

Besides the HIP-related stuff, there are some improvements to global hostcalls, which are now launched and paused automatically when they are used.
This also helps performance, as each runs in its own thread, constantly polling the signals used for GPU <-> CPU communication.

And many other small changes.

As for some numbers, all this improved the performance of (yet unreleased) StableDiffusion from 40 seconds to 7-8 seconds on RX 6700 XT (20 steps of diffusion).
