Use HIP as kernel backend instead of HSA #423
Conversation
That is not ideal. Is there a way for HIP to accept .so files? The issue here is that the primary source of incompatibilities between the ROCm stack and Julia is the LLVM version mismatch. I think we can fix parts of it, but we need a "neutral" interchange format like an ELF file. I wonder if we could reduce the overhead coming from the other direction: HIP uses HSA internally, and maybe they could expose an interface to access the HSA events behind a stream? Generally in favor of using KernelState for exception handling.
Yes, there is. Just confirmed that the previous way we did compilation works. This PR in particular tries to solve the following issue (besides making things faster):
However, there seems to be no way to define extinit global variables like there is with HSA, so the way we did exceptions, printing, malloc, etc. via hostcalls does not work and requires changes...
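For readers unfamiliar with the hostcall mechanism mentioned above, here is a rough host-only analogy (all names are illustrative and are not AMDGPU.jl API): the device writes a request into a shared buffer and raises a flag; a host task polls that flag, services the request (print, malloc, exception report), and clears it.

# Host-only mock of the hostcall idea; MockHostcall and host_worker are made up for illustration.
mutable struct MockHostcall
    ready::Threads.Atomic{Bool}
    arg::Float32
    ret::Float32
end

function host_worker(hc::MockHostcall)
    while !hc.ready[]      # host task polls the "device has submitted work" flag
        yield()
    end
    hc.ret = sqrt(hc.arg)  # service the request (stand-in for print/malloc/...)
    hc.ready[] = false
end

hc = MockHostcall(Threads.Atomic{Bool}(false), 0f0, 0f0)
t = @async host_worker(hc)
hc.arg = 2f0               # "device" side fills in the arguments...
hc.ready[] = true          # ...and signals the host
wait(t)
@show hc.ret

The real implementation relies on device-visible global buffers (the extinit globals in question), which is why losing them under HIP modules is a problem.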
Maybe, but I am more worried about the LLVM inconsistencies. We are fixing a little bit of this on the Julia side come Julia v1.10, but removing the "requiring matched LLVM versions" constraint is the more important thing to fix. If we can give HIP the .so, I am all for this. Death to HSA, long live HIP 😢
Hm, I thought comgr handled that. Maybe we need to take another look at how hipcc deals with these.
There is a strange bug with reporting exception frames though (see PR description about frame reporting addition). But the next time you do:

[120672] signal (11.1): Segmentation fault
in expression starting at none:0
jl_valid_type_param at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:1282 [inlined]
jl_f_apply_type at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:1328
unsafe_load at /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/output.jl:140
unknown function (ip: 0x7f18540ba622)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
#33 at /home/pxl-th/.julia/dev/AMDGPU/src/compiler/output_context.jl:38
unknown function (ip: 0x7f18540b9d05)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/builtins.c:730
macro expansion at /home/pxl-th/.julia/dev/AMDGPU/src/device/gcn/hostcall.jl:305 [inlined]
#39 at ./threadingconstructs.jl:373
unknown function (ip: 0x7f18540b5e1f)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
start_task at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 15884493 (Pool: 15871716; Big: 12777); GC: 23
Segmentation fault (core dumped)

The line in question is AMDGPU.jl/src/device/gcn/output.jl, line 140 at afa0508, where it tries to get the type from a pointer to it. The pointer comes from AMDGPU.jl/src/device/gcn/output.jl, lines 163 to 165 at afa0508, and is written to the hostcall buffer on the device. Maybe when there is no precompilation, things go wrong (maybe something goes to cache...) :/ This happens for
In the first session with precompilation we get matching pointers to
In the second session without precompilation we get different ones:
And
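To make the pointer discussion concrete, here is a minimal sketch (not the actual output.jl code) of the round trip involved: a Julia type object's address can be taken and dereferenced back within one session, but the address is not stable across sessions, so a pointer baked into cached code points at garbage the next time.

# Sketch of the pointer <-> type-object round trip; valid only within a single session.
T = Float32
p = Base.pointer_from_objref(T)        # address of the Float32 type object
T2 = Base.unsafe_pointer_to_objref(p)  # dereference it back
@assert T === T2
println(p)                             # this value differs from one Julia session to the next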
Yeah, that seems like an illegal generated function, and if it happens to be cached, things will go badly.
We will have to come up with a better solution when we want to support caching anyway... I would try just making it pure instead of generated?
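A sketch of the two variants being discussed (names are illustrative, not AMDGPU.jl code): a @generated function splices the pointer value into the code it returns, so if that generated code is cached and reused in a new session the pointer is stale; computing the pointer at run time avoids that.

# Illustrative only: type_ptr_generated bakes the pointer in at code-generation time,
# which is exactly what goes wrong when the generated code is cached across sessions;
# type_ptr_runtime recomputes it on every call.
@generated function type_ptr_generated(::Type{T}) where T
    p = UInt(Base.pointer_from_objref(T))  # evaluated once, when the code is generated
    return :($p)
end

type_ptr_runtime(::Type{T}) where T =
    UInt(Base.pointer_from_objref(T))      # evaluated at run time, always current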
Same result. BTW, this happens only for exception reporting. Regular
Placing the code of that function directly at the call site also gives the same result.
Nerf.jl benchmark, 1000 training steps, 512 batch size:
Need to debug what's causing so many allocations.
The downside is that we now have even more memory pressure, which leads to the same situation where the HSA queue is unable to allocate the resources it needs for kernel dispatch.
This is likely because we are now dispatching on the host too fast and not freeing allocations in time. I had to resort to more manual memory freeing, like adding AMDGPU.Mem.definitely_free, which captures all device allocations during the user's function execution and frees them immediately after:

model = # some Flux model, for example

AMDGPU.Mem.definitely_free() do
    gradients = Zygote.gradient(model) do m
        y = model(x)
    end
    apply!(optimizer, gradients, model)
end

This may be fixed by ROCm 5.5, where you can specify a hard limit on the memory pool size.
Testing the following memcpy code, trying to saturate the memory bandwidth on MI250x, results in the following scaling. Code used:

using AMDGPU
using Printf
using CairoMakie
using BenchmarkTools

function mycopy!(dst, src)
    I = (AMDGPU.workgroupIdx().x - 1) * AMDGPU.workgroupDim().x + AMDGPU.workitemIdx().x
    if I <= length(dst)
        @inbounds dst[I] = src[I]
    end
    return
end

function main(; N=256)
    dst = ROCArray{Float64}(undef, N)
    src = ROCArray{Float64}(undef, N)
    nthreads = 256
    # nblocks = N # master
    nblocks = cld(N, nthreads) # pxl-th/hiprtc
    GC.gc()
    GC.enable(false)
    # profile
    tm = @belapsed begin
        @roc groupsize=$nthreads gridsize=$nblocks mycopy!($dst, $src)
        AMDGPU.synchronize()
    end # evals=10 samples=100
    mtp = 2 * N * sizeof(eltype(dst)) / tm * 1e-9 # GB/s
    # @printf "mem_rate = %1.2e \n" mtp
    GC.enable(true)
    return mtp
end

function run_benchmark(N_rng)
    println("memcopy AMDGPU"); B_memcopy = [main(N=N) for N in N_rng]
    fig = Figure(fontsize=24)
    ax = Axis(fig[1, 1]; xscale=log2, xlabel="n", ylabel="GB/s")
    scatterlines!(ax, N_rng, B_memcopy; label="memcopy AMDGPU")
    axislegend(ax; position=:lt)
    return fig
end

save("memcopy.png", run_benchmark((32 * 2 .^ (1:10)) .^ 2))
On Julia nightly with LLVM 15 only one test fails (non-contiguous softmax):
Everything else passes. UPD: Ooh, this is actually the CPU softmax from NNlib (we compare against NNlib in tests), unrelated to AMDGPU.
ROCm 5.5 still fails (Julia Version 1.10.0-DEV.1406, Commit ca3270b06f4, 2023-05-31, 0 days old master):
julia> using AMDGPU
julia> AMDGPU.versioninfo()
Using ROCm provided by: System
HSA Runtime (ready)
- Path: /opt/rocm-5.5.0/lib/libhsa-runtime64.so
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
HIP Runtime (ready)
- Path: /opt/rocm-5.5.0/lib/libamdhip64.so
rocBLAS (ready)
- Path: /opt/rocm-5.5.0/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm-5.5.0/lib/librocsolver.so
rocALUTION (ready)
- Path: /opt/rocm-5.5.0/lib/librocalution.so
rocSPARSE (ready)
- Path: /opt/rocm-5.5.0/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm-5.5.0/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm-5.5.0/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm-5.5.0/lib/libMIOpen.so
HSA Agents (2):
- CPU-XX [AMD Ryzen 7 5800X 8-Core Processor]
- GPU-XX [AMD Radeon RX 6700 XT (gfx1030)]
julia> x = AMDGPU.rand(Float32, 1024);
julia> sin.(x)
error: Opaque pointers are only supported in -opaque-pointers mode (Producer: 'LLVM16.0.0git' Reader: 'LLVM 15.0.7jl')
Yeah, for ROCm 5.5 we need to build the device library with LLVM 14/LLVM 15. Can you post the LLVM 15 issue standalone to JuliaLang, i.e. a simple MWE that I can look at?
But what is needed to be able to run with LLVM 16? Eventually we'd need that, I guess.
What bothers me here #423 (comment) is that I don't get why sync time should be sensitive to vector size. Updating the benchmark results from #423 (comment), with timing now done as:

tm = 0.0
for it in 1:1000
    if (it == 100) tm = time_ns() end
    @roc groupsize=nthreads gridsize=nblocks mycopy!(dst, src)
end
AMDGPU.synchronize()
tm = (time_ns() - tm) * 1e-9
mtp = 2 * N * sizeof(eltype(dst)) * 900 / tm * 1e-9 # GB/s

It's still interesting that one can "only" reach ~76% of the announced peak memory throughput (1.6 TB/s on MI250x).
My point is that there are two different issues.
So as long as we can generate the device bitcode file with compatible settings, we can use newer versions of LLVM for HIP.
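A small diagnostic sketch along these lines (the ld.lld path is an assumption taken from the versioninfo() output earlier in the thread): compare the LLVM version Julia was built against with the one shipped by the ROCm toolchain, which is the mismatch behind the opaque-pointers error above.

# Hedged sketch: surface the Julia-vs-ROCm LLVM version mismatch discussed above.
julia_llvm = Base.libllvm_version                              # LLVM that Julia was built with
rocm_lld   = readchomp(`/opt/rocm/llvm/bin/ld.lld --version`)  # ROCm toolchain LLVM
println("Julia LLVM:  ", julia_llvm)
println("ROCm ld.lld: ", rocm_lld)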
What results are you getting if we remove all the unnecessary stuff, like with the code below?

using AMDGPU
using CairoMakie

function mycopy!(dst, src)
    i = (workgroupIdx().x - UInt32(1)) * workgroupDim().x + workitemIdx().x
    @inbounds dst[i] = src[i]
    return
end

function main(; N)
    dst = ROCArray{Float64}(undef, N)
    src = ROCArray{Float64}(undef, N)
    nthreads = 256
    nblocks = cld(N, nthreads)
    @assert N % nthreads == 0
    GC.gc()
    GC.enable(false)
    iters, warmup = 1000, 100
    stream = AMDGPU.stream()
    kernel = @roc launch=false mycopy!(dst, src)
    for i in 1:warmup
        kernel(dst, src; stream, gridsize=nblocks, groupsize=nthreads)
    end
    AMDGPU.synchronize(stream) # Otherwise we'll be timing warmup kernels as well.
    t_start = time_ns()
    for i in 1:(iters - warmup)
        kernel(dst, src; stream, gridsize=nblocks, groupsize=nthreads)
    end
    AMDGPU.synchronize(stream)
    t_end = time_ns()
    tm = (t_end - t_start) * 1e-9
    mtp = 2 * N * sizeof(eltype(dst)) * (iters - warmup) / tm * 1e-9 # GB/s
    AMDGPU.unsafe_free!(dst)
    AMDGPU.unsafe_free!(src)
    GC.enable(true)
    return mtp
end

function run_benchmark(N_rng)
    B_memcopy = [main(; N) for N in N_rng]
    fig = Figure(fontsize=24)
    ax = Axis(fig[1, 1]; xscale=log2, xlabel="n", ylabel="GB/s")
    scatterlines!(ax, N_rng, B_memcopy; label="memcopy AMDGPU")
    axislegend(ax; position=:lt)
    return fig
end

save("memcopy.png", run_benchmark((32 * 2 .^ (1:10)) .^ 2))
I'll also add timing using HIP events: https://docs.amd.com/bundle/HIP-API-Guide-v5.4/page/a00185.html#gad4128b815cb475c8e13c7e66ff6250b7
That way we'll be able to measure more precisely.
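A rough sketch of what event-based timing could look like (not the PR's implementation): it goes through raw ccall into libamdhip64, assumes the HIP runtime is on the loader path, takes the raw hipStream_t handle of the stream the kernels are enqueued on (obtaining it from AMDGPU.stream() is left out), and omits error checking.

const libhip = "libamdhip64"

function hip_elapsed_ms(f::Function, stream::Ptr{Cvoid})
    start_evt = Ref{Ptr{Cvoid}}(C_NULL)
    stop_evt  = Ref{Ptr{Cvoid}}(C_NULL)
    ccall((:hipEventCreate, libhip), Cint, (Ptr{Ptr{Cvoid}},), start_evt)
    ccall((:hipEventCreate, libhip), Cint, (Ptr{Ptr{Cvoid}},), stop_evt)

    ccall((:hipEventRecord, libhip), Cint, (Ptr{Cvoid}, Ptr{Cvoid}), start_evt[], stream)
    f()  # enqueue the kernels to be timed
    ccall((:hipEventRecord, libhip), Cint, (Ptr{Cvoid}, Ptr{Cvoid}), stop_evt[], stream)
    ccall((:hipEventSynchronize, libhip), Cint, (Ptr{Cvoid},), stop_evt[])

    ms = Ref{Cfloat}(0)
    ccall((:hipEventElapsedTime, libhip), Cint, (Ptr{Cfloat}, Ptr{Cvoid}, Ptr{Cvoid}),
        ms, start_evt[], stop_evt[])

    ccall((:hipEventDestroy, libhip), Cint, (Ptr{Cvoid},), start_evt[])
    ccall((:hipEventDestroy, libhip), Cint, (Ptr{Cvoid},), stop_evt[])
    return ms[]
end

Because the events are recorded on the stream itself, the measured interval covers only device execution, not host-side launch latency.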
Running the new code you suggest improves performance by about 10%. Now the peak is at 1.38 TB/s (compared to 1.26 TB/s with the previous version).
Note that Julia still hangs on
The latest commit no longer caches the configuration of binary dependencies and instead does runtime discovery.
I'm not able to reproduce this easily, but it happens during HIP stream finalization.
Some of the functions that require device libraries are not there during the 'link_libraries!' stage, but appear much later. As a temporary fix, link them at the 'finish_ir!' stage.
Congratulations! That was quite the PR :)
Thank you :) It was well worth it!
Hey, folks! I am an AMDGPU user, so I have been following the repository, but I am not experienced enough to understand the reasons behind this (very impressive looking) effort. In the last couple of months I have also started trying to curate efforts like this and make them more visible and understandable to laymen in the new Julia newsletter. This seems like one such effort, but the problem is that I am one of the laymen in this situation. Could you consider writing a short paragraph about why this is important and what the difference is between HIP and HSA, so that I can include it in the newsletter?
@Krastanov thanks! This PR moved everything to be on HIP streams. Previously, because an HSA queue does not know anything about a HIP stream (and vice versa), you needed to synchronize on the host to prevent racing between them.
This slowed things down quite a lot. Moving everything to HIP allows us to skip that synchronization on the host. Besides that, memory operations are now asynchronous and stream-ordered as well. Beyond the HIP-related stuff, there are some improvements to global hostcalls, which are now launched and paused automatically when they are used, and many other small changes. As for some numbers, all of this improved the performance of the (yet unreleased) StableDiffusion from 40 seconds to 7-8 seconds on an RX 6700 XT (20 steps of diffusion).
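A hedged illustration of what this means in user code (the kernel and sizes are made up; this is not code from the PR): previously a host-side synchronize was needed between a custom kernel and a rocBLAS call purely to order the HSA queue against the HIP stream; with everything on one HIP stream, the stream itself preserves the ordering.

using AMDGPU, LinearAlgebra

# scale! stands in for any hand-written kernel; mul! dispatches to rocBLAS.
function scale!(y, x)
    i = (workgroupIdx().x - UInt32(1)) * workgroupDim().x + workitemIdx().x
    if i <= length(y)
        @inbounds y[i] = 2f0 * x[i]
    end
    return
end

A = AMDGPU.rand(Float32, 256, 256)
x = AMDGPU.rand(Float32, 256)
y = similar(x)
z = similar(x)

@roc groupsize=256 gridsize=1 scale!(y, x)
# Before this PR: the kernel ran on an HSA queue while rocBLAS ran on a HIP stream,
# so an AMDGPU.synchronize() was required here to keep mul! from racing on y.
# Now both operations are ordered on the same HIP stream and no host barrier is needed.
mul!(z, A, y)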
- Bump minimum required ROCm version to 5.3 because of the stream-ordered allocator.
- Use HIP modules instead of HSA.
- Remove SyncState. Since we now use only HIPStreams, there is no need to keep track of different kernel launches.
- Fix failing tests for findmax. Fixes: Test failures locally on 1.9.0-beta4 -- Radeon 6800XT #400 (comment)
- Do not specialize on shared memory size in reduce kernels.
- Do not cache binary dependency config; do runtime discovery instead. Fixes: First install with JULIA_AMDGPU_DISABLE_ARTIFACTS leads to broken config #424
- Report exception frames (when launching with -g2). Additionally, exception reporting now does not rely on hostcalls, thus there is no performance penalty. And with an exec-once gate (⊡ macro) we prevent duplication of exceptions from multiple threads and multiple grid elements, as can be seen from the example below. This also allows us to report the precise location of the exception.
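A minimal usage sketch of the exception-frame reporting (illustrative code, not taken from the PR): a kernel that throws on the device, run under a Julia session started with -g2 so that the report includes the offending frame.

# Start Julia with `julia -g2`, then:
using AMDGPU

function faulty_kernel!(x)
    i = workitemIdx().x
    @inbounds x[i] = sqrt(x[i] - 2f0)  # sqrt of a negative number raises a DomainError on device
    return
end

x = ROCArray(Float32[1f0])
@roc faulty_kernel!(x)
AMDGPU.synchronize()  # the device exception (with frames under -g2) surfaces here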