Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel hang related to noreturn function attributes #113

Open
maleadt opened this issue Mar 3, 2023 · 30 comments
Open

Kernel hang related to noreturn function attributes #113

maleadt opened this issue Mar 3, 2023 · 30 comments
Labels
kernels Things about kernels and how they are compiled. upstream Out of our hands

Comments

@maleadt
Copy link
Member

maleadt commented Mar 3, 2023

Although Metal.jl works fine on Julia 1.9 locally, it for some reason fails on CI. Maybe this is related to the juliaecosystem machines running an outdated macOS (12.4, while I'm running 13, but 12.6 was also reported to work fine in #85 (comment)).

@christiangnrd
Copy link
Contributor

christiangnrd commented Mar 3, 2023

I'm on 13.2.1 but I can reproduce the hanging behaviour during tests locally. Except for my particular machine, instead of hanging during gpuarrays/random, it consistently hangs during gpuarrays/broadcasting.

I compared activity monitor behaviour between julia --project test/runstests gpuarrays/broadcasting (1.8.5) and julia +beta --project test/runstests gpuarrays/broadcasting (1.9.0-beta4) and in 1.8, for the julia process, cpu is at 97% the whole time and gpu is at ~20% until the test completes, while in 1.9 beta, the cpu is at 100% for a few seconds at the start while gpu is at ~4% for the julia process, then it very quickly drops to 0% for both and the process never finishes.

Another thing I've noticed is that when I stop the test with ctrl+c, I get Distributed.jl warnings telling me that the process was not removed, and indeed looking at activity monitor, I have 5 Julia processes that shouldn't be there.

@maleadt
Copy link
Member Author

maleadt commented Mar 3, 2023

Another thing I've noticed is that when I stop the test with ctrl+c, I get Distributed.jl warnings telling me that the process was not removed, and indeed looking at activity monitor, I have 5 Julia processes that shouldn't be there.

That's just the test suite runner capturing your interrupt and exiting. One way to run tests in isolation is, from the Metal.jl repository, do something like julia --project -e 'using Pkg; Pkg.test(; julia_args=gpuarrays/broadcast), where the positional args passed as julia_args indicate the tests you want to run (try passing --help for more information). Hopefully the backtrace you get then from interrupting the process is more informative? Ideally we'd isolate this down to the single operation that makes Metal.jl hang on 1.9.

@christiangnrd
Copy link
Contributor

I'm having trouble getting the test to run outside of the test suite runner because it comes from GPUArrays and I can't figure out how to call it from Metal.

I also did a bisect of Julia, and it seems like JuliaLang/julia@a12c2f0 is the commit that caused the issue (or at least caused it to surface).

@maleadt
Copy link
Member Author

maleadt commented Mar 4, 2023

To run code from the GPUArrays test suite, you can do something like:

julia --project=test -L test/setup.jl
julia> AT=MtlArray
julia> eltypes = [Int16, Int32, Int64, Complex{Int16}, Complex{Int32}, Complex{Int64}, ComplexF16, ComplexF32]

That should set-up the required environment. Note that you also might have to start Julia with --check-bounds=yes, like the test runner does.

@christiangnrd
Copy link
Contributor

christiangnrd commented Mar 4, 2023

I found where the test hangs for my machine.
This line, more specifically, here when it tries to run getindex.(Ref(x), 1) on the gpu array. I looked at the generated llvm and I'll be posting each version in their own comment. The one difference of note between the two is that the llvm code from 1.9.0-beta 4 has store atomic {}* %0, {}** %14 release, align 8 that is missing in the 1.8.5 version.

The code I'm referring to is right above the line that looks like

; └└

@christiangnrd
Copy link
Contributor

christiangnrd commented Mar 4, 2023

Julia 1.8.5 output of @code_llvm f(gpu_in...) where f is x->getindex.(Ref(x), 1), and gpu_in is (Int16[0],):

;  @ none within `#315`
define i16 @"julia_#315_1053"({}* nonnull align 8 dereferenceable(32) %0) #0 {
top:
  %gcframe4 = alloca [3 x {}*], align 16
  %gcframe4.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 0
  %1 = bitcast [3 x {}*]* %gcframe4 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
  %2 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 2
  %3 = alloca { { {}*, i64 } }, align 8
  %4 = call {}*** inttoptr (i64 7226199708 to {}*** (i64)*)(i64 260) #4
; ┌ @ refpointer.jl:134 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
    %5 = bitcast [3 x {}*]* %gcframe4 to i64*
    store i64 4, i64* %5, align 16
    %6 = load {}**, {}*** %4, align 8
    %7 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 1
    %8 = bitcast {}** %7 to {}***
    store {}** %6, {}*** %8, align 8
    %9 = bitcast {}*** %4 to {}***
    store {}** %gcframe4.sub, {}*** %9, align 8
    %ptls_field5 = getelementptr inbounds {}**, {}*** %4, i64 2
    %10 = bitcast {}*** %ptls_field5 to i8**
    %ptls_load67 = load i8*, i8** %10, align 8
    %11 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load67, i32 1392, i32 16) #5
    %12 = bitcast {}* %11 to i64*
    %13 = getelementptr inbounds i64, i64* %12, i64 -1
    store atomic i64 4547519648, i64* %13 unordered, align 8
    %14 = bitcast {}* %11 to {}**
    store {}* %0, {}** %14, align 8                                       ; HERE
; └└
; ┌ @ broadcast.jl:860 within `materialize`
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 0
   store {}* %11, {}** %2, align 16
   store {}* %11, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 1
   store i64 1, i64* %.fca.0.1.gep, align 8
   %15 = call i16 @j_copy_1055({ { {}*, i64 } }* nocapture readonly %3) #0
   %16 = load {}*, {}** %7, align 8
   %17 = bitcast {}*** %4 to {}**
   store {}* %16, {}** %17, align 8
; └
  ret i16 %15
}

@christiangnrd
Copy link
Contributor

christiangnrd commented Mar 4, 2023

Julia 1.9.0-beta4 output of @code_llvm f(gpu_in...) where f is x->getindex.(Ref(x), 1), and gpu_in is (Int16[0],):

;  @ none within `#315`
define i16 @"julia_#315_1018"({}* noundef nonnull align 8 dereferenceable(32) %0) #0 {
top:
  %gcframe4 = alloca [3 x {}*], align 16
  %gcframe4.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 0
  %1 = bitcast [3 x {}*]* %gcframe4 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
  %2 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 2
  %3 = alloca { { {}*, i64 } }, align 8
  %4 = call {}*** inttoptr (i64 7226199708 to {}*** (i64)*)(i64 261) #3
; ┌ @ refpointer.jl:136 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
    %5 = bitcast [3 x {}*]* %gcframe4 to i64*
    store i64 4, i64* %5, align 16
    %6 = load {}**, {}*** %4, align 8
    %7 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 1
    %8 = bitcast {}** %7 to {}***
    store {}** %6, {}*** %8, align 8
    %9 = bitcast {}*** %4 to {}***
    store {}** %gcframe4.sub, {}*** %9, align 8
    %ptls_field5 = getelementptr inbounds {}**, {}*** %4, i64 2
    %10 = bitcast {}*** %ptls_field5 to i8**
    %ptls_load67 = load i8*, i8** %10, align 8
    %11 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load67, i32 1392, i32 16) #4
    %12 = bitcast {}* %11 to i64*
    %13 = getelementptr inbounds i64, i64* %12, i64 -1
    store atomic i64 6245911824, i64* %13 unordered, align 8
    %14 = bitcast {}* %11 to {}**
    store {}* null, {}** %14, align 8                                     ; HERE
    store atomic {}* %0, {}** %14 release, align 8                        ; AND ALSO HERE
; └└
; ┌ @ broadcast.jl:873 within `materialize`
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 0
   store {}* %11, {}** %2, align 16
   store {}* %11, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %3, i64 0, i32 0, i32 1
   store i64 1, i64* %.fca.0.1.gep, align 8
   %15 = call i16 @j_copy_1020({ { {}*, i64 } }* nocapture readonly %3) #0
   %16 = load {}*, {}** %7, align 8
   %17 = bitcast {}*** %4 to {}**
   store {}* %16, {}** %17, align 8
; └
  ret i16 %15
}

@christiangnrd
Copy link
Contributor

I was hoping the removal of libcmt would magically fix the 1.9 ci issues but unfortunately the tests still hang.

@maleadt
Copy link
Member Author

maleadt commented Mar 10, 2023

Sadly this looks codegen related; the atomic store is probably a good clue. I hope I'll have time to investigate next week, but it's problematic that I can't reproduce this.

Which hardware do you have exactly?

@christiangnrd
Copy link
Contributor

I have a 30-core M2 Max Macbook Pro and all of my comments have been using that computer.

I also have access to a base model M1 Mac mini in my lab where I was able to reproduce the hang.

@christiangnrd
Copy link
Contributor

'I ran test Metal many times today on the lab M1 and I have some gist of test outputs. All were run using the currently released version so I don’t know how useful they’ll be, but I got some really weird results on 1.8.5 and 1.9.0-rc1.

Of the three 1.8.5 tests I ran and saved, the first one had a failure in gpuarrays/reductions/== isequal (and gpuarrays/math/power had no tests??), but the other 2 I ran right after passed. I tried to reproduce this failure on my M2 Max but all my 1.8.5 runs passed.
Gist for these tests

Then, I ran the Metal 0.2.0 tests on 1.9.0-rc1, and for the first time since seeing this issue, a test pass completed on 1.9.0, although with some errors. Gist

Hopeful, I reran it on 1.9.0-rc1 twice and unfortunately, both times the test hanged, with some errors in previous tests. The second time had a new error in the unified memory example (hint?) Gist 1, Gist 2

All of the gists are from M1 runs. On my M2 Max, all tests consistently pass on 1.8.5, and I never get any errors on 1.9.0-rc1 (other than broadcasting hanging).

I don’t know how useful these gists will be, but I figure since you can’t reproduce, I might as well give you as much as you can. Last time I had a bug like this that was very inconsistent, it ended being that I wasn’t initializing some values but they weren’t caught in debug mode because all the memory gets 0 initialized when running the debugger.

I’ll run the tests a few more times in the background this weekend from the Master branch to see if anything has changed since there’s been quite a few changes.

@maleadt
Copy link
Member Author

maleadt commented Mar 15, 2023

I haven't had the time to reproduce (probably only next week), but since you have a system on which the tests consistently hang: can you post the MWE that makes it hang in a clean session, and could you try running with julia -g2 or after setting MTL_DEBUG_LAYER=1 and MTL_SHADER_VALIDATION=1 in your environment before loading Metal.jl?

@christiangnrd
Copy link
Contributor

MWE (running from the Metal folder):

# Only run the failing test
$ julia --project=test -e'using Pkg; Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="metal-mwe-hang")'

# 1.8.5: Passes
$ julia --project=test test/runtests.jl 'gpuarrays/broadcasting'

# 1.8.5: Passes
$ julia -g2 --project=test test/runtests.jl 'gpuarrays/broadcasting'

# 1.9.0-rc1: Fails (hang)
$ Julia +beta --project=test test/runtests.jl 'gpuarrays/broadcasting'

# 1.9.0-rc1: Passes
$ Julia +beta -g2 --project=test test/runtests.jl 'gpuarrays/broadcasting'

@maleadt
Copy link
Member Author

maleadt commented Mar 15, 2023

That's not really a minimal example, can you reduce it to any of the tests that make the GPU hang? You mentioned x->getindex.(Ref(x), 1) above?

EDIT: OK, I have access to a system on which this hangs as well. I'll try reducing next week.

@christiangnrd
Copy link
Contributor

The metal-mwe-hang branch has all broadcasting tests but the x->getindex.(Ref(x), 1) one commented out so it should pass or hang within a minute. I'm trying to reduce it but at the moment my code works.

Faster to run code of the same as above

using Pkg;
Pkg.activate(temp=true);
Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="metal-mwe-hang")
Pkg.add(url="https://github.com/JuliaGPU/Metal.jl", rev="main")
Pkg.test("Metal", test_args=["gpuarrays/broadcasting"])

@maleadt
Copy link
Member Author

maleadt commented Mar 15, 2023

So it doesn't hang if you run that operation in isolation? Did you try with --check-bounds=yes, to mimic the test runner?

@christiangnrd
Copy link
Contributor

I forgot about --check-bounds=yes. Here's a real MWE for you:

using Pkg;
Pkg.activate(temp=true);
Pkg.add(url="https://github.com/JuliaGPU/Metal.jl", rev="main")
using Metal

getindex.(Ref(MtlArray([0])), 1)

Pasting the above code into the REPL after starting julia in the following ways did not hang:

  • julia --check-bounds=yes (1.8.5)
  • MTL_SHADER_VALIDATION=1 julia --check-bounds=yes (1.8.5)
  • MTL_SHADER_VALIDATION=1 julia +beta --check-bounds=yes (1.9.0-rc1)

However, when julia was started with j +beta --check-bounds=yes, it hangs.

I dumped the generated llvm code for each version. Both 1.8.5 versions were identical, I'm pretty sure both 1.9.0-rc1 versions are identical (some function names different), but I put both in in case I'm wrong. The difference between 1.8.5 and 1.9.0-rc1 is the noundef in the function definition.

Setting MTL_DEBUG_LAYER=1 in my environment before launching julia made no difference in the results.

@code_llvm for 1.8.5 (both)

;  @ none within `##dotfunction#312#1`
define i64 @"julia_##dotfunction#312#1_626"({}* nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
  %4 = alloca { { {}*, i64 } }, align 8
  %5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 260) #3
; ┌ @ broadcast.jl:860 within `materialize`
   %6 = bitcast [3 x {}*]* %gcframe2 to i64*
   store i64 4, i64* %6, align 16
   %7 = load {}**, {}*** %5, align 8
   %8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
   %9 = bitcast {}** %8 to {}***
   store {}** %7, {}*** %9, align 8
   %10 = bitcast {}*** %5 to {}***
   store {}** %gcframe2.sub, {}*** %10, align 8
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
   store {}* %0, {}** %3, align 16
   store {}* %0, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
   store i64 %1, i64* %.fca.0.1.gep, align 8
   %11 = call i64 @j_copy_628({ { {}*, i64 } }* nocapture readonly %4) #0
   %12 = load {}*, {}** %8, align 8
   %13 = bitcast {}*** %5 to {}**
   store {}* %12, {}** %13, align 8
; └
  ret i64 %11
}

@code_llvm for 1.9.0-rc1 with shader validation (no hang)

2023-03-15 18:43:39.747 julia[55303:1772369] Metal GPU Validation Enabled
;  @ none within `##dotfunction#292#3`
define i64 @"julia_##dotfunction#292#3_554"({}* noundef nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
  %4 = alloca { { {}*, i64 } }, align 8
  %5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 261) #2
; ┌ @ broadcast.jl:873 within `materialize`
   %6 = bitcast [3 x {}*]* %gcframe2 to i64*
   store i64 4, i64* %6, align 16
   %7 = load {}**, {}*** %5, align 8
   %8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
   %9 = bitcast {}** %8 to {}***
   store {}** %7, {}*** %9, align 8
   %10 = bitcast {}*** %5 to {}***
   store {}** %gcframe2.sub, {}*** %10, align 8
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
   store {}* %0, {}** %3, align 16
   store {}* %0, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
   store i64 %1, i64* %.fca.0.1.gep, align 8
   %11 = call i64 @j_copy_556({ { {}*, i64 } }* nocapture readonly %4) #0
   %12 = load {}*, {}** %8, align 8
   %13 = bitcast {}*** %5 to {}**
   store {}* %12, {}** %13, align 8
; └
  ret i64 %11
}

@code_llvm for 1.9.0-rc1 with no shader validation (hang)

;  @ none within `##dotfunction#292#3`
define i64 @"julia_##dotfunction#292#3_528"({}* noundef nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %2 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %2, i8 0, i32 24, i1 false)
  %3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
  %4 = alloca { { {}*, i64 } }, align 8
  %5 = call {}*** inttoptr (i64 7005032092 to {}*** (i64)*)(i64 261) #2
; ┌ @ broadcast.jl:873 within `materialize`
   %6 = bitcast [3 x {}*]* %gcframe2 to i64*
   store i64 4, i64* %6, align 16
   %7 = load {}**, {}*** %5, align 8
   %8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
   %9 = bitcast {}** %8 to {}***
   store {}** %7, {}*** %9, align 8
   %10 = bitcast {}*** %5 to {}***
   store {}** %gcframe2.sub, {}*** %10, align 8
   %.fca.0.0.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 0
   store {}* %0, {}** %3, align 16
   store {}* %0, {}** %.fca.0.0.gep, align 8
   %.fca.0.1.gep = getelementptr inbounds { { {}*, i64 } }, { { {}*, i64 } }* %4, i64 0, i32 0, i32 1
   store i64 %1, i64* %.fca.0.1.gep, align 8
   %11 = call i64 @j_copy_530({ { {}*, i64 } }* nocapture readonly %4) #0
   %12 = load {}*, {}** %8, align 8
   %13 = bitcast {}*** %5 to {}**
   store {}* %12, {}** %13, align 8
; └
  ret i64 %11
}

@maleadt
Copy link
Member Author

maleadt commented Mar 16, 2023

Thanks! This gets us much closer to something debuggable.

FYI, the debug layer needs both MTL_DEBUG_LAYER and MTL_SHADER_VALIDATION set, the first one enables debug mode, the second one toggles a specific kind of debug validation.

@maleadt
Copy link
Member Author

maleadt commented Mar 17, 2023

Looks like that specific failure doesn't reproduce anymore after JuliaGPU/GPUArrays.jl#454, so let's try bumping GPUArrays to at least work around the immediate issue.

@maleadt
Copy link
Member Author

maleadt commented Mar 17, 2023

MWE:

using Metal

function kernel(dest, nelem)
    j = 0
    while j < nelem
        j += 1
        i = Metal.thread_position_in_grid_1d() + (j-1) * Metal.threads_per_grid_1d()
        i > length(dest) && return
        I = @inbounds CartesianIndices(dest)[i]
        @inbounds dest[I] = 42
    end
    return
end

arr = MtlArray{Int64}(undef)
Metal.@sync @metal kernel(arr, 1)

@maleadt
Copy link
Member Author

maleadt commented Mar 17, 2023

I've spent some time debugging this, and I don't notice significant differences between the --check-bounds=yes IR on 1.8 and 1.9. Specifically, there were two differences:

  • some additional noreturn function attributes
  • a missing SDK Version module flag (well, it's there but with uninitialized values)

The former doesn't seem to be the culprit, I think (after manually stripping those attributes and still reproducing the hang). The latter may be related, but I wonder why our LLVM back-end messes up here. We have this code, https://github.com/JuliaGPU/llvm-metal/blob/llvm_release_14/llvm/lib/Target/Metal/Metal.cpp#L285-L324, and strangely if I compile our back-end it just sets the metadata correctly. I wonder if something's up with the Yggdrasil build.

Instead of debugging this, I'm going to try to set this flag from Julia, see maleadt/LLVM.jl#329. Can't test this right now though as the machine where I could reproduce this has died 🤦

EDIT: Setting the SDK version didn't help.

@christiangnrd
Copy link
Contributor

Bumping GPUArrays seems to have fixed the hanging for me. If I understand correctly this gets around the issue by not calling the problematic code but the problem still exists?

@maleadt
Copy link
Member Author

maleadt commented Mar 17, 2023

Yeah there's still an issue.

@maleadt
Copy link
Member Author

maleadt commented Mar 20, 2023

Reduced the hang to the following IR:

; ModuleID = 'kernel.ll'
source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx13.2.1"

; Function Attrs: cold noreturn nounwind
declare void @llvm.trap() #0

; Function Attrs: noinline
define internal void @throw() #1 {
top:
  tail call void @llvm.trap()
  unreachable
}

define cc103 void @kernel({ i8 addrspace(1)* } addrspace(1)* %0, i64 addrspace(1)* %1) {
entry:
  %2 = load i64, i64 addrspace(1)* %1, align 8
  %.not2 = icmp sgt i64 %2, 0
  br i1 %.not2, label %oob, label %exit

oob:                                              ; preds = %entry
  tail call void @throw()
  unreachable

exit:                                             ; preds = %entry
  ret void
}

attributes #0 = { cold noreturn nounwind }
attributes #1 = { noinline }

!air.kernel = !{!0}
!air.version = !{!5}
!llvm.module.flags = !{!6}

!0 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !1, !2}
!1 = !{}
!2 = !{!3, !4}
!3 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!4 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!5 = !{i32 2, i32 4, i32 0}
!6 = !{i32 2, !"SDK Version", [2 x i32] [i32 13, i32 2]}

After compiling this IR with our Metal back-end:

using Metal

function main(path)
    metallib = read(path)

    dev = current_device()
    lib = MTLLibraryFromData(dev, metallib)
    fun = MTLFunction(lib, "kernel")
    pipeline = MTLComputePipelineState(dev, fun)

    f = identity
    ft = typeof(f)
    tt = Tuple{ft, Tuple{MtlDeviceArray{Int64, 0, 1}, Int64}}
    kernel = Metal.HostKernel{ft, tt}(f, pipeline)

    arr = MtlArray{Int64}(undef)
    println("Waiting...")
    Metal.@sync kernel(arr, 1)
end

isinteractive() || main(ARGS...)

This hangs when the metallib was generated by our back-end based on LLVM 14, but not when using the LLVM 13 version. The difference:

 ; ModuleID = 'bc_module'
 source_filename = "text"
@@ -38,7 +38,8 @@
 ; Function Attrs: cold noreturn nounwind
 declare void @llvm.trap() #0

-define internal void @throw() {
+; Function Attrs: noinline
+define internal void @throw() #1 {
 top:
   tail call void @llvm.trap()
   unreachable
@@ -59,6 +60,7 @@
 }

 attributes #0 = { cold noreturn nounwind }
+attributes #1 = { noinline }

 !air.kernel = !{!0}
 !air.version = !{!5}

i.e. on LLVM 13 we drop the noinline attr which causes the unreachable to get inlined. When outlined, it hangs. This does seem like a bug in the Metal compiler, however, it does once again trace back to divergent control flow (like JuliaGPU/CUDAnative.jl#4, or now JuliaGPU/CUDA.jl#1746, which has been plaguing us for years).

kernels.zip

@maleadt
Copy link
Member Author

maleadt commented Mar 31, 2023

ObjC loader:

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        if (argc != 2) {
            NSLog(@"Usage: %s [Metal Library Filename]", argv[0]);
            return 1;
        }

        NSString *libraryFilePath = [NSString stringWithUTF8String:argv[1]];
        NSError *error = nil;

        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        if (!device) {
            NSLog(@"Metal is not supported on this device");
            return 1;
        }

        NSURL *libraryFileURL = [NSURL fileURLWithPath:libraryFilePath];
        id<MTLLibrary> library = [device newLibraryWithURL:libraryFileURL error:&error];
        if (!library) {
            NSLog(@"Failed to create Metal library: %@", error);
            return 1;
        }

        id<MTLFunction> kernelFunction = [library newFunctionWithName:@"kernel"];
        if (!kernelFunction) {
            NSLog(@"Failed to find the 'kernel' function");
            return 1;
        }

        id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithFunction:kernelFunction error:&error];
        if (!pipeline) {
            NSLog(@"Failed to create compute pipeline state: %@", error);
            return 1;
        }

        id<MTLCommandQueue> commandQueue = [device newCommandQueue];
        id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
        id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];

        [computeEncoder setComputePipelineState:pipeline];

        NSUInteger bufferSize = sizeof(int64_t);
        id<MTLBuffer> buffer1 = [device newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
        id<MTLBuffer> buffer2 = [device newBufferWithBytes:&(int64_t){1} length:sizeof(int64_t) options:MTLResourceStorageModeShared];

        [computeEncoder setBuffer:buffer1 offset:0 atIndex:0];
        [computeEncoder setBuffer:buffer2 offset:0 atIndex:1];

        MTLSize gridSize = MTLSizeMake(1, 1, 1);
        MTLSize threadgroupSize = MTLSizeMake(1, 1, 1);
        [computeEncoder dispatchThreadgroups:gridSize threadsPerThreadgroup:threadgroupSize];

        [computeEncoder endEncoding];

        MTLCommandBufferHandler completionHandler = ^(id<MTLCommandBuffer> cb) {
            NSLog(@"Kernel execution completed");
        };

        [commandBuffer addCompletedHandler:completionHandler];
        [commandBuffer commit];
        NSLog(@"Waiting...");
        [commandBuffer waitUntilCompleted];
    }
    return 0;
}

@maleadt
Copy link
Member Author

maleadt commented Apr 28, 2023

Hmm, I can actually reconstruct this IR using a Metal kernel:

#include <metal_stdlib>
using namespace metal;

struct Array {
    device int8_t *data;
};

__attribute__((noinline)) void perform_throw() {
    __builtin_trap();
}

kernel void kernel_fun(device Array *a, device int64_t *b [[ buffer(0) ]]) {
    if (*b > 0)
        perform_throw();
}

... but that one executes correctly. Trying to narrow down the differences, it looks like a metadata-related issue.

@maleadt
Copy link
Member Author

maleadt commented Apr 28, 2023

So with the following base IR:

; ModuleID = 'bc_module'
source_filename = "kernel"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx13.0.0"

declare void @llvm.trap()

; Function Attrs: noinline
define internal fastcc void @throw() #0 {
  tail call void @llvm.trap()
  unreachable
}

define void @kernel({ i8 addrspace(1)* } addrspace(1)* %0, i64 addrspace(1)* %1) {
entry:
  %2 = load i64, i64 addrspace(1)* %1, align 8
  %.not2 = icmp sgt i64 %2, 0
  br i1 %.not2, label %oob, label %exit

oob:                                              ; preds = %entry
  tail call void @throw()
  unreachable

exit:                                             ; preds = %entry
  ret void
}

attributes #0 = { noinline }

!air.version = !{!0}
!0 = !{i32 2, i32 4, i32 0}

... it works with the following metadata:

!air.kernel = !{!14}
!14 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !15, !16}
!15 = !{}
!16 = !{!17, !20}

!17 = !{i32 0, !"air.indirect_buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.struct_type_info", !18, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!18 = !{i32 0, i32 8, i32 0, !"char", !"data", !"air.indirect_argument", !19}
!19 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 1, !"air.arg_type_align_size", i32 1}

!20 = !{i32 1, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}

... but fails with what we emit:

!air.kernel = !{!1}
!1 = !{void ({ i8 addrspace(1)* } addrspace(1)*, i64 addrspace(1)*)* @kernel, !2, !3}
!2 = !{}
!3 = !{!4, !5}
!4 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!5 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, !"air.read_write", !"air.address_space", i32 1, !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}

So this is the original metadata issue again (where we emit a simple buffer, for bindless oepration, while Metal apparently expects a fleshed out metadata tree).

@maleadt maleadt changed the title CI failures on 1.9 Kernel hang related to noreturn function attributes May 19, 2023
@maleadt
Copy link
Member Author

maleadt commented May 19, 2023

Let's narrow this issue down to the kernel hang seen with noreturn function attributes. I have put a workaround in place in GPUCompiler (unreleased as of now, so use the master branch if you want to test this), but will keep this issue open so that we don't forget about it.

Disabling the workaround and running the MWE above on --check-bounds=yes still reproduces the hang.

@maleadt maleadt added kernels Things about kernels and how they are compiled. bug labels May 22, 2023
@maleadt maleadt added the upstream Out of our hands label Jun 19, 2024
@maleadt
Copy link
Member Author

maleadt commented Jun 19, 2024

Looks like this workaround isn't required anymore on 14.5 using an M3, however, disabling it seems to cause other test failures. At least no hangs, though.

@maleadt
Copy link
Member Author

maleadt commented Jun 19, 2024

Actually, it seems like running the MWE causes the GPU to hang (100% usage as reported by asitop) even though the synchronize returns. That 100% usage then causes test failures where we're (presumably) missing some synchronizations...

@tgymnich tgymnich removed the bug label Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kernels Things about kernels and how they are compiled. upstream Out of our hands
Projects
None yet
Development

No branches or pull requests

3 participants