Invalid IR with StaticArrays.jl #363

Closed
pxl-th opened this issue Oct 26, 2022 · 14 comments

pxl-th commented Oct 26, 2022

Hi! I'm not really sure where to post this issue, but the following kernel produces an InvalidIRError on Julia 1.9 (master), while on 1.8.2 it works fine:

using StaticArrays
using CUDA
using CUDAKernels
using KernelAbstractions

@inline function to_image_pos(xy::SVector{2, Float32}, resolution::SVector{2, UInt32})
    xy_res = floor.(UInt32, max.(0f0, xy) .* resolution) # <-- this line causes issues
    min.(resolution .- 1, xy_res) .+ 1
end

@kernel function f(y, x)
    i = @index(Global)
    width, height = size(x)
    pixel = to_image_pos(SVector{2, Float32}(0.5f0, 0.5f0), SVector{2, UInt32}(width, height))
    y[i] = sum(pixel)
end

function main()
    dev = CUDADevice()
    x = CUDA.ones(Float32, (8, 8))
    y = CUDA.zeros(Float32, 4)
    wait(f(dev, 128)(y, x; ndrange=4))
end
main()

However, if I split the xy_res calculation into several steps, it works fine:

@inline function to_image_pos(xy::SVector{2, Float32}, resolution::SVector{2, UInt32})
    a = max.(0f0, xy)
    b = a .* resolution
    xy_res = floor.(UInt32, b)
    min.(resolution .- 1, xy_res) .+ 1
end

Error:

ERROR: InvalidIRError: compiling kernel #gpu_f(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CUDA.CuDeviceVector{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to floor)
Stacktrace:
  [1] newf
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:185
  [2] macro expansion
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:140
  [3] __broadcast
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:128
  [4] _broadcast
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:124
  [5] copy
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
  [6] materialize
    @ ./broadcast.jl:873
  [7] to_image_pos
    @ ~/code/INGP.jl/src/nerf/samples.jl:179
  [8] macro expansion
    @ ~/code/INGP.jl/src/INGP.jl:187
  [9] gpu_f
    @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [10] gpu_f
    @ ./none:0
Reason: unsupported call to an unknown function (call to jl_f_tuple)
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:140
 [2] __broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:128
 [3] _broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:124
 [4] copy
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
 [5] materialize
   @ ./broadcast.jl:873
 [6] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:179
 [7] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [8] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [9] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to eltype)
Stacktrace:
 [1] _broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:125
 [2] copy
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
 [3] materialize
   @ ./broadcast.jl:873
 [4] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:179
 [5] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [6] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [7] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to similar_type)
Stacktrace:
 [1] _broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:125
 [2] copy
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
 [3] materialize
   @ ./broadcast.jl:873
 [4] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:179
 [5] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [6] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [7] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation
Stacktrace:
 [1] _broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:125
 [2] copy
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
 [3] materialize
   @ ./broadcast.jl:873
 [4] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:179
 [5] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [6] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [7] gpu_f
   @ ./none:0
Reason: unsupported call to an unknown function (call to jl_f_tuple)
Stacktrace:
 [1] broadcasted
   @ ./broadcast.jl:1319
 [2] broadcasted
   @ ./broadcast.jl:1317
 [3] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:180
 [4] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [5] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [6] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to Base.Broadcast.Broadcasted{StaticArraysCore.StaticArrayStyle{1}})
Stacktrace:
 [1] broadcasted
   @ ./broadcast.jl:1319
 [2] broadcasted
   @ ./broadcast.jl:1317
 [3] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:180
 [4] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [5] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [6] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to materialize)
Stacktrace:
 [1] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:180
 [2] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [3] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [4] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to sum)
Stacktrace:
 [1] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:188
 [2] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex!
   @ ~/.julia/packages/CUDA/DfvRa/src/device/array.jl:194
 [2] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:188
 [3] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [4] gpu_f
   @ ./none:0

maleadt commented Oct 26, 2022

Probably not something we can do much about in GPUCompiler, as this seems like a Julia 'regression' (and not even strictly so, because Julia is a dynamic language, so not all code is expected to compile statically). The floor broadcast in particular is far from guaranteed to compile statically.

If you care about this pattern, what I'd recommend you do is: create a reproducer without GPUCompiler that emits static code with plain code_llvm on 1.8 but doesn't on 1.9, try to bisect Julia to the offending change, and file that as an upstream issue. That regression may well be worth caring about, since broadcast shouldn't generally regress (assuming the compilation regression reported here also manifests as a performance regression).
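
A minimal sketch of what such a CPU-only reproducer could look like (reusing the to_image_pos definition from above, with no GPU packages involved; whether dynamic dispatch actually shows up in the 1.9 output is exactly what would need checking):

using StaticArrays
using InteractiveUtils  # for @code_llvm

@inline function to_image_pos(xy::SVector{2, Float32}, resolution::SVector{2, UInt32})
    xy_res = floor.(UInt32, max.(0f0, xy) .* resolution)
    min.(resolution .- 1, xy_res) .+ 1
end

# If the printed IR contains calls through jl_apply_generic (dynamic dispatch)
# on 1.9 but not on 1.8, that is the kind of change worth bisecting upstream.
@code_llvm to_image_pos(SVector{2, Float32}(0.5f0, 0.5f0), SVector{2, UInt32}(8, 8))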


pxl-th commented Oct 26, 2022

Thanks for the suggestions!
I think there were indeed some regressions (at least with regard to StaticArrays.jl), because besides this issue, on Julia 1.9 I get

CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES)

after several iterations (< 10) in my code, while on 1.8.2 the same code runs fine for hundreds or thousands of iterations.


maleadt commented Oct 26, 2022

It's interesting that this happens after a couple of iterations, as ERROR_LAUNCH_OUT_OF_RESOURCES should happen when launching a kernel and, AFAIK, does not depend on the state of the GPU (so it should happen on the first iteration). Generally it's a user error related to the number of threads you choose to launch, which depends on the complexity of the kernel (e.g. the number of registers, which may have regressed if Julia now generates worse code). If you use the occupancy API, you shouldn't ever run into this.
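
For reference, a rough sketch of that pattern with a plain CUDA.jl kernel (scale! here is a made-up stand-in; KernelAbstractions handles launching differently, but the idea is the same: ask the driver for a launch configuration instead of hardcoding the thread count):

using CUDA

# Trivial element-wise kernel, used only to illustrate the launch pattern.
function scale!(y, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] = 2f0 * x[i]
    end
    return
end

x = CUDA.ones(Float32, 1024)
y = CUDA.zeros(Float32, 1024)

kernel = @cuda launch=false scale!(y, x)      # compile, but don't launch yet
config = launch_configuration(kernel.fun)     # driver-suggested occupancy limits
threads = min(length(y), config.threads)      # never exceed what the kernel supports
blocks  = cld(length(y), threads)
kernel(y, x; threads, blocks)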


pxl-th commented Oct 26, 2022

Reducing the number of threads does seem to help, but I should probably look into the occupancy API as you've suggested.

However, on AMDGPU, reducing the number of threads (even setting it as low as 1) only seems to delay when a similar error occurs. There, it progressively trims the maximum number of concurrent waves to let scratch memory fit, until the error occurs anyway.

And all of that comes with a performance hit.
So I'll try to create a reproducer and report it upstream as you've suggested. Thanks!

pxl-th closed this as completed Oct 26, 2022

pxl-th commented Oct 27, 2022

It's interesting that this happens after a couple of iterations

Just a follow-up: it happens after a few iterations because I start passing a new variable that was nothing for the first several steps, which forces recompilation.
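
Roughly what happens (illustrative only, not the actual INGP.jl code): the kernel gets specialized per argument type, so once that argument stops being nothing a new specialization is compiled, and that new specialization apparently needs more resources.

using KernelAbstractions

@kernel function g(y, x, extra)
    i = @index(Global)
    if extra === nothing
        y[i] = x[i]
    else
        y[i] = x[i] + extra[i]
    end
end

# first iterations:  g(dev, 128)(y, x, nothing; ndrange=4)      # one specialization
# later iterations:  g(dev, 128)(y, x, new_array; ndrange=4)    # forces recompilation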


maleadt commented Oct 27, 2022

Do you have a CPU-based MWE (i.e. just calling code_llvm, with minimal dependencies)? I have some bisection infrastructure ready, so I could give that a go.


pxl-th commented Oct 27, 2022

Not really... On the CPU, @code_llvm looks very similar between 1.8 and 1.9.

The smallest reproducer I've got is this:

using StaticArrays
using CUDA

function f(x)
    width, height = size(x)
    xy = SVector{2, Float32}(0.5f0, 0.5f0)
    res = SVector{2, UInt32}(width, height)
    floor.(UInt32, max.(0f0, xy) .* res)
    nothing
end

function main()
    x = CUDA.ones(Float32, (8, 8))
    @cuda threads=1 f(x)
end
main()

Also, if you replace width and height with hardcoded values, the issue disappears.
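
That is, this hypothetical variant (same packages as the MWE above, just with the resolution built from literals instead of size(x)) compiles without the error:

using StaticArrays
using CUDA

function f_hardcoded(x)
    xy = SVector{2, Float32}(0.5f0, 0.5f0)
    res = SVector{2, UInt32}(8, 8)   # hardcoded instead of width/height from size(x)
    floor.(UInt32, max.(0f0, xy) .* res)
    nothing
end

@cuda threads=1 f_hardcoded(CUDA.ones(Float32, (8, 8)))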


maleadt commented Oct 27, 2022

Hmm yes, this does seem limited to CUDA.jl (or probably, GPUCompiler.jl).

maleadt reopened this Oct 27, 2022

maleadt commented Oct 27, 2022

Opened an issue on GPUCompiler: #366


pxl-th commented Nov 14, 2022

@maleadt, RE ERROR_LAUNCH_OUT_OF_RESOURCES: I've updated both GPUCompiler and Julia to #master and still see this issue (the invalid-IR issue is gone, though).

How would you suggest I debug what's causing it?
I've looked at the output from @device_code, but I'm not used to reading it, so that didn't help much...
It looks similar between master and 1.8.2, but I may be missing something.


maleadt commented Nov 14, 2022

That is a separate issue. Can you create an MWE?

To debug this, you can try introspecting the kernel: get hold of the kernel object and call e.g. CUDA.registers on it to see whether the number of registers it requires has regressed.

But again, you should be using the occupancy API to be resilient against changes like this (so that your application at least doesn't crash), also because you want your kernels to be generic and e.g. support different element types (which may result in generated code that requires more registers).
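
Concretely, with the plain-CUDA MWE from above, introspection would look something like this (a sketch; substitute your real kernel and arguments):

x = CUDA.ones(Float32, (8, 8))
kernel = @cuda launch=false f(x)   # compile the kernel without launching it
CUDA.registers(kernel)             # registers per thread
CUDA.memory(kernel)                # local/shared/constant memory usage
CUDA.maxthreads(kernel)            # max threads per block for this compiled kernel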


pxl-th commented Nov 14, 2022

Can you create an MWE?

I'll try.

call e.g. CUDA.registers

Output on #master:

CUDA.registers(kerr) = 157
CUDA.memory(kerr) = (local = 1224, shared = 0, constant = 0)
CUDA.maxthreads(kerr) = 384

vs 1.8.2:

CUDA.registers(kerr) = 122
CUDA.memory(kerr) = (local = 1224, shared = 0, constant = 0)
CUDA.maxthreads(kerr) = 512


maleadt commented Nov 15, 2022

Yeah, that's a regression. Could you file that as an issue on CUDA.jl (i.e., the MWE calling CUDA.registers)?


pxl-th commented Nov 15, 2022

Opened JuliaGPU/CUDA.jl#1673.
