Invalid IR with StaticArrays.jl #363

Closed
pxl-th opened this issue Oct 26, 2022 · 14 comments

pxl-th commented Oct 26, 2022

Hi! I'm not really sure where to post this issue, but the following kernel produces an InvalidIRError on Julia 1.9 (master), while on 1.8.2 it works fine:

using StaticArrays
using CUDA
using CUDAKernels
using KernelAbstractions

@inline function to_image_pos(xy::SVector{2, Float32}, resolution::SVector{2, UInt32})
    xy_res = floor.(UInt32, max.(0f0, xy) .* resolution) # <-- this line causes issues
    min.(resolution .- 1, xy_res) .+ 1
end

@kernel function f(y, x)
    i = @index(Global)
    width, height = size(x)
    pixel = to_image_pos(SVector{2, Float32}(0.5f0, 0.5f0), SVector{2, UInt32}(width, height))
    y[i] = sum(pixel)
end

function main()
    dev = CUDADevice()
    x = CUDA.ones(Float32, (8, 8))
    y = CUDA.zeros(Float32, 4)
    wait(f(dev, 128)(y, x; ndrange=4))
end
main()

However, if I split the xy_res calculation into several steps, it works fine:

@inline function to_image_pos(xy::SVector{2, Float32}, resolution::SVector{2, UInt32})
    a = max.(0f0, xy)
    b = a .* resolution
    xy_res = floor.(UInt32, b)
    min.(resolution .- 1, xy_res) .+ 1
end

Error:

ERROR: InvalidIRError: compiling kernel #gpu_f(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CUDA.CuDeviceVector{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to floor)
Stacktrace:
  [1] newf
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:185
  [2] macro expansion
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:140
  [3] __broadcast
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:128
  [4] _broadcast
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:124
  [5] copy
    @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
  [6] materialize
    @ ./broadcast.jl:873
  [7] to_image_pos
    @ ~/code/INGP.jl/src/nerf/samples.jl:179
  [8] macro expansion
    @ ~/code/INGP.jl/src/INGP.jl:187
  [9] gpu_f
    @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [10] gpu_f
    @ ./none:0
Reason: unsupported call to an unknown function (call to jl_f_tuple)
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:140
 [2] __broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:128
 [3] _broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:124
 [4] copy
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
 [5] materialize
   @ ./broadcast.jl:873
 [6] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:179
 [7] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [8] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [9] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to eltype)
Stacktrace:
 [1] _broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:125
 [2] copy
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
 [3] materialize
   @ ./broadcast.jl:873
 [4] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:179
 [5] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [6] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [7] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to similar_type)
Stacktrace:
 [1] _broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:125
 [2] copy
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
 [3] materialize
   @ ./broadcast.jl:873
 [4] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:179
 [5] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [6] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [7] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation
Stacktrace:
 [1] _broadcast
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:125
 [2] copy
   @ ~/.julia/packages/StaticArrays/PUoe1/src/broadcast.jl:62
 [3] materialize
   @ ./broadcast.jl:873
 [4] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:179
 [5] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [6] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [7] gpu_f
   @ ./none:0
Reason: unsupported call to an unknown function (call to jl_f_tuple)
Stacktrace:
 [1] broadcasted
   @ ./broadcast.jl:1319
 [2] broadcasted
   @ ./broadcast.jl:1317
 [3] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:180
 [4] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [5] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [6] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to Base.Broadcast.Broadcasted{StaticArraysCore.StaticArrayStyle{1}})
Stacktrace:
 [1] broadcasted
   @ ./broadcast.jl:1319
 [2] broadcasted
   @ ./broadcast.jl:1317
 [3] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:180
 [4] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [5] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [6] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to materialize)
Stacktrace:
 [1] to_image_pos
   @ ~/code/INGP.jl/src/nerf/samples.jl:180
 [2] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:187
 [3] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [4] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to sum)
Stacktrace:
 [1] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:188
 [2] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [3] gpu_f
   @ ./none:0
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex!
   @ ~/.julia/packages/CUDA/DfvRa/src/device/array.jl:194
 [2] macro expansion
   @ ~/code/INGP.jl/src/INGP.jl:188
 [3] gpu_f
   @ ~/.julia/packages/KernelAbstractions/DqITC/src/macros.jl:81
 [4] gpu_f
   @ ./none:0

maleadt commented Oct 26, 2022

Probably not something we can do much about in GPUCompiler, as this seems like a Julia 'regression' (and not even strictly so, because Julia is a dynamic language, so not all code is expected to compile statically). The floor broadcast in particular is far from guaranteed to compile statically.

If you care about this pattern, what I'd recommend you do is: create a reproducer without GPUCompiler that emits static code with plain code_llvm on 1.8 but doesn't on 1.9, try to bisect Julia to the offending change, and file that as an upstream issue. That regression may well be worth caring about, since broadcast shouldn't generally regress (assuming the compilation regression reported here also manifests as a performance regression).
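
A minimal sketch of what such a CPU-only reproducer could look like (reusing the to_image_pos definition from above, with no GPU packages involved; whether dynamic dispatch actually shows up in the 1.9 output is exactly what would need checking):

using StaticArrays
using InteractiveUtils  # for @code_llvm

@inline function to_image_pos(xy::SVector{2, Float32}, resolution::SVector{2, UInt32})
    xy_res = floor.(UInt32, max.(0f0, xy) .* resolution)
    min.(resolution .- 1, xy_res) .+ 1
end

# If the printed IR contains calls through jl_apply_generic (dynamic dispatch)
# on 1.9 but not on 1.8, that is the kind of change worth bisecting upstream.
@code_llvm to_image_pos(SVector{2, Float32}(0.5f0, 0.5f0), SVector{2, UInt32}(8, 8))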


pxl-th commented Oct 26, 2022

Thanks for the suggestions!
I think there were indeed some regressions (at least with regard to StaticArrays.jl), because besides this issue, on Julia 1.9 I get

CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES)

after several iterations (< 10) in my code, while on 1.8.2 the same code runs fine for hundreds or thousands of iterations.


maleadt commented Oct 26, 2022

It's interesting that this happens after a couple of iterations, as ERROR_LAUNCH_OUT_OF_RESOURCES should happen when launching a kernel and, AFAIK, does not depend on the state of the GPU (so it should happen on the first iteration). Generally it's a user error related to the number of threads you choose to launch, which depends on the complexity of the kernel (e.g. the number of registers, which may have regressed if Julia now generates worse code). If you use the occupancy API, you shouldn't ever run into this.
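
For reference, a rough sketch of that pattern with a plain CUDA.jl kernel (scale! here is a made-up stand-in; KernelAbstractions handles launching differently, but the idea is the same: ask the driver for a launch configuration instead of hardcoding the thread count):

using CUDA

# Trivial element-wise kernel, used only to illustrate the launch pattern.
function scale!(y, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] = 2f0 * x[i]
    end
    return
end

x = CUDA.ones(Float32, 1024)
y = CUDA.zeros(Float32, 1024)

kernel = @cuda launch=false scale!(y, x)      # compile, but don't launch yet
config = launch_configuration(kernel.fun)     # driver-suggested occupancy limits
threads = min(length(y), config.threads)      # never exceed what the kernel supports
blocks  = cld(length(y), threads)
kernel(y, x; threads, blocks)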


pxl-th commented Oct 26, 2022

Reducing the number of threads does seem to help, but I should probably look into the occupancy API as you've suggested.

However, on AMDGPU, reducing the number of threads (even setting it as low as 1) only seems to delay when a similar error occurs. There, it progressively trims the maximum number of concurrent waves to let scratch memory fit, until the error occurs anyway.

And all of that comes with a performance hit.
So I'll try to create a reproducer and report it upstream as you've suggested. Thanks!

pxl-th closed this as completed Oct 26, 2022

pxl-th commented Oct 27, 2022

It's interesting that this happens after a couple of iterations

Just a follow-up: it happens after a few iterations because I start passing a new variable that was nothing for the first several steps, which forces recompilation.
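
Roughly what happens (illustrative only, not the actual INGP.jl code): the kernel gets specialized per argument type, so once that argument stops being nothing a new specialization is compiled, and that new specialization apparently needs more resources.

using KernelAbstractions

@kernel function g(y, x, extra)
    i = @index(Global)
    if extra === nothing
        y[i] = x[i]
    else
        y[i] = x[i] + extra[i]
    end
end

# first iterations:  g(dev, 128)(y, x, nothing; ndrange=4)      # one specialization
# later iterations:  g(dev, 128)(y, x, new_array; ndrange=4)    # forces recompilation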


maleadt commented Oct 27, 2022

Do you have a CPU-based MWE (i.e. just calling code_llvm, with minimal dependencies)? I have some bisection infrastructure ready, so I could give that a go.


pxl-th commented Oct 27, 2022

Not really... On the CPU, @code_llvm looks very similar between 1.8 and 1.9.

The smallest reproducer I've got is this:

using StaticArrays
using CUDA

function f(x)
    width, height = size(x)
    xy = SVector{2, Float32}(0.5f0, 0.5f0)
    res = SVector{2, UInt32}(width, height)
    floor.(UInt32, max.(0f0, xy) .* res)
    nothing
end

function main()
    x = CUDA.ones(Float32, (8, 8))
    @cuda threads=1 f(x)
end
main()

Also, if you replace width and height with hardcoded values, the issue disappears.
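
That is, this hypothetical variant (same packages as the MWE above, just with the resolution built from literals instead of size(x)) compiles without the error:

using StaticArrays
using CUDA

function f_hardcoded(x)
    xy = SVector{2, Float32}(0.5f0, 0.5f0)
    res = SVector{2, UInt32}(8, 8)   # hardcoded instead of width/height from size(x)
    floor.(UInt32, max.(0f0, xy) .* res)
    nothing
end

@cuda threads=1 f_hardcoded(CUDA.ones(Float32, (8, 8)))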


maleadt commented Oct 27, 2022

Hmm yes, this does seem limited to CUDA.jl (or probably, GPUCompiler.jl).

maleadt reopened this Oct 27, 2022

maleadt commented Oct 27, 2022

Opened an issue on GPUCompiler: #366


pxl-th commented Nov 14, 2022

@maleadt, RE ERROR_LAUNCH_OUT_OF_RESOURCES: I've updated both GPUCompiler and Julia to #master and still see this issue (the invalid-IR issue is gone, though).

How would you suggest I debug what's causing it?
I've looked at the output from @device_code, but I'm not used to reading it, so that didn't help much...
It looks similar between master and 1.8.2, but I may be missing something.


maleadt commented Nov 14, 2022

That is a separate issue. Can you create an MWE?

To debug this, you can try introspecting the kernel: get hold of the kernel object and call e.g. CUDA.registers on it to see whether the number of registers it requires has regressed.

But again, you should be using the occupancy API to be resilient against changes like this (so that your application at least doesn't crash), also because you want your kernels to be generic and e.g. support different element types (which may result in generated code that requires more registers).
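
Concretely, with the plain-CUDA MWE from above, introspection would look something like this (a sketch; substitute your real kernel and arguments):

x = CUDA.ones(Float32, (8, 8))
kernel = @cuda launch=false f(x)   # compile the kernel without launching it
CUDA.registers(kernel)             # registers per thread
CUDA.memory(kernel)                # local/shared/constant memory usage
CUDA.maxthreads(kernel)            # max threads per block for this compiled kernel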


pxl-th commented Nov 14, 2022

Can you create an MWE?

I'll try.

call e.g. CUDA.registers

Output on #master:

CUDA.registers(kerr) = 157
CUDA.memory(kerr) = (local = 1224, shared = 0, constant = 0)
CUDA.maxthreads(kerr) = 384

vs 1.8.2:

CUDA.registers(kerr) = 122
CUDA.memory(kerr) = (local = 1224, shared = 0, constant = 0)
CUDA.maxthreads(kerr) = 512


maleadt commented Nov 15, 2022

Yeah, that's a regression. Could you file that as an issue on CUDA.jl (i.e., the MWE calling CUDA.registers)?


pxl-th commented Nov 15, 2022

Opened JuliaGPU/CUDA.jl#1673.
