Slow kernel compilation #65
I think I came across this while working on https://discourse.julialang.org/t/a-demo-is-1-5x-faster-in-flux-than-tensorflow-both-use-cpu-while-3-0x-slower-during-using-cuda/66703/. Most Flux optimisers (e.g. ADAM) use a set of fused broadcasts, and it seems compiling these can take a significant amount of time. |
Do you have a reproducer without all of Flux (or without having to use Flux at run time)? I.e., is there a specific kernel that compiles excessively slowly, or is it just that we're compiling a lot of kernels somewhat slowly? |
I believe it's the latter, specifically fused broadcast kernels:
using CUDA
# ADAM-style moment updates written as three fused broadcasts
function adam_combined!(Δ, mt, vt, β, ϵ, η)
    @. mt = β[1] * mt + (1 - β[1]) * Δ
    @. vt = β[2] * vt + (1 - β[2]) * Δ^2
    @. Δ = mt / (1 - β[1]) / (√(vt / (1 - β[2])) + ϵ) * η
    return mt, vt
end
# Ensure broadcast machinery is already compiled
let a = CUDA.rand(10), b = CUDA.rand(10)
    @. a = a * b / a + b
end
let Δ = CUDA.rand(10, 10), mt = CUDA.zeros(10, 10), vt = CUDA.zeros(10, 10), β = (0.9, 0.999)
    CUDA.GPUCompiler.enable_timings()
    @time CUDA.@sync adam_combined!(Δ, mt, vt, β, Float32(1e-8), 0.001f0)
    CUDA.GPUCompiler.timings()
end

Which outputs:
I'm not sure why there's such a large gap between the total runtime and the time tracked by GPUCompiler. I'm working on capturing a flamegraph for the host-side operations and will update with that shortly. Edit: see https://gist.github.com/ToucheSir/43077d99e2318004c175cd5ba1346ea7. |
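For reference, a host-side flamegraph like the one in that gist can be captured with the Profile stdlib; a minimal sketch, run in a fresh session so the kernels actually compile under the profiler (the viewer choice is an assumption, not taken from the gist):

using Profile
# (with Δ, mt, vt, β etc. set up as in the reproducer above)
Profile.clear()
# Profile the first, compiling call; CUDA.@sync keeps the asynchronous GPU work inside the window.
@profile CUDA.@sync adam_combined!(Δ, mt, vt, β, Float32(1e-8), 0.001f0)
# Then render the samples with e.g. ProfileView.view() or PProf.pprof().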
Lots of inference time, it seems? |
That, and the third broadcast spends the bulk of its time running the LLVM pass pipeline? I guess I'm confused as to why both would be so much slower for the GPU path compared to the CPU one. |
CPU compilation uses Julia's native code and inference caches, which are already "hot" from the sysimage and precompile data. GPU compilation through GPUCompiler essentially starts with empty caches, requiring reinferring and recompiling "the world" (every function called or callable during GPU kernel execution). |
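A rough illustration of that asymmetry (an assumed snippet, not from this thread):

x, y = rand(Float32, 10), rand(Float32, 10)
@time x .* y                 # CPU broadcast: inference/code caches are already warm
xd, yd = CuArray(x), CuArray(y)
@time CUDA.@sync xd .* yd    # first GPU broadcast: the whole kernel call graph is inferred and compiled
@time CUDA.@sync xd .* yd    # second call: served from GPUCompiler's kernel cache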
Is that for every new kernel or just the first one? I tried running a broadcasted operation before profiling in the hopes of having as much as possible "warm", but maybe that wasn't enough? I also just tried turning each broadcast into its own kernel (with synchronization in between) and that runs in less time (1.5s vs 2.4s) than the broadcasted version. Does compiling fused broadcasts also add significant overhead somewhere? |
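For concreteness, one reading of "turning each broadcast into its own kernel", applied to the third, heavily fused update (an assumed sketch; not necessarily the code behind the 1.5s vs 2.4s numbers):

function adam_step3_unfused!(Δ, mt, vt, β, ϵ, η, tmp)
    # Each statement is a much simpler broadcast, so each kernel is cheaper to infer
    # and compile, at the cost of extra launches and memory traffic.
    tmp .= vt ./ (1 - β[2])
    CUDA.synchronize()
    tmp .= sqrt.(tmp) .+ ϵ
    CUDA.synchronize()
    Δ .= mt ./ (1 - β[1]) ./ tmp .* η
    return Δ
end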
Inference of broadcast is significantly harder.
Every initial compilation of a new function (e.g. a broadcast with a new closure) will require some inference. You can try measuring after clearing some of the caches. There may be an additional source of latency: initial compilation of the Julia compiler itself. Since we use a custom AbstractInterpreter, it's possible that a lot of code there is recompiled for the first kernel, but not for subsequent compilations (i.e. wiping those caches shouldn't affect that). |
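To see why inference has more to chew on for fused broadcasts, it can help to look at the lazy Broadcasted object such an expression builds before it is materialized (an illustrative snippet):

a, b = CUDA.rand(10), CUDA.rand(10)
# What `@. a = a * b / a + b` constructs under the hood before materialize! runs:
bc = Base.Broadcast.broadcasted(+, Base.Broadcast.broadcasted(/, Base.Broadcast.broadcasted(*, a, b), a), b)
typeof(bc)  # a nested Broadcasted type; the fused ADAM updates produce even deeper nests (see the signature further below)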
Thanks for the explanation. For Flux, I guess that leaves us with either adding dispatches to native kernels/KernelAbstractions kernels on GPU to ameliorate the overhead of multiple broadcasts, or waiting for as-yet-unknown upstream compiler improvements to smooth things out. |
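For reference, the handwritten-kernel route could look roughly like this with KernelAbstractions (a sketch; the kernel name and launch details are assumptions and depend on the KernelAbstractions version):

using KernelAbstractions

# One handwritten kernel covering all three updates, so a single, relatively simple
# kernel gets compiled instead of several fused broadcasts.
@kernel function adam_update!(Δ, mt, vt, β1, β2, ϵ, η)
    i = @index(Global)
    @inbounds begin
        mt[i] = β1 * mt[i] + (1 - β1) * Δ[i]
        vt[i] = β2 * vt[i] + (1 - β2) * Δ[i]^2
        Δ[i]  = mt[i] / (1 - β1) / (sqrt(vt[i] / (1 - β2)) + ϵ) * η
    end
end
# Launched roughly as: adam_update!(backend)(Δ, mt, vt, 0.9f0, 0.999f0, 1f-8, 1f-3; ndrange = length(Δ))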
FWIW, GPU compilation of simple kernels is pretty OK:
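A sketch of what such a measurement looks like, using a trivial placeholder kernel (illustrative only, not the benchmark behind that statement):

julia> using CUDA, GPUCompiler, BenchmarkTools

julia> simple_kernel(x) = (@inbounds x[1] = 1f0; return)

julia> @benchmark CUDA.code_ptx(devnull, simple_kernel, Tuple{CuDeviceVector{Float32, 1}})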
With some of the more complicated broadcasts:

julia> @eval GPUArrays begin
           function broadcast_kernel(ctx, dest, bc′, nelem)
               for i in 1:nelem
                   I = @cartesianidx(dest, i)
                   @inbounds dest[I] = bc′[I]
               end
               return
           end
       end
# I couldn't figure out how to use the anonymous function defined in GPUArrays, the only function, #broadcast_kernel#17#, didn't have any methods
julia> GPUCompiler.enable_timings()
julia> @benchmark CUDA.code_ptx(devnull, GPUArrays.broadcast_kernel, Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(sqrt), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(/), Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Int64, Float64}}}}}}, Float32}}}}, Float32}}, Int64})
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 444.470 ms … 474.020 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 449.698 ms ┊ GC (median): 0.00%
Time (mean ± σ): 452.559 ms ± 10.287 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
  [histogram: frequency by time, 444 ms … 474 ms]
Memory estimate: 1.21 MiB, allocs estimate: 18680.
julia> GPUCompiler.timings()
────────────────────────────────────────────────────────────────────────────────────
Time Allocations
────────────────────── ───────────────────────
Tot / % measured: 882s / 1.49% 6.21GiB / 0.69%
Section ncalls time %tot avg alloc %tot avg
────────────────────────────────────────────────────────────────────────────────────
 LLVM middle-end              29    12.7s  96.8%   439ms   14.9MiB  34.2%   528KiB
   optimization               29    11.5s  87.5%   396ms    401KiB  0.90%  13.8KiB
   target libraries           29    1.09s  8.28%  37.5ms   13.1MiB  29.9%   462KiB
   clean-up                   29   93.8ms  0.71%  3.23ms   34.0KiB  0.08%  1.17KiB
   runtime library            29   33.6ms  0.26%  1.16ms   4.53KiB  0.01%     160B
 IR generation                29    294ms  2.24%  10.1ms   28.5MiB  65.2%  0.98MiB
   rewrite                    29    127ms  0.96%  4.37ms   11.5MiB  26.3%   406KiB
   lower throw                29   2.67ms  0.02%  92.2μs   1.86MiB  4.26%  65.8KiB
   hide trap                  29    411μs  0.00%  14.2μs    155KiB  0.35%  5.33KiB
   emission                   29    108ms  0.82%  3.71ms   16.6MiB  38.1%   587KiB
   clean-up                   29    614μs  0.00%  21.2μs   86.5KiB  0.19%  2.98KiB
 LLVM back-end                29    121ms  0.92%  4.19ms    240KiB  0.54%  8.26KiB
   machine-code generation    29    121ms  0.92%  4.16ms    212KiB  0.47%  7.31KiB
   preparation                29    813μs  0.01%  28.0μs   26.1KiB  0.06%     920B
 Julia front-end              29    362μs  0.00%  12.5μs   17.7KiB  0.04%     624B
   validation                 29    301μs  0.00%  10.4μs   17.7KiB  0.04%     624B
   strip debug info           29    123μs  0.00%  4.24μs     0.00B  0.00%    0.00B
 ────────────────────────────────────────────────────────────────────────────────────

With 85% spent during LLVM optimization, I'm not sure we can do much better. Also interesting: dropping the inference cache before every compilation adds about 20%.
Again, all this is different from the time-to-first-kernel, which includes a lot of time (re)compiling the Julia and GPU compilers themselves. |
That sounds like something we could precompile? |
You can't precompile another module's methods, right? I thought there was a Base issue for this, can't find it immediately. |
I thought you could as long as you own one of the types? But yes, arcane rules. |
IIRC it used to be 200μs or so, and @vtjnash mentioned Base compilation taking 1μs–1ms or so for the simplest functions. This is pretty slow then.