Benchmarking some very simple Flux models #2069
Trying this on another computer, with Julia 1.11, I see a similar slowdown on the small model, and a failure on the larger one.

```julia
julia> @btime $mlp($img);
  173.251 μs (13 allocations: 42.36 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);
  494.602 μs (69 allocations: 588.97 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);
  884.058 μs (91 allocations: 586.92 KiB)

# Larger model fails:
julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
ERROR:
No create nofree of empty function (julia.gc_loaded) julia.gc_loaded)
at context: call fastcc void @julia__PoolDims_14_107488({ [2 x i64], [2 x i64], i64, [2 x i64], [4 x i64], [2 x i64] }* noalias nocapture nofree noundef nonnull writeonly sret({ [2 x i64], [2 x i64], i64, [2 x i64], [4 x i64], [2 x i64] }) align 8 dereferenceable(104) %5, [2 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(64) %35, [4 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(96) %34, [4 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(32) %44, [2 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(112) %36) #268, !dbg !297 (julia__PoolDims_14_107488)
Stacktrace:
[1] PoolDims
@ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:20
[2] PoolDims
@ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:43
[3] MaxPool
@ ~/.julia/packages/Flux/htpCe/src/layers/conv.jl:728
[4] macro expansion
@ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
[5] _applychain
@ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
Stacktrace:
[1] PoolDims
@ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:20 [inlined]
[2] PoolDims
@ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:43 [inlined]
[3] MaxPool
@ ~/.julia/packages/Flux/htpCe/src/layers/conv.jl:728 [inlined]
[4] macro expansion
@ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53 [inlined]
[5] _applychain
@ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
[6] Chain
@ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:51 [inlined]
[7] #19
@ ./REPL[31]:1 [inlined]
[8] diffejulia__19_105996_inner_242wrap
@ ./REPL[31]:0
[9] macro expansion
@ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
[10] enzyme_call
@ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
[11] CombinedAdjointThunk
@ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7641 [inlined]
[12] autodiff
@ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:491 [inlined]
[13] autodiff
@ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:512 [inlined]
[14] macro expansion
@ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1678 [inlined]
[15] gradient(rm::ReverseMode{…}, f::var"#19#20", x::Chain{…}, args::Array{…})
@ Enzyme ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1661
[16] top-level scope
@ REPL[31]:1
Some type information was truncated. Use `show(err)` to see complete types.
```
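The stack trace points at the `PoolDims` constructor called from `MaxPool`. A minimal sketch that tries to isolate just that layer (this reproducer is an assumption, not taken from the thread, and may or may not trigger the same `julia.gc_loaded` error):

```julia
using Flux, Enzyme

# Only the pooling layer that the stack trace implicates, with a small 4D input.
pool = MaxPool((2, 2))
x = rand(Float32, 28, 28, 1, 1)

# Same call pattern as the failing example, differentiating only w.r.t. the input.
Enzyme.gradient(Reverse, z -> sum(abs2, pool(z)), x)
```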
```julia
(jl_KEzUxT) pkg> st Enzyme
Status `/tmp/jl_KEzUxT/Project.toml`
  [7da242da] Enzyme v0.13.14

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, broadwell)
Threads: 4 default, 0 interactive, 2 GC (on 12 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4
```
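For context, `mlp`, `lenet`, and `img` are not defined in the output quoted above; a minimal sketch of definitions consistent with these benchmarks (the layer sizes and the MNIST-shaped input are assumptions) is:

```julia
using Flux, Enzyme, BenchmarkTools

img = rand(Float32, 28, 28, 1, 1)   # assumed 28×28 grayscale input, batch size 1

# Small fully connected model ("mlp" in the timings above):
mlp = Chain(Flux.flatten, Dense(28^2 => 32, relu), Dense(32 => 10))

# LeNet-style convolutional model ("lenet"), whose Enzyme gradient hits the MaxPool error:
lenet = Chain(
    Conv((5, 5), 1 => 6, relu),  MaxPool((2, 2)),
    Conv((5, 5), 6 => 16, relu), MaxPool((2, 2)),
    Flux.flatten,
    Dense(256 => 120, relu), Dense(120 => 84, relu), Dense(84 => 10),
)
```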
Sorry finally getting around to this. So for the first case, I don't see that much of a gap (though definitely it would be good to improve):
Surprised how different those numbers are. I realised I have AppleAccelerate loaded, and if I run with it:

```julia
julia> @btime $mlp($img);
  min 104.833 μs, mean 109.179 μs (6 allocations, 43.09 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);  # Zygote, allocating
  min 243.792 μs, mean 305.012 μs (84 allocations, 596.17 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);  # allocating
  min 266.292 μs, mean 329.010 μs (55 allocations, 579.61 KiB)

julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(mlp, Enzyme.make_zero(mlp))), $(Duplicated(img, Enzyme.make_zero(img))));  # pre-allocated
  min 256.916 μs, mean 270.453 μs (11 allocations, 86.16 KiB)
```

(Same machine & versions as above.)
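For readers comparing the last two calls: `Enzyme.gradient` allocates fresh gradient storage on every call, while the `autodiff` form reuses pre-allocated shadows. A sketch of that pattern in a loop, assuming the `mlp`/`img` names from above (`Enzyme.make_zero!` resets the shadows in place between calls):

```julia
using Flux, Enzyme

loss(m, x) = sum(abs2, m(x))

# Allocate shadow (gradient) storage once, outside the hot loop.
dmodel = Enzyme.make_zero(mlp)
dinput = Enzyme.make_zero(img)

for step in 1:10
    Enzyme.make_zero!(dmodel)   # clear gradients accumulated on the previous step
    Enzyme.make_zero!(dinput)
    Enzyme.autodiff(Reverse, loss, Duplicated(mlp, dmodel), Duplicated(img, dinput))
    # dmodel and dinput now hold ∂loss/∂mlp and ∂loss/∂img for this step
end
```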
Huh, so what exactly causes it to be slow? AppleAccelerate itself?
Don't know. For the other model, changing to OpenBLAS gives a slightly larger time-difference instead (and a slightly smaller ratio):

```julia
julia> @btime $lenet($img);  # was min 655.583 μs, mean 1.107 ms with AppleAccelerate above
  min 839.916 μs, mean 1.910 ms (160 allocations, 5.60 MiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 7.980 ms, mean 9.273 ms (556 allocations, 14.18 MiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 11.960 ms, mean 13.037 ms (538 allocations, 15.42 MiB)

julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(lenet, Enzyme.make_zero(lenet))), $(Duplicated(img, Enzyme.make_zero(img))));
  min 12.017 ms, mean 13.615 ms (415 allocations, 14.85 MiB)
```

The times here: #2069 (comment) on a different computer also don't involve AppleAccelerate.
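Since the comparison hinges on which BLAS backend is loaded: loading AppleAccelerate.jl swaps the BLAS used through libblastrampoline, and Julia's default is OpenBLAS. A small sketch to check which backend is active (the AppleAccelerate line is macOS-only and an assumption, not something shown in the thread):

```julia
using LinearAlgebra

BLAS.get_config()        # lists the libraries libblastrampoline currently forwards BLAS calls to

# using AppleAccelerate  # on macOS, loading this package switches BLAS to Apple's Accelerate;
                         # a fresh session without it falls back to the default OpenBLAS
```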
On some extremely simple Flux models, Enzyme seems to be slower than Zygote for me. What's going wrong here?
Versions: