
possible huge performance regression in batched_mul #282

Closed
bjarthur opened this issue Feb 26, 2021 · 7 comments

@bjarthur
Contributor

I'm still looking into the cause and quantifying the magnitude of the effect, but I think it is caused by #271. Has anyone else observed this?

@DhairyaLGandhi
Member

Can you provide an MWE?

@bjarthur
Contributor Author

bjarthur commented Mar 7, 2021

Here's an MWE showing a 1000x performance regression for batched_mul when one of the inputs is of type NNlib.BatchedTranspose. I've bisected it to somewhere between NNlib v0.7.12 and NNlib v0.7.13.

julia> VERSION
v"1.5.0"

(test0712) pkg> st
Status `~/projects/darshan/trainBalancedNet-ver2-GPU-forchris/Project.toml`
  [052768ef] CUDA v2.4.1
  [872c559c] NNlib v0.7.12 ⚲

julia> using CUDA, NNlib, BenchmarkTools

julia> x=CUDA.rand(Float32, 64,1,5000);

julia> xT=batched_transpose(x);

julia> y=CUDA.rand(Float32, 64,1,5000);

julia> @benchmark CUDA.@sync batched_mul(xT,y)
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  11
  --------------
  minimum time:     26.261 μs (0.00% GC)
  median time:      49.789 μs (0.00% GC)
  mean time:        50.305 μs (0.00% GC)
  maximum time:     1.174 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> xT2=CUDA.rand(Float32, 1,64,5000);

julia> @benchmark CUDA.@sync batched_mul(xT2,y)
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  7
  --------------
  minimum time:     34.175 μs (0.00% GC)
  median time:      54.682 μs (0.00% GC)
  mean time:        58.268 μs (0.00% GC)
  maximum time:     26.206 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> VERSION
v"1.5.0"

(test0713) pkg> st
Status `~/projects/darshan/trainBalancedNet-ver2-gpu/Project.toml`
  [052768ef] CUDA v2.4.1
  [872c559c] NNlib v0.7.13 ⚲

julia> using CUDA, NNlib, BenchmarkTools

julia> x=CUDA.rand(Float32, 64,1,5000);

julia> xT=batched_transpose(x);

julia> y=CUDA.rand(Float32, 64,1,5000);

julia> @benchmark CUDA.@sync batched_mul(xT,y)
BenchmarkTools.Trial: 
  memory estimate:  2.52 MiB
  allocs estimate:  110008
  --------------
  minimum time:     39.842 ms (0.00% GC)
  median time:      39.991 ms (0.00% GC)
  mean time:        43.148 ms (2.04% GC)
  maximum time:     98.322 ms (17.06% GC)
  --------------
  samples:          116
  evals/sample:     1

julia> xT2=CUDA.rand(Float32, 1,64,5000);

julia> @benchmark CUDA.@sync batched_mul(xT2,y)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  9
  --------------
  minimum time:     34.570 μs (0.00% GC)
  median time:      54.919 μs (0.00% GC)
  mean time:        57.578 μs (0.00% GC)
  maximum time:     21.103 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

@CarloLucibello
Member

cc @mcabbott

@mcabbott
Member

mcabbott commented Mar 9, 2021

Oh, that's not good. I can't try this out now, but the obvious guess is that it's using the generic fallback (which makes slices) rather than a CUBLAS call. If you run ENV["JULIA_DEBUG"] = NNlib before starting this, does it print anything about that? (There are some @debug statements; this may need to be done in a fresh session.) What does NNlib.storage_typejoin(xT,y) return?

Apart from that... does it return a CuArray at all? Does this also happen with, say, x=CUDA.rand(Float32, 64,3,5000); etc., to avoid the simple dot-product case?
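
For reference, a minimal diagnostic session along those lines might look like this (a sketch assuming a CUDA-capable machine; the exact @debug text depends on the NNlib version):

julia> using CUDA, NNlib

julia> ENV["JULIA_DEBUG"] = NNlib;  # enable NNlib's @debug messages, best done in a fresh session

julia> xT = batched_transpose(CUDA.rand(Float32, 64, 1, 5000));

julia> y = CUDA.rand(Float32, 64, 1, 5000);

julia> NNlib.storage_typejoin(xT, y)  # a CuArray type here means the CUBLAS path should be eligible

julia> batched_mul(xT, y);  # watch for a @debug line announcing the generic fallback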

@mcabbott
Member

More compact demonstration:

using NNlib
ENV["JULIA_DEBUG"] = NNlib

x, y = randn(64,3,5), randn(64,7,5); # fine
xT = batched_transpose(x); xT2 = copy(xT);
batched_mul(xT,y) ≈ batched_mul(xT2,y)

x, y = randn(64,1,5), randn(64,7,5); # uses fallback
xT = batched_transpose(x); xT2 = copy(xT);
batched_mul(xT,y) ≈ batched_mul(xT2,y)  # still correct

Testing all cases from #268: quite a few use the fallback.

using NNlib, Test
@testset "awkward strides, $T" for T in [Float64] #, ComplexF64]
    @testset "$tA(rand$((sA...,2))) ⊠ $tB(rand$((sB...,2)))" for tA in [identity, batched_adjoint, batched_transpose], sA in [(1,3), (3,1), (3,3)], tB in [identity, batched_adjoint, batched_transpose], sB in [(1,3), (3,1), (3,3)]
        A = tA(rand(T, sA..., 2))
        B = tB(rand(T, sB..., 2))
        size(A,2) == size(B,1) || continue
        C = cat(A[:,:,1] * B[:,:,1], A[:,:,2] * B[:,:,2]; dims=3)
        @test A ⊠ B ≈ C
    end
end;
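
For context, the generic fallback referred to above amounts to something like the following slice-by-slice loop (a schematic, not NNlib's actual code). On a CuArray, each iteration launches its own matmul and allocates its own views, which is consistent with the ~110k allocations and ~40 ms times in the slow benchmark above:

using LinearAlgebra

# Schematic slice-based fallback (hypothetical name; NNlib's real code differs):
function batched_mul_slices(A::AbstractArray{T,3}, B::AbstractArray{T,3}) where T
    C = similar(A, size(A,1), size(B,2), size(A,3))
    for k in axes(C, 3)
        @views mul!(C[:,:,k], A[:,:,k], B[:,:,k])  # one matmul per batch element
    end
    return C
end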

@bjarthur
Contributor Author

Thanks for looking into this!

Would it be worth adding automatic performance-regression testing to NNlib's CI, like the one that exists for Julia Base?

@mcabbott
Member

It might be, although tricky. There are many cases to test, and I'm not sure this would be a huge slowdown on the CPU, where the fast path isn't so different from the fallback.

Now #299 at least tests whether @debug prints anything, in a more exhaustive version of the loop just above. This tests a few hundred different cases; this one (batched_transpose(rand(n,1,b))) was among the edge cases tested for correctness, but there were no explicit tests of which routine is called.
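
For illustration, here is one way such a check can be written with the Test stdlib (a sketch; the actual tests in #299 may be structured differently). @test_logs with no patterns asserts that the expression emits no log records at or above the given level, so the fallback's @debug message would fail the test:

using NNlib, Test, Logging

A, B = randn(3, 3, 2), randn(3, 3, 2)

# Passes only if batched_mul emits no Debug-or-higher log records,
# i.e. the fast BLAS path was taken rather than the @debug-announced fallback.
@test_logs min_level=Logging.Debug batched_mul(A, B)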
