
possible huge performance regression in batched_mul #282

Closed
bjarthur opened this issue Feb 26, 2021 · 7 comments

@bjarthur
Contributor

I'm still looking into the cause and quantifying the magnitude of the effect, but I think it is caused by #271. Has anyone else observed this?

@DhairyaLGandhi
Member

Can you provide an MWE?

@bjarthur
Contributor Author

bjarthur commented Mar 7, 2021

Here's an MWE showing a 1000x performance regression for batched_mul when one of the inputs is of type NNlib.BatchedTranspose. I've bisected it to somewhere between NNlib v0.7.12 and NNlib v0.7.13.

julia> VERSION
v"1.5.0"

(test0712) pkg> st
Status `~/projects/darshan/trainBalancedNet-ver2-GPU-forchris/Project.toml`
  [052768ef] CUDA v2.4.1
  [872c559c] NNlib v0.7.12 ⚲

julia> using CUDA, NNlib, BenchmarkTools

julia> x=CUDA.rand(Float32, 64,1,5000);

julia> xT=batched_transpose(x);

julia> y=CUDA.rand(Float32, 64,1,5000);

julia> @benchmark CUDA.@sync batched_mul(xT,y)
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  11
  --------------
  minimum time:     26.261 μs (0.00% GC)
  median time:      49.789 μs (0.00% GC)
  mean time:        50.305 μs (0.00% GC)
  maximum time:     1.174 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> xT2=CUDA.rand(Float32, 1,64,5000);

julia> @benchmark CUDA.@sync batched_mul(xT2,y)
BenchmarkTools.Trial: 
  memory estimate:  176 bytes
  allocs estimate:  7
  --------------
  minimum time:     34.175 μs (0.00% GC)
  median time:      54.682 μs (0.00% GC)
  mean time:        58.268 μs (0.00% GC)
  maximum time:     26.206 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> VERSION
v"1.5.0"

(test0713) pkg> st
Status `~/projects/darshan/trainBalancedNet-ver2-gpu/Project.toml`
  [052768ef] CUDA v2.4.1
  [872c559c] NNlib v0.7.13 ⚲

julia> using CUDA, NNlib, BenchmarkTools

julia> x=CUDA.rand(Float32, 64,1,5000);

julia> xT=batched_transpose(x);

julia> y=CUDA.rand(Float32, 64,1,5000);

julia> @benchmark CUDA.@sync batched_mul(xT,y)
BenchmarkTools.Trial: 
  memory estimate:  2.52 MiB
  allocs estimate:  110008
  --------------
  minimum time:     39.842 ms (0.00% GC)
  median time:      39.991 ms (0.00% GC)
  mean time:        43.148 ms (2.04% GC)
  maximum time:     98.322 ms (17.06% GC)
  --------------
  samples:          116
  evals/sample:     1

julia> xT2=CUDA.rand(Float32, 1,64,5000);

julia> @benchmark CUDA.@sync batched_mul(xT2,y)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  9
  --------------
  minimum time:     34.570 μs (0.00% GC)
  median time:      54.919 μs (0.00% GC)
  mean time:        57.578 μs (0.00% GC)
  maximum time:     21.103 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

@CarloLucibello
Member

cc @mcabbott

@mcabbott
Member

mcabbott commented Mar 9, 2021

Oh, that's not good. I can't try this out now, but the obvious guess is that it's using the generic fallback (which makes slices) rather than a CUBLAS call. If you run ENV["JULIA_DEBUG"] = NNlib before starting this, does it print anything about that? (There are some @debug statements; this may need to be done in a fresh session.) What does NNlib.storage_typejoin(xT,y) return?

Apart from that... does it return a CuArray at all? Does this also happen with, say, x=CUDA.rand(Float32, 64,3,5000); etc., to avoid the simple dot-product case?
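
For reference, a minimal diagnostic session along those lines might look like this (a sketch assuming a CUDA-capable machine; the exact @debug text depends on the NNlib version):

julia> using CUDA, NNlib

julia> ENV["JULIA_DEBUG"] = NNlib;  # enable NNlib's @debug messages, best done in a fresh session

julia> xT = batched_transpose(CUDA.rand(Float32, 64, 1, 5000));

julia> y = CUDA.rand(Float32, 64, 1, 5000);

julia> NNlib.storage_typejoin(xT, y)  # a CuArray type here means the CUBLAS path should be eligible

julia> batched_mul(xT, y);  # watch for a @debug line announcing the generic fallback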

@mcabbott
Member

More compact demonstration:

using NNlib
ENV["JULIA_DEBUG"] = NNlib

x, y = randn(64,3,5), randn(64,7,5); # fine
xT = batched_transpose(x); xT2 = copy(xT);
batched_mul(xT,y) ≈ batched_mul(xT2,y)

x, y = randn(64,1,5), randn(64,7,5); # uses fallback
xT = batched_transpose(x); xT2 = copy(xT);
batched_mul(xT,y) ≈ batched_mul(xT2,y)  # still correct

Testing all cases from #268: quite a few use the fallback.

using NNlib, Test
@testset "awkward strides, $T" for T in [Float64] #, ComplexF64]
    @testset "$tA(rand$((sA...,2))) ⊠ $tB(rand$((sB...,2)))" for tA in [identity, batched_adjoint, batched_transpose], sA in [(1,3), (3,1), (3,3)], tB in [identity, batched_adjoint, batched_transpose], sB in [(1,3), (3,1), (3,3)]
        A = tA(rand(T, sA..., 2))
        B = tB(rand(T, sB..., 2))
        size(A,2) == size(B,1) || continue
        C = cat(A[:,:,1] * B[:,:,1], A[:,:,2] * B[:,:,2]; dims=3)
        @test A ⊠ B ≈ C
    end
end;
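
For context, the generic fallback referred to above amounts to something like the following slice-by-slice loop (a schematic, not NNlib's actual code). On a CuArray, each iteration launches its own matmul and allocates its own views, which is consistent with the ~110k allocations and ~40 ms times in the slow benchmark above:

using LinearAlgebra

# Schematic slice-based fallback (hypothetical name; NNlib's real code differs):
function batched_mul_slices(A::AbstractArray{T,3}, B::AbstractArray{T,3}) where T
    C = similar(A, size(A,1), size(B,2), size(A,3))
    for k in axes(C, 3)
        @views mul!(C[:,:,k], A[:,:,k], B[:,:,k])  # one matmul per batch element
    end
    return C
end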

@bjarthur
Contributor Author

Thanks for looking into this!

Would it be worth adding automatic performance-regression testing to NNlib's CI, like the one that exists for Julia Base?

@mcabbott
Member

It might be, although tricky. There are many cases to test, and I'm not sure this would be a huge slowdown on the CPU, where the fast path isn't so different from the fallback.

Now #299 at least tests whether @debug prints anything, in a more exhaustive version of the loop just above. This tests a few hundred different cases; this one (batched_transpose(rand(n,1,b))) was among the edge cases tested for correctness, but there were no explicit tests of which routine is called.
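
For illustration, here is one way such a check can be written with the Test stdlib (a sketch; the actual tests in #299 may be structured differently). @test_logs with no patterns asserts that the expression emits no log records at or above the given level, so the fallback's @debug message would fail the test:

using NNlib, Test, Logging

A, B = randn(3, 3, 2), randn(3, 3, 2)

# Passes only if batched_mul emits no Debug-or-higher log records,
# i.e. the fast BLAS path was taken rather than the @debug-announced fallback.
@test_logs min_level=Logging.Debug batched_mul(A, B)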
