
Slowdown when running multiple large models in parallel #1806

Closed · sash-a opened this issue Dec 10, 2021 · 6 comments

Comments


sash-a commented Dec 10, 2021

When running multiple models in parallel (Julia native threads) I see a change in the scaling behaviour as the models get larger. The slowdown occurs somewhere between a model size of 6,000 and 70,000 parameters.

My initial thought was that copying the larger model to each thread outweighs the speedup from parallelism, but I've tried this with repeats as high as 3000 and the inner loop of testflux as high as 1000 and I see the same curve, so I don't think that's the case. I've also tried using Flux.destructure and putting the parameters in a SharedArray (so only the reconstructor is copied across the threads) before benchmarking, and I again saw similar slowdowns (a rough sketch of that variant is shown after the code below).

using Flux
using Random
using Base.Threads
using BenchmarkTools
using Future

using SharedArrays

function testflux(repeats, nns, rngs)
    @threads for i in 1:repeats
        tid = threadid()
        rng = rngs[tid]
        nn = nns[tid]
        for _ in 1:100
            nn(rand(rng, 5))
        end
    end
end

# https://discourse.julialang.org/t/random-number-and-parallel-execution/57024
function parallel_rngs(rng::MersenneTwister, n::Integer)
    step = big(10)^20
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = copy(rng)
    for i = 2:n
        rngs[i] = Future.randjump(rngs[i-1], step)  # TODO step each by `procrank`
    end
    return rngs
end

function main()
    nns = ntuple(_ -> Chain(Dense(5, 256, tanh),
                            Dense(256, 256, tanh),
                            Dense(256, 5, tanh)), Threads.nthreads())
    for nn in nns  # warm up
        nn(rand(5))
    end

    mt = MersenneTwister()
    rngs = parallel_rngs(mt, Threads.nthreads())

    @show sum(length, Flux.params(first(nns)))
    repeats = 256
    @btime testflux($repeats, $nns, $rngs)
end

main()
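
Roughly, the Flux.destructure / SharedArray variant looked something like this (a sketch to illustrate the idea, not the exact code I benchmarked):

using Flux
using SharedArrays

# Sketch only: share one flat parameter vector and rebuild the model
# from it via the reconstructor returned by Flux.destructure.
nn = Chain(Dense(5, 256, tanh), Dense(256, 256, tanh), Dense(256, 5, tanh))
ps, re = Flux.destructure(nn)              # flat parameter vector + reconstructor
shared = SharedArray{Float32}(length(ps))  # parameters live in shared memory
shared .= ps

nn_shared = re(shared)                     # only `re` needs to be passed to each worker
nn_shared(rand(Float32, 5))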

Here's the graph
lrg: 68613 params
med: 5733 params
sml: 453 params

med and sml scale pretty close to linearly, but lrg just doesn't seem to scale at all, so I'm not sure what the problem is.

[image: speedup vs. number of threads for the three model sizes]

x axis: number of threads
y axis: speedup (serial time / time taken)

Julia 1.6:

pkg> status
     Project ScalableES v0.1.0
      Status `~/Documents/es/ScalableES/Project.toml`
  [c7e460c6] ArgParse v1.1.4
  [fbb218c0] BSON v0.3.3
  [6e4b80f9] BenchmarkTools v0.5.0
  [864edb3b] DataStructures v0.18.10
  [b4f34e82] Distances v0.8.2
  [31c24e10] Distributions v0.23.12
  [587475ba] Flux v0.11.3
  [b8da5b23] HrlMuJoCoEnvs v0.1.0 `~/Documents/es/HrlMuJoCoEnvs`
  [c8e1da08] IterTools v1.3.0
  [6917c76a] LyceumAI v0.2.3
  [db31fed1] LyceumBase v0.2.2
  [48b9757e] LyceumMuJoCo v0.2.3
  [409fd311] LyceumMuJoCoViz v0.2.5
  [da04e1cc] MPI v0.19.0
  [93189219] MuJoCo v0.3.0
  [91a5bcdd] Plots v1.20.0
  [90137ffa] StaticArrays v0.12.5
  [2913bbd2] StatsBase v0.33.9
  [899adc3e] TensorBoardLogger v0.1.18
  [b189fb0b] ThreadPools v2.1.0
  [5c5e3362] UniversalLogger v0.2.0
  [ade2ca70] Dates
  [9fa8497b] Future
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [9a3f8284] Random
  [1a1011a3] SharedArrays
  [10745b16] Statistics

If anyone has some insight into why this is happening, it would be much appreciated.

@sash-a changed the title from "Slowdown when running larger multiple models in parallel" to "Slowdown when running multiple large models in parallel" on Dec 10, 2021
@DhairyaLGandhi (Member)

One comment I would have is to first remove the allocation cost to see how the time is being spent. Next, perform a simple matmul to see whether switching OpenBLAS out for MKL would be better, and third, check the GC times over extended runs. Note too that calling the NNs in a threaded loop is probably not best; try it so that the work is scheduled using the new scheduler.

I would also repeat this with the backwards pass.
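
Something along these lines (a sketch with made-up sizes, mirroring the benchmark's threading pattern) would isolate the matmul from Flux:

using Base.Threads
using BenchmarkTools

# One weight matrix and input vector per thread, same loop shape as testflux.
function testmatmul(repeats, As, xs)
    @threads for i in 1:repeats
        tid = threadid()
        A, x = As[tid], xs[tid]
        for _ in 1:100
            A * x
        end
    end
end

As = ntuple(_ -> randn(Float32, 256, 256), nthreads())
xs = ntuple(_ -> randn(Float32, 256), nthreads())
@btime testmatmul(256, $As, $xs)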


sash-a commented Dec 10, 2021

removing the allocation cost

I'm not quite following this. Do you mean I shouldn't store the nn variable and should instead keep accessing it via nns[tid]?

performing a simple matmul to see whether switching out OpenBLAS to MKL would be better

Do I need to do anything special to switch to BLAS or MKL? Or is it as simple as switching the nn(rand(rng, 5)) call to A * rand(rng, 5)?

calling the NNs in a threaded loop is probably not best

Two things: can you give me a link to the docs for Julia's new scheduler? And shouldn't the scheduler not make a difference here, since this is very much an embarrassingly parallel problem?

@ToucheSir (Member)

Julia threads use shared memory, so no copying should be required.

Do I need to do anything special to switch to BLAS or MKL?

If you're running Julia 1.7, testing with MKL is just an ] add away: https://github.com/JuliaLinearAlgebra/MKL.jl#to-install

As for the question about slowdowns, note that your @threads loop will be competing with BLAS (which is the generic name we give to the library that handles a lot of linear algebra functionality, e.g. MKL provides an implementation of BLAS) because it uses multiple threads as well. You can alleviate this by reducing the number of threads BLAS uses for itself via BLAS.set_num_threads.
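
For example (a minimal sketch; where you place this relative to the benchmark is up to you):

using LinearAlgebra

# Pin BLAS to one thread so its internal threading doesn't compete
# with the @threads loop.
BLAS.set_num_threads(1)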


sash-a commented Dec 10, 2021

your @threads loop will be competing with BLAS

I was just thinking this might be the case and was about to ask how to set it to use one thread, so thanks for that. I'll give it a test and see if it fixes the problem.


sash-a commented Dec 10, 2021

BLAS competing for threads was the problem, thanks for the tip @ToucheSir!

Julia threads use shared memory, so no copying should be required.

I'm a bit confused by this: are they? Why do SharedArrays exist, then?

I'm genuinely curious about this, so I'd appreciate a response, but the issue is solved so I'm closing it.

@sash-a sash-a closed this as completed Dec 10, 2021
@ToucheSir (Member)

SharedArrays are for cross-process coordination. If you haven't seen them yet, this page has a decent rundown: https://docs.julialang.org/en/v1/manual/distributed-computing/#man-shared-arrays
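
A tiny sketch of the distinction (adapted from the idea in those docs, not copied verbatim): a SharedArray lets several worker processes write into the same buffer, which plain threads already get for free with ordinary arrays.

using Distributed, SharedArrays
addprocs(2)
@everywhere using SharedArrays

s = SharedArray{Float64}(10)
# Each worker process writes directly into the same underlying memory.
@sync @distributed for i in 1:10
    s[i] = myid()
end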
