
Slowdown when running multiple large models in parallel #1806

Closed · sash-a opened this issue Dec 10, 2021 · 6 comments

Comments


sash-a commented Dec 10, 2021

When running multiple models in parallel (Julia native threads) I see a change in the scaling behaviour as the models get larger. The slowdown occurs somewhere between a model size of 6,000 and 70,000 parameters.

My initial thought was that copying the larger model to each thread outweighs the speedup from parallelism, but I've tried this with repeats as high as 3000 and the inner loop of testflux as high as 1000 and I see the same curve, so I don't think that's the case. I've also tried using Flux.destructure and putting the parameters in a SharedArray (so only the reconstructor is copied across the threads) before benchmarking, and I again saw similar slowdowns (a rough sketch of that variant is shown after the code below).

using Flux
using Random
using Base.Threads
using BenchmarkTools
using Future

using SharedArrays

function testflux(repeats, nns, rngs)
    @threads for i in 1:repeats
        tid = threadid()
        rng = rngs[tid]
        nn = nns[tid]
        for _ in 1:100
            nn(rand(rng, 5))
        end
    end
end

# https://discourse.julialang.org/t/random-number-and-parallel-execution/57024
function parallel_rngs(rng::MersenneTwister, n::Integer)
    step = big(10)^20
    rngs = Vector{MersenneTwister}(undef, n)
    rngs[1] = copy(rng)
    for i = 2:n
        rngs[i] = Future.randjump(rngs[i-1], step)  # TODO step each by `procrank`
    end
    return rngs
end

function main()
    nns = ntuple(_ -> Chain(Dense(5, 256, tanh),
                            Dense(256, 256, tanh),
                            Dense(256, 5, tanh)), Threads.nthreads())
    for nn in nns  # warm up
        nn(rand(5))
    end

    mt = MersenneTwister()
    rngs = parallel_rngs(mt, Threads.nthreads())

    @show sum(length, Flux.params(first(nns)))
    repeats = 256
    @btime testflux($repeats, $nns, $rngs)
end

main()
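
Roughly, the Flux.destructure / SharedArray variant looked something like this (a sketch to illustrate the idea, not the exact code I benchmarked):

using Flux
using SharedArrays

# Sketch only: share one flat parameter vector and rebuild the model
# from it via the reconstructor returned by Flux.destructure.
nn = Chain(Dense(5, 256, tanh), Dense(256, 256, tanh), Dense(256, 5, tanh))
ps, re = Flux.destructure(nn)              # flat parameter vector + reconstructor
shared = SharedArray{Float32}(length(ps))  # parameters live in shared memory
shared .= ps

nn_shared = re(shared)                     # only `re` needs to be passed to each worker
nn_shared(rand(Float32, 5))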

Here's the graph
lrg: 68613 params
med: 5733 params
sml: 453 params

med and sml scale pretty close to linearly, but lrg just doesn't seem to scale at all, so I'm not sure what the problem is.

[image: speedup vs. number of threads for the three model sizes]

x axis: number of threads
y axis: speedup (serial time / time taken)

Julia 1.6:

pkg> status
     Project ScalableES v0.1.0
      Status `~/Documents/es/ScalableES/Project.toml`
  [c7e460c6] ArgParse v1.1.4
  [fbb218c0] BSON v0.3.3
  [6e4b80f9] BenchmarkTools v0.5.0
  [864edb3b] DataStructures v0.18.10
  [b4f34e82] Distances v0.8.2
  [31c24e10] Distributions v0.23.12
  [587475ba] Flux v0.11.3
  [b8da5b23] HrlMuJoCoEnvs v0.1.0 `~/Documents/es/HrlMuJoCoEnvs`
  [c8e1da08] IterTools v1.3.0
  [6917c76a] LyceumAI v0.2.3
  [db31fed1] LyceumBase v0.2.2
  [48b9757e] LyceumMuJoCo v0.2.3
  [409fd311] LyceumMuJoCoViz v0.2.5
  [da04e1cc] MPI v0.19.0
  [93189219] MuJoCo v0.3.0
  [91a5bcdd] Plots v1.20.0
  [90137ffa] StaticArrays v0.12.5
  [2913bbd2] StatsBase v0.33.9
  [899adc3e] TensorBoardLogger v0.1.18
  [b189fb0b] ThreadPools v2.1.0
  [5c5e3362] UniversalLogger v0.2.0
  [ade2ca70] Dates
  [9fa8497b] Future
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [9a3f8284] Random
  [1a1011a3] SharedArrays
  [10745b16] Statistics

If anyone has some insight into why this is happening, it would be much appreciated.

@sash-a changed the title from "Slowdown when running larger multiple models in parallel" to "Slowdown when running multiple large models in parallel" on Dec 10, 2021
@DhairyaLGandhi (Member)

One comment I would have is to first remove the allocation cost to see how the time is being spent. Next, perform a simple matmul to see whether switching OpenBLAS out for MKL would be better, and third, check the GC times over extended runs. Note too that calling the NNs in a threaded loop is probably not best; try it so that the work is scheduled using the new scheduler.

I would also repeat this with the backwards pass.
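
Something along these lines (a sketch with made-up sizes, mirroring the benchmark's threading pattern) would isolate the matmul from Flux:

using Base.Threads
using BenchmarkTools

# One weight matrix and input vector per thread, same loop shape as testflux.
function testmatmul(repeats, As, xs)
    @threads for i in 1:repeats
        tid = threadid()
        A, x = As[tid], xs[tid]
        for _ in 1:100
            A * x
        end
    end
end

As = ntuple(_ -> randn(Float32, 256, 256), nthreads())
xs = ntuple(_ -> randn(Float32, 256), nthreads())
@btime testmatmul(256, $As, $xs)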


sash-a commented Dec 10, 2021

removing the allocation cost

I'm not quite following this. Do you mean I shouldn't store the nn variable and should instead keep accessing it via nns[tid]?

performing a simple matmul to see whether switching out OpenBLAS to MKL would be better

Do I need to do anything special to switch to BLAS or MKL? Or is it as simple as switching the nn(rand(rng, 5)) call to A * rand(rng, 5)?

calling the NNs in a threaded loop is probably not best

Two things: can you give me a link to the docs for Julia's new scheduler? And shouldn't the scheduler not make a difference here, since this is very much an embarrassingly parallel problem?

@ToucheSir (Member)

Julia threads use shared memory, so no copying should be required.

Do I need to do anything special to switch to BLAS or MKL?

If you're running Julia 1.7, testing with MKL is just an ] add away: https://github.com/JuliaLinearAlgebra/MKL.jl#to-install

As for the question about slowdowns, note that your @threads loop will be competing with BLAS (which is the generic name we give to the library that handles a lot of linear algebra functionality, e.g. MKL provides an implementation of BLAS) because it uses multiple threads as well. You can alleviate this by reducing the number of threads BLAS uses for itself via BLAS.set_num_threads.
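
For example (a minimal sketch; where you place this relative to the benchmark is up to you):

using LinearAlgebra

# Pin BLAS to one thread so its internal threading doesn't compete
# with the @threads loop.
BLAS.set_num_threads(1)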


sash-a commented Dec 10, 2021

your @threads loop will be competing with BLAS

I was just thinking this might be the case and was about to ask how to set it to use one thread, so thanks for that. I'll give it a test and see if it fixes the problem.


sash-a commented Dec 10, 2021

BLAS competing for threads was the problem, thanks for the tip @ToucheSir!

Julia threads use shared memory, so no copying should be required.

I'm a bit confused by this: are they? Why do SharedArrays exist, then?

I'm genuinely curious about this, so I'd appreciate a response, but the issue is solved so I'm closing it.

@sash-a sash-a closed this as completed Dec 10, 2021
@ToucheSir (Member)

SharedArrays are for cross-process coordination. If you haven't seen them yet, this page has a decent rundown: https://docs.julialang.org/en/v1/manual/distributed-computing/#man-shared-arrays
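
A tiny sketch of the distinction (adapted from the idea in those docs, not copied verbatim): a SharedArray lets several worker processes write into the same buffer, which plain threads already get for free with ordinary arrays.

using Distributed, SharedArrays
addprocs(2)
@everywhere using SharedArrays

s = SharedArray{Float64}(10)
# Each worker process writes directly into the same underlying memory.
@sync @distributed for i in 1:10
    s[i] = myid()
end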
