Slowdown when running multiple large models in parallel #1806
Comments
One comment I would have is to first remove the allocation cost to see how the time is being spent. Next, perform a simple matmul to see whether switching out OpenBLAS for MKL would be better, and third, find out the GC times over extended runs. Note too that calling the NNs in a threaded loop is probably not best. Try it so that the work is scheduled using the new scheduler. I would also repeat this with the backwards pass.
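A minimal sketch of those checks, assuming BenchmarkTools.jl is available; the matrix sizes and the `work` helper are illustrative placeholders, not anything from the original benchmark:

```julia
using LinearAlgebra, BenchmarkTools

A = rand(Float32, 512, 512)
B = rand(Float32, 512, 512)
C = similar(A)

# 1. Time a bare matmul with a preallocated output, so allocation cost is excluded.
#    Rerun this after loading MKL to compare it against OpenBLAS.
@btime mul!($C, $A, $B)

# 2. Check how much of a long run is spent in GC.
stats = @timed for _ in 1:1_000
    mul!(C, A, B)
end
println("GC time: ", stats.gctime, " s of ", stats.time, " s total")

# 3. Schedule independent pieces of work with the composable scheduler
#    (Threads.@spawn) instead of a plain threaded loop.
work() = sum(A * B)                     # stand-in for one forward pass
tasks = [Threads.@spawn work() for _ in 1:Threads.nthreads()]
results = fetch.(tasks)
```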
Not quite following this? Do you mean I shouldn't store the
Do I need to do anything special to switch to BLAS or MKL? Or is it as simple as switching the
Two things: can you give me a link to the docs for Julia's new scheduler? But shouldn't a scheduler make no difference here, since this is very much an embarrassingly parallel problem?
Julia threads use shared memory, so no copying should be required.
If you're running Julia 1.7, testing with MKL is as simple as loading the MKL.jl package. As for the question about slowdowns, note that your BLAS library spawns its own threads by default, which will compete with Julia's threads.
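A minimal sketch of both points, assuming the MKL.jl package is installed:

```julia
using LinearAlgebra

println(BLAS.get_config())   # shows which BLAS library is currently loaded (Julia 1.7+)

using MKL                    # on Julia 1.7+, loading MKL.jl swaps the default BLAS for MKL
println(BLAS.get_config())

# Pin BLAS to a single thread so it doesn't compete with Julia's own threads
# when the forward passes are run in a threaded loop.
BLAS.set_num_threads(1)
```

When the parallelism comes from Julia threads, keeping BLAS at one thread avoids oversubscribing the cores.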
I was just thinking this might be the case and was about to ask how to set it to use 1 thread, so thanks for that. I'll give that a test and see if it fixes the problem.
BLAS competing for threads was the problem, thanks for the tip @ToucheSir!
I'm a bit confused by this: are they? Why do SharedArrays exist, then? I'm genuinely curious about this so would appreciate a response, but the issue is solved, so I'm closing it.
SharedArrays are for cross-process coordination. If you haven't seen them yet, this page has a decent rundown: https://docs.julialang.org/en/v1/manual/distributed-computing/#man-shared-arrays
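For illustration, a small sketch of the cross-process use case, assuming you're willing to start a couple of worker processes; none of this comes from the thread above:

```julia
using Distributed
addprocs(2)                      # start two worker processes
@everywhere using SharedArrays

# A SharedArray lives in shared memory that all local processes can see,
# which is what makes it useful across processes; threads already share memory.
s = SharedArray{Float64}(10)
@sync @distributed for i in eachindex(s)
    s[i] = i * myid()            # each worker writes into the same backing array
end
println(s)
```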
When running multiple models in parallel (Julia native threads) I see a change in the scaling as the models get larger. The slowdown occurs somewhere between a model size of 6,000 and 70,000 parameters.
My initial thought is that this is because copying the larger model across each thread outweighs the speedup from parallelism, but I've tried this with `repeats` as high as 3000 and the inner loop of `testflux` as high as 1000 and I see the same curve, so I don't think that is the case. I've also tried this using `Flux.destructure` and putting the parameters in a `SharedArray` (thus only copying the `reconstructor` method across the threads) before benchmarking, and I again saw similar slowdowns (see the sketch at the end of this post).

Here's the graph:
lrg: 68613 params
med: 5733 params
sml: 453 params
x axis: number of threads
y axis: speedup (serial time / time taken)

med and small see pretty close to linear performance, but lrg just doesn't seem to scale at all, so I'm not sure what the problem is.
Julia 1.6:
If anyone has some insight into why this is happening, it would be much appreciated.
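For reference, here is a minimal sketch of the kind of threaded benchmark described above (assuming Flux.jl; the layer sizes and the bodies of `testflux` and `benchmark` are illustrative placeholders, not the actual code):

```julia
using Flux

# Roughly the size of the "lrg" model above (~66k parameters); layer sizes are placeholders.
model = Chain(Dense(128, 256, relu), Dense(256, 128))
x = rand(Float32, 128, 32)

# One unit of work: run the forward pass `inner` times.
function testflux(m, x; inner = 100)
    for _ in 1:inner
        m(x)
    end
end

# Run one model copy per task across Julia threads and time the whole thing.
function benchmark(ntasks; repeats = 100)
    models = [deepcopy(model) for _ in 1:ntasks]
    return @elapsed Threads.@threads for i in 1:ntasks
        for _ in 1:repeats
            testflux(models[i], x)
        end
    end
end

benchmark(Threads.nthreads())
```

In the discussion above, the scaling issue turned out to be BLAS threads competing with Julia's, so setting the BLAS thread count to 1 before running this kind of loop was the fix.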