Does the number of threads affect the executed code? #34102
So if I understand correctly, this is an instance of poor performance scaling when adding threads? |
It is a little bit more interesting (and harder to understand). The numerical work scales (should scale) perfectly with the number of elements to process: 1/8 of the elements, 1/8 of the time. That is not the case here, however. For 1 million elements the numerical work takes 20 seconds, so for 1/8 of the elements it should take 2.5 seconds. It doesn't; it takes 14 seconds. For a scaling problem one would expect the numerical work to take progressively less time as the number of threads increases, with the thread overhead eventually making additional threads pointless. What doesn't work here is that the time for the numerical work doesn't decrease the way it is expected to, as if the number of threads had something to do with the time it takes to do the numerical work, in some complicated way. It is almost as if the numerical work for the threads were done in serial instead of in parallel. |
It still seems you assume that the rate of work of one thread is independent of how many other threads are running? Or, are you saying that, for a fixed number of threads, doubling the number of elements increases the runtime by more than a factor of two? |
I started from 1 million elements. Halving that by using two threads, each
operating on 500 thousand elements should allow each thread to finish in
half the time. And so on. That is not observed in my simulations. The
amount of time spent computing in the work routine decreases with an
increase of the number of threads, but in some complicated way, not
matching the above predictions. For example, using 8 threads, each
operating on 1000000/8 elements, the work routine for each thread should
only spend 2.5 seconds computing. The measured time was 14 seconds (!)
spent inside the work routine of each thread.
|
The functions in your thread blocks include loops over elements where each iteration involves linear algebra with small (but not tiny) matrices. It appears that OpenBLAS uses a buffer pool protected by a system lock, so these will generate a lot of lock contention. edit: it's just a Mutex |
A quick check would be to set OPENBLAS_NUM_THREADS=1, so that blas calls do not get multi-threaded. |
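For reference, a minimal sketch of the two usual ways to try this (assuming OpenBLAS is the backing BLAS library):

```julia
# Option 1: set the environment variable before starting Julia
#   $ OPENBLAS_NUM_THREADS=1 julia myscript.jl

# Option 2: from within Julia, via the LinearAlgebra standard library
using LinearAlgebra
BLAS.set_num_threads(1)   # restrict the BLAS library to a single thread
```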
Thank you for your suggestions. Alas, no difference. I have
I check the number of threads with
so it is set to 1. |
@RalphAS : Would you know what the cutoff for using threading in BLAS is, in terms of the matrix dimension? |
If I set the number of threads to 1, would BLAS thread synchronization be truly turned off, or is it possible that it is still happening (except for a single thread)? There is at least one place where I could find
.... |
I have done the threading in two ways, but both get the same result. So I don't think I messed up writing the threaded execution. First using
Second, using the
The results have of course been checked for correctness: I get the same results with serial execution, and with an arbitrary number of threads. |
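For illustration (the original snippets were stripped from this transcript), the two patterns described above might look roughly like the following sketch; `work!`, `chunks`, and `results` are hypothetical names standing in for the actual code:

```julia
using Base.Threads

# Pattern 1: @threads splits the chunk indices across the available threads.
function run_with_threads!(results, chunks, work!)
    @threads for i in eachindex(chunks)
        results[i] = work!(chunks[i])    # purely numerical work on one chunk
    end
    return results
end

# Pattern 2: spawn one task per chunk and fetch the results.
function run_with_spawn!(results, chunks, work!)
    tasks = Vector{Task}(undef, length(chunks))
    for (i, c) in enumerate(chunks)
        tasks[i] = Threads.@spawn work!(c)
    end
    for (i, t) in enumerate(tasks)
        results[i] = fetch(t)            # wait for and collect each task's result
    end
    return results
end
```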
Only with perfect scaling, and there could be many factors that cause it not to be perfect. For element assembly I've observed pretty good scaling, though (https://kristofferc.github.io/JuAFEM.jl/dev/examples/threaded_assembly/) |
I was talking about the work routine only, i.e. without the thread overhead
|
What do you mean by "without the thread overhead"? I still don't understand how you are not assuming that the number of elements per second a thread can process is independent of the total number of threads running concurrently, e.g. when you say
|
The number of elements processed in the WORK ROUTINE (the routine that does the purely numerical work) should determine the time it spends: half the elements, half the time. Now when the overhead of threading is added, perfect scaling is unattainable, but that is not my concern here. Does it make sense now? |
Why? The fact that other threads are running concurrently might make this not true? |
Of course assuming that each thread can run uninterrupted (which may be the case for a sufficiently low load on the machine). |
Remember that the work routine does purely numerical work, of precisely defined amount. No communication, no synchronization. |
There are still a number of reasons why you won't get perfect scaling even in purely numerical workloads. You can have false sharing, oversubscription of threads (like the discussion about OpenBLAS), higher pressure on memory, relocation of threads to other cores, etc. Allocations in particular I have found to be bad for scaling threaded workloads, so make sure you are allocating as little as possible and measure GC time. Scaling like
doesn't seem too weird at all. Basically, the comment
is just not true and seems to be the core (no pun intended) of this issue. |
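A minimal sketch of how one might check for allocations and GC time inside the threaded work (again, `work!` and `chunks` are placeholder names, not the actual code from this thread):

```julia
using Base.Threads

# Time each chunk's work and report allocations and GC time per thread;
# large allocation counts or nonzero GC time are common causes of poor scaling.
function timed_work!(results, chunks, work!)
    @threads for i in eachindex(chunks)
        val, t, bytes, gctime = @timed work!(chunks[i])
        results[i] = val
        println("thread $(threadid()): $(round(t, digits=2)) s, ",
                "$(bytes) bytes allocated, $(round(gctime, digits=2)) s in GC")
    end
    return results
end
```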
Even with just two threads the timing does not work. I go to great lengths to avoid allocations. Not sure what is going on with BLAS yet. Setting the number of BLAS threads to 1 did not do anything. |
@KristofferC : I think we are talking at cross purposes. It seems to me you mean a situation with all sorts of bad things happening. I mean a situation where things go well. There seems to be something going wrong in my simulations, which might explain the bad timings. The question is what needs to be fixed to make it better. |
It would greatly help to have a minimal reproducer. |
@PetrKryslUCSD these kinds of long-form discussions on how to optimize an application and get the best performance out of threading are better suited for the Discourse forum. It is hard to judge what the scaling of your application should be, and doing so requires some digging in. I would recommend doing some profiling. |
@vchuravy : I don't see why this issue was closed. I think we have established that there is some interaction between the threading at Julia source code level and BLAS. In my opinion that is part of Base. As such it should be discussed here. |
You can swap out the BLAS calls for generic non-threaded versions and see what happens. Basically, so far, this has only been an example of an application with non-linear scaling which is more the rule than the exception so it is unclear what to do here. |
This seems like a fairly concrete issue with a reproducer, even if not a minimal one. Reopening to at least get to more of a conclusion here. |
Ok so what should be done here? |
Now this is interesting: how do I swap out threaded BLAS for non-threaded?
|
I am replacing all BLAS calls with hand-coded routines. I will post the results soon. |
One way would be to change the |
I replaced all (most? not sure yet) BLAS calls with hand-coded Julia. The results are encouraging, but they also confuse the issues further, at least in some respects.
The last item is clearly a cause for concern. People won't be able to efficiently utilize Julia threads unless the interaction with lower-level libraries can be resolved. |
Yes, Julia and BLAS do not know about each other's threads. We need to work through this: #32786 |
OpenBLAS has a fairly high threshold for multithreading: for example the version I've studied is serial up to 256K products in GEMM (e.g. 64x64 times 64x64). As I tried to say earlier, the issue here is not oversubscription of threads but competition for access to memory management (similar to GC in Julia itself). For larger matrices there is a plan to make BLAS multi-threading cooperative with Julia's runtime. For middle-sized matrices there may be enough work to make up for the lock contention. BLAS is just not designed for small matrices; it works, but not well. For tiny matrices (up to 3x3) Julia |
We do set the multi-threading threshold to 50. I assume that means multi-threading kicks in only for matrices larger than 50x50. Would be great to see those new interfaces get developed! |
So, to verify that I understand: I think you're saying that BLAS allocates some workspace(s),
This sounds intriguing: what is "batch linear algebra"? Thanks for your help. |
Upon rewriting my code to avoid BLAS in any form, on a Linux machine (64 cores, 256 gigabytes of memory, the maximum number of threads to use set to 32), the following parallel efficiencies could be achieved: It would appear that the current thread scheduling is quite functional. What does need some improvement is the management of expectations when a user combines user-level threaded code with library-level multithreaded code (such as BLAS). Currently the documentation does not manage expectations well: there should be a clear statement that the current thread scheduler is not able to coordinate with BLAS. From what I have seen, writing multithreaded code that happens to call BLAS will lead to disappointment. Furthermore, there should be a warning in the documentation that currently setting the number of BLAS threads to 1 will NOT improve the performance of multithreaded code by reducing contention within BLAS. I'd be happy to draft something, should you agree. |
Does it mean it'd be better to use a pure-Julia implementation for matrices smaller than 50x50? For example, we use a pure-Julia implementation in |
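As a sketch of what avoiding BLAS for small matrices can look like (this is an illustration, not necessarily the implementation referred to above), a hand-coded multiply stays in pure Julia and never touches the BLAS buffer pool or its lock:

```julia
using LinearAlgebra

# Hand-coded triple-loop multiply: pure Julia, no BLAS, no locks, no hidden threading.
function mymul!(C, A, B)
    fill!(C, zero(eltype(C)))
    @inbounds for j in axes(B, 2), k in axes(A, 2)
        bkj = B[k, j]
        for i in axes(A, 1)
            C[i, j] += A[i, k] * bkj
        end
    end
    return C
end

A = rand(8, 8); B = rand(8, 8); C = similar(A)
mymul!(C, A, B)
@assert C ≈ A * B    # A * B dispatches to (possibly threaded) BLAS gemm for Matrix{Float64}
```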
My understanding is that what I mentioned is orthogonal to #32786. #32786 is useful if you want to let OpenBLAS use multiple threads (when the problem is large enough). Using a pure-Julia implementation when the problem is not large enough is useful for avoiding the lock contention (#34102 (comment)). |
Re-closing, as there doesn't seem to be anything actionable here that isn't already covered by other issues. |
The work is proportional to the number of elements.
For a mesh of 128000 elements both a serial and 1-thread
simulation carry out the computational work in 2.5 seconds.
For a mesh of 1024000 elements both a serial and 1-thread
simulation carry out the computational work in around 20.0 seconds.
So, eight times more work, eight times longer.
Now comes the weird part. When I use 2 threads, so that each thread
works on 512000 elements, the work per thread should take
10 seconds. However, the work procedure shows that it consumes
around 16.5 seconds.
When I use 4 threads, each thread works on 256,000 elements,
and consequently the work procedure should execute
in 5 seconds. However, the work procedure actually shows
that it consumes roughly 15.6 seconds.
With 8 threads, each thread works on 128,000 elements,
and the work procedure should only take 2.5 seconds.
However, it reports to take roughly 14 seconds.
The threaded execution therefore looks like this:
Elements per thread   Number of threads   Execution time (s)
1024000               1                   20
512000                2                   16.5
256000                4                   15.6
128000                8                   14
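For concreteness, the parallel efficiency of the work routine implied by these numbers (ideal per-thread time divided by measured per-thread time) comes out to roughly 61%, 32%, and 18% for 2, 4, and 8 threads:

```julia
serial_time = 20.0                                    # seconds for 1024000 elements on 1 thread
measured    = Dict(2 => 16.5, 4 => 15.6, 8 => 14.0)   # measured per-thread times from the table

for p in (2, 4, 8)
    ideal = serial_time / p                           # expected time with perfect scaling
    eff   = ideal / measured[p]
    println("$p threads: ideal $(ideal) s, measured $(measured[p]) s, ",
            "efficiency $(round(100 * eff, digits=1)) %")
end
```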
The weird thing is I time the interior of the work procedure.
So that should exclude any overhead associated with threading.
However, as you can see the number of threads actually affects
how much time the work procedure spends doing the work.
The total amount of time farming out the work to the threads is
very small. The total amount of time collecting the data with
wait is pretty much equal to the amount of time reported by
the work procedure. As if the overhead related to threading were
very small.
The whole thing can be exercised by
followed by
and
I'm sorry I don't have a more minimal working example!