Does the number of threads affect the executed code? #34102
So if I understand correctly, this is an instance of poor performance scaling when adding threads? |
It is a little bit more interesting (and harder to understand). The numerical work scales (should scale) perfectly with the number of elements to process: 1/8 of the elements, 1/8 of the time. That is not the case here, however. For 1 million elements the numerical work takes 20 seconds, so for 1/8 of the elements it should take 2.5 seconds. It doesn't; it takes 14 seconds. For a scaling problem one would expect the numerical work to take progressively less time as the number of threads increases, with the thread overhead eventually making additional threads pointless. What doesn't work here is that the time for the numerical work doesn't decrease the way it is expected to, as if the number of threads had something to do with the time it takes to do the numerical work, in some complicated way. It is almost as if the numerical work for the threads were done in serial instead of in parallel. |
It still seems you assume that the rate of work of one thread is independent of how many other threads are running? Or, are you saying that, for a fixed number of threads, doubling the number of elements increases the runtime by more than a factor of two? |
I started from 1 million elements. Halving that by using two threads, each
operating on 500 thousand elements should allow each thread to finish in
half the time. And so on. That is not observed in my simulations. The
amount of time spent computing in the work routine decreases with an
increase of the number of threads, but in some complicated way, not
matching the above predictions. For example, using 8 threads, each
operating on 1000000/8 elements, the work routine for each thread should
only spend 2.5 seconds computing. The measured time was 14 seconds (!)
spent inside the work routine of each thread.
|
The functions in your thread blocks include loops over elements where each iteration involves linear algebra with small (but not tiny) matrices. It appears that OpenBLAS uses a buffer pool protected by a system lock, so these will generate a lot of lock contention. edit: it's just a Mutex |
A quick check would be to set OPENBLAS_NUM_THREADS=1, so that blas calls do not get multi-threaded. |
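For reference, a minimal sketch of the two usual ways to try this (assuming OpenBLAS is the backing BLAS library):

```julia
# Option 1: set the environment variable before starting Julia
#   $ OPENBLAS_NUM_THREADS=1 julia myscript.jl

# Option 2: from within Julia, via the LinearAlgebra standard library
using LinearAlgebra
BLAS.set_num_threads(1)   # restrict the BLAS library to a single thread
```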
Thank you for your suggestions. Alas, no difference. I have
I check the number of threads with
so it is set to 1. |
@RalphAS : Would you know what the cutoff for using threading in BLAS is, in terms of the matrix dimension? |
If I set the number of threads to 1, would BLAS thread synchronization be truly turned off, or is it possible that it is still happening (except for a single thread)? There is at least one place where I could find
.... |
I have done the threading in two ways, but both get the same result. So I don't think I messed up writing the threaded execution. First using
Second, using the
The results have of course been checked for correctness: I get the same results with serial execution, and with an arbitrary number of threads. |
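For illustration (the original snippets were stripped from this transcript), the two patterns described above might look roughly like the following sketch; `work!`, `chunks`, and `results` are hypothetical names standing in for the actual code:

```julia
using Base.Threads

# Pattern 1: @threads splits the chunk indices across the available threads.
function run_with_threads!(results, chunks, work!)
    @threads for i in eachindex(chunks)
        results[i] = work!(chunks[i])    # purely numerical work on one chunk
    end
    return results
end

# Pattern 2: spawn one task per chunk and fetch the results.
function run_with_spawn!(results, chunks, work!)
    tasks = Vector{Task}(undef, length(chunks))
    for (i, c) in enumerate(chunks)
        tasks[i] = Threads.@spawn work!(c)
    end
    for (i, t) in enumerate(tasks)
        results[i] = fetch(t)            # wait for and collect each task's result
    end
    return results
end
```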
Only with perfect scaling, and there could be many factors that cause it not to be perfect. For element assembly I've observed pretty good scaling, though (https://kristofferc.github.io/JuAFEM.jl/dev/examples/threaded_assembly/) |
I was talking about the work routine only, i.e. without the thread overhead
|
What do you mean by "without the thread overhead"? I still don't understand how you are not assuming that the number of elements per second a thread can process is independent of the total number of threads running concurrently, e.g. when you say
|
The number of elements processed in the WORK ROUTINE (the routine that does the purely numerical work) should determine the time it spends: half the elements, half the time. Now when the overhead of threading is added, perfect scaling is unattainable, but that is not my concern here. Does it make sense now? |
Why? The fact that other threads are running concurrently might make this not true? |
Of course assuming that each thread can run uninterrupted (which may be the case for a sufficiently low load on the machine). |
Remember that the work routine does purely numerical work, of precisely defined amount. No communication, no synchronization. |
There are still a number of reasons why you won't get perfect scaling even in purely numerical workloads. You can have false sharing, oversubscription of threads (like the discussion about OpenBLAS), higher pressure on memory, relocation of threads to other cores, etc. Allocations in particular I have found to be bad for scaling threaded workloads, so make sure you are allocating as little as possible and measure GC time. Scaling like
doesn't seem too weird at all. Basically, the comment
is just not true and seems to be the core (no pun intended) of this issue. |
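A minimal sketch of how one might check for allocations and GC time inside the threaded work (again, `work!` and `chunks` are placeholder names, not the actual code from this thread):

```julia
using Base.Threads

# Time each chunk's work and report allocations and GC time per thread;
# large allocation counts or nonzero GC time are common causes of poor scaling.
function timed_work!(results, chunks, work!)
    @threads for i in eachindex(chunks)
        val, t, bytes, gctime = @timed work!(chunks[i])
        results[i] = val
        println("thread $(threadid()): $(round(t, digits=2)) s, ",
                "$(bytes) bytes allocated, $(round(gctime, digits=2)) s in GC")
    end
    return results
end
```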
Even with just two threads the timing does not work. I go to great lengths to avoid allocations. Not sure what is going on with BLAS yet. Setting the number of BLAS threads to 1 did not do anything. |
@KristofferC : I think we are talking at cross purposes. It seems to me you mean a situation with all sorts of bad things happening. I mean a situation where things go well. There seems to be something going wrong in my simulations, which might explain the bad timings. The question is what needs to be fixed to make it better. |
It would greatly help to have a minimal reproducer. |
@PetrKryslUCSD these kinds of long-form discussions on how to optimize an application and get the best performance out of threading are better suited for the Discourse forum. It is hard to judge what the scaling of your application should be, and doing so requires some digging in. I would recommend doing some profiling. |
@vchuravy : I don't see why this issue was closed. I think we have established that there is some interaction between the threading at Julia source code level and BLAS. In my opinion that is part of Base. As such it should be discussed here. |
You can swap out the BLAS calls for generic non-threaded versions and see what happens. Basically, so far, this has only been an example of an application with non-linear scaling which is more the rule than the exception so it is unclear what to do here. |
This seems like a fairly concrete issue with a reproducer, even if not a minimal one. Reopening to at least get to more of a conclusion here. |
Ok so what should be done here? |
Now this is interesting: how do I swap out threaded BLAS for non-threaded?
|
I am replacing all BLAS calls with hand-coded routines. I will post the results soon. |
One way would be to change the |
I replaced all (most? not sure yet) BLAS calls with hand-coded Julia. The results are encouraging, but they also confuse the issues further, at least in some respects.
The last item is clearly a cause for concern. People won't be able to efficiently utilize Julia threads unless the interaction with lower-level libraries can be resolved. |
Yes, Julia and BLAS do not know about each other's threads. We need to work through this: #32786 |
OpenBLAS has a fairly high threshold for multithreading: for example the version I've studied is serial up to 256K products in GEMM (e.g. 64x64 times 64x64). As I tried to say earlier, the issue here is not oversubscription of threads but competition for access to memory management (similar to GC in Julia itself). For larger matrices there is a plan to make BLAS multi-threading cooperative with Julia's runtime. For middle-sized matrices there may be enough work to make up for the lock contention. BLAS is just not designed for small matrices; it works, but not well. For tiny matrices (up to 3x3) Julia |
We do set the multi-threading threshold to 50. I assume that means multi-threading kicks in only for matrices larger than 50x50. Would be great to see those new interfaces get developed! |
So, to verify that I understand: I think you're saying that BLAS allocates some workspace(s),
This sounds intriguing: what is "batch linear algebra"? Thanks for your help. |
Upon rewriting my code to avoid BLAS in any form, on a Linux machine (64 cores, 256 gigabytes of memory, the maximum number of threads to use set to 32), the following parallel efficiencies could be achieved: It would appear that the current thread scheduling is quite functional. What does need some improvement is the management of expectations when a user combines user-level threaded code with library-level multithreaded code (such as BLAS). Currently the documentation does not manage expectations well: there should be a clear statement that the current thread scheduler is not able to coordinate with BLAS. From what I have seen, writing multithreaded code that happens to call BLAS will lead to disappointment. Furthermore, there should be a warning in the documentation that currently setting the number of BLAS threads to 1 will NOT improve the performance of multithreaded code by reducing contention within BLAS. I'd be happy to draft something, should you agree. |
Does it mean it'd be better to use a pure-Julia implementation for matrices smaller than 50x50? For example, we use a pure-Julia implementation in |
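As a sketch of what avoiding BLAS for small matrices can look like (this is an illustration, not necessarily the implementation referred to above), a hand-coded multiply stays in pure Julia and never touches the BLAS buffer pool or its lock:

```julia
using LinearAlgebra

# Hand-coded triple-loop multiply: pure Julia, no BLAS, no locks, no hidden threading.
function mymul!(C, A, B)
    fill!(C, zero(eltype(C)))
    @inbounds for j in axes(B, 2), k in axes(A, 2)
        bkj = B[k, j]
        for i in axes(A, 1)
            C[i, j] += A[i, k] * bkj
        end
    end
    return C
end

A = rand(8, 8); B = rand(8, 8); C = similar(A)
mymul!(C, A, B)
@assert C ≈ A * B    # A * B dispatches to (possibly threaded) BLAS gemm for Matrix{Float64}
```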
My understanding is that what I mentioned is orthogonal to #32786. #32786 is useful if you want to let OpenBLAS use multiple threads (when the problem is large enough). Using a pure-Julia implementation when the problem is not large enough is useful for avoiding the lock contention (#34102 (comment)). |
Re-closing, as there doesn't seem to be anything actionable here that isn't already covered by other issues. |
The work is proportional to the number of elements.
For a mesh of 128000 elements both a serial and 1-thread
simulation carry out the computational work in 2.5 seconds.
For a mesh of 1024000 elements both a serial and 1-thread
simulation carry out the computational work in around 20.0 seconds.
So, eight times more work, eight times longer.
Now comes the weird part. When I use 2 threads, so that each thread
works on 512000 elements, the work per thread should take
10 seconds. However, the work procedure shows that it consumes
around 16.5 seconds.
When I use 4 threads, each thread works on 256,000 elements,
and consequently the work procedure should execute
in 5 seconds. However, the work procedure actually shows
that it consumes roughly 15.6 seconds.
With 8 threads, each thread works on 128,000 elements,
and the work procedure should only take 2.5 seconds.
However, it reports to take roughly 14 seconds.
The threaded execution therefore looks like this:
Elements per thread   Number of threads   Execution time (s)
1024000               1                   20
512000                2                   16.5
256000                4                   15.6
128000                8                   14
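For concreteness, the parallel efficiency of the work routine implied by these numbers (ideal per-thread time divided by measured per-thread time) comes out to roughly 61%, 32%, and 18% for 2, 4, and 8 threads:

```julia
serial_time = 20.0                                    # seconds for 1024000 elements on 1 thread
measured    = Dict(2 => 16.5, 4 => 15.6, 8 => 14.0)   # measured per-thread times from the table

for p in (2, 4, 8)
    ideal = serial_time / p                           # expected time with perfect scaling
    eff   = ideal / measured[p]
    println("$p threads: ideal $(ideal) s, measured $(measured[p]) s, ",
            "efficiency $(round(100 * eff, digits=1)) %")
end
```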
The weird thing is I time the interior of the work procedure.
So that should exclude any overhead associated with threading.
However, as you can see the number of threads actually affects
how much time the work procedure spends doing the work.
The total amount of time farming out the work to the threads is
very small. The total amount of time collecting the data with
wait is pretty much equal to the amount of time reported by
the work procedure. As if the overhead related to threading were
very small.
The whole thing can be exercised by
followed by
and
I'm sorry I don't have a more minimal working example!