Poor performance of OpenBLAS matrix-vector multiplication on small sizes #3239
See also #2489. We could handle small matvecs on our own, but this should probably also be filed as an OpenBLAS bug.
I'm quite surprised by how large an N this extends to. I can partially confirm this, but on my machine OpenBLAS does much better:
OpenBLAS is very architecture-specific.
On my machine, I get performance more like Tim's:
Is this with `OPENBLAS_NUM_THREADS=1`?
Perhaps we should have some general infrastructure to deal with such cases, handling the changeover from native Julia code to library implementations. Kind of like multiple dispatch, but based on problem size.
Yes! We can call it an "if statement".
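For concreteness, a minimal sketch of that approach. The names and the cutoff are illustrative (nothing here is in Base), and the cutoff would need tuning per machine:

```julia
# A naive matvec kernel for small sizes, avoiding BLAS call overhead.
function naive_matvec(A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    y = zeros(m)
    for j in 1:n, i in 1:m
        y[i] += A[i, j] * x[j]
    end
    return y
end

# "Multiple dispatch on problem size", i.e. the if statement.
matvec(A, x; cutoff = 64) = size(A, 1) <= cutoff ? naive_matvec(A, x) : A * x
```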
Setting
@ViralBShah We could, but this is really an OpenBLAS problem. MKL also uses threading, yet it handles small matrices gracefully. I suggest we analyze OpenBLAS's performance on Windows a little further, and if this behavior is a pattern we can disable threading altogether (at least on Windows). Can anybody check whether this problem exists on Linux/macOS?
On my (oldish, non-hyperthreading) machines, running fresh 64-bit Windows and Linux Julia builds, OpenBLAS gemv does not look too good either. Below is a slightly modified timing table, where I have replaced `naive_A_mul_B` with a cache-friendly version that iterates down columns (sketched below). The code I used is here: https://github.com/bsxfan/ad4julia/blob/master/timegemv.jl. (Compared to @vharavy's test code, I think my normalized timing for n=4 suffers a bit from the overhead of deciding which of my MV functions to call, but for larger n, the division by n^2 makes this constant overhead insignificant.)

For my machines at least, I would conclude that you don't want to be using any OpenBLAS-based matrix-vector multiplication for n<60 or so; for larger n, multithreading doesn't hurt, but it doesn't help very much either. (For big matrix-matrix multiplications, however, multithreading does help.)

Windows (2 cores):
Linux (8 cores):
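The loop-order difference looks roughly like this (a reconstruction, not the actual timegemv.jl code). Julia arrays are column-major, so the inner loop should run down a column of A:

```julia
function matvec_rows!(y, A, x)            # cache-unfriendly: stride-m reads of A
    m, n = size(A)
    for i in 1:m
        s = zero(eltype(y))
        for j in 1:n
            s += A[i, j] * x[j]
        end
        y[i] = s
    end
    return y
end

function matvec_cols!(y, A, x)            # cache-friendly: unit-stride reads of A
    m, n = size(A)
    fill!(y, zero(eltype(y)))
    for j in 1:n
        xj = x[j]
        for i in 1:m
            y[i] += A[i, j] * xj          # contiguous walk down column j
        end
    end
    return y
end
```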
Cc: @xianyi
For a moment I thought my timings here were in contrast with this post (https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/s3_azLP6oyY), where I compared Julia's timings. Curiously, exactly at n=1024 there is a bad spot. Using my test code referred to above, I compare multithreaded gemv time relative to loopMV time (smaller than one means BLAS is better):
At 1024 it takes more than 10 times longer than at 1023 or 1025. (From the table in my previous post, n=5 is also a particularly bad spot for multithreaded GEMV.)
The performance hit at size 1024
Blaze requires a separate BLAS to be installed.
As @JeffBezanson notes, Blaze ends up using a BLAS at a low level, as does Armadillo. Other templated C++ linear algebra systems like Eigen create linear algebra classes that do not use the BLAS. In all these cases, however, the objective is to organize vector types, matrix types, and various factorizations in some kind of object-oriented framework so the programmer can write expressions in a natural way and still have them evaluated efficiently.

We are steadily building such a framework in Julia, but with a big difference: Julia provides multiple dispatch, whereas systems based on C++ must build methods into distinct classes. Many linear algebra calculations, like products and solving linear systems, are poster children for multiple dispatch; the decision of how to evaluate an expression should depend on both the left and the right operands. In Julia this is possible. And the core developers have built marvelous parsing rules to provide efficient evaluation of many expressions like `A'B`. Accomplishing this using delayed-evaluation techniques in languages like C++ ends up with some truly horrible-looking code.
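A small illustration of those parsing rules, using the 0.x-era names: `A'B` never materializes the transpose, and is lowered to a single call that dispatch can specialize on both operands:

```julia
A = randn(500, 500); B = randn(500, 500)
C1 = A'B               # lowered to Ac_mul_B(A, B); no copy of A' is formed
C2 = Ac_mul_B(A, B)    # the same call, written explicitly
C1 == C2               # true
```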
Frankly speaking, C++ template libraries have their advantages over Julia (at least at this moment). For example, C++ template libraries can achieve lazy evaluation with zero runtime overhead, even for very complex expressions: a correctly implemented C++ template library can evaluate a complex vectorized expression in a single pass, without temporaries. The way they achieve this is template-based generic programming (OOP is not actually used a lot in such libraries), so they can achieve multiple dispatch in a static sense. Such libraries can be tricky to implement, and C++ meta-programming code can be quite difficult to follow. However, the client code that uses the library is usually concise and friendly. Having said that, Julia does have many other advantages.
I would point out that all of that can also be done in Julia just as easily.
@lindahua I was banking on your
This is similar to OpenMathLib/OpenBLAS#103.
As a concrete example of what Stefan was talking about (it just came up in my own coding): lazy evaluation works great for simple cases but runs into trouble quickly; compare something like the sketch below.
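One way this bites, sketched with a toy lazy sum type (an illustration of the general pitfall, not Tim's original example): any consumer that indexes the lazy operand more than once silently repeats work.

```julia
immutable LazySum{T}                   # `struct` on Julia 0.6 and later
    a::Vector{T}
    b::Vector{T}
end
Base.getindex(s::LazySum, j::Int) = s.a[j] + s.b[j]   # recomputed on every access

# Computing A*(b + c) against a lazy sum: x[j] is re-evaluated once per
# row of A, so the n additions are performed m times instead of once.
function mv(A::Matrix{Float64}, x)
    m, n = size(A)
    y = zeros(m)
    for i in 1:m, j in 1:n
        y[i] += A[i, j] * x[j]
    end
    return y
end
```

Here `mv(A, LazySum(b, c))` gives the right answer, but an expression-template library needs an explicit rule to materialize `b + c` first, or it quietly does redundant work.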
@timholy You are right that special care needs to be taken for such cases. A sophisticated matrix library typically comes with a large set of specialized template functions that dispatch the computation based on the type of the expression. Since expressions are constructed recursively, they are not always lazily evaluated; one can write specialized rules to evaluate parts of an expression earlier than others. Having said that, I agree that things can get quite complicated. C++ Boost has a library, Proto, which allows you to express complex mappings between expressions and computations in a relatively concise and systematic way. The C++ library NT2 (much more comprehensive than Eigen, Armadillo, or Blaze) uses Proto to try to map all kinds of expressions that one may ever conceive to the most efficient computation routines. The cost is that the library becomes very complex, with several hundred thousand lines of code (probably 3x-4x larger than the Julia codebase), and the compilation cost is high: the compiler needs to work through all the rules (expressed in template meta-programming) to actually emit the code.
I am not suggesting we take such an approach in Julia. I agree with @StefanKarpinski that this would become brittle and overly complex. Instead, I think the compiler optimization work that Jeff is doing may achieve similar goals in a cleaner way. I am looking forward to that.
My hypothesis is that most of the performance overhead can be eliminated without lazy evaluation via:
I'm not positive about this, but I suspect that these three improvements, together with other compiler work, will get us to where we need to be without losing the nice, clean, eager dynamic semantics that we currently have.
My opinion is that you are all forgetting what this issue is about. We need to make sure that basic BLAS and LAPACK functions work as fast as possible (or at least not as slowly as some of them do now). Julia does not need a very complicated linear algebra framework like the above-mentioned Blaze, Proto, NT2, or Armadillo. Small optimizations like those mentioned by @StefanKarpinski are what will make Julia's linear algebra fast and flexible.
Sorry about the earlier mess; I responded by email rather than writing here. @StefanKarpinski suggested "Better syntax and support for in-place operations in the cases where you really can't afford to use a temporary," and my response was: I was going to suggest this in a more general issue regarding further enhancements of the linear algebra methods but, since you brought it up, I suggest adding in-place (output-overwriting) methods for these multiplication functions. If this seems reasonable I will begin implementation.
@vharavy I agree that the point of this issue is getting lost in discussion. As @xianyi mentioned, the OpenBLAS issue with threading is known, and the cut-offs for using multiple threads need to be improved in OpenBLAS. If threads are disabled, you will get worse performance on larger problems. For the time being, just as we have a Julia implementation for small matrix-matrix multiplication, we could extend that to small matvecs as well (see the sketch below).
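A sketch of what that extension could look like, by analogy with the existing small-matmul special cases (illustrative code, not what Base actually contains):

```julia
# Fully unrolled matvec for tiny fixed sizes; no loops, no BLAS call.
matvec2x2(A, x) = [A[1,1]*x[1] + A[1,2]*x[2],
                   A[2,1]*x[1] + A[2,2]*x[2]]

matvec3x3(A, x) = [A[1,1]*x[1] + A[1,2]*x[2] + A[1,3]*x[3],
                   A[2,1]*x[1] + A[2,2]*x[2] + A[2,3]*x[3],
                   A[3,1]*x[1] + A[3,2]*x[2] + A[3,3]*x[3]]
```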
@dmbates, or you could write the three-argument version.
@timholy For the most common cases, that would do. This is related to an issue or thread by @johnmyleswhite regarding the names of mutating function variants, like the three-argument version of Ac_mul_B, but I can't find the issue now. I also had the impression that @lindahua mentioned something like this in his announcement of the NumericFunctors package, but I can't find that either. (The google is not strong in you today, my young apprentice.) For functions in base Julia, at least, there should be consistent naming conventions for the versions that mutate an input argument and those that overwrite an output argument in place. So for the case of `Ac_mul_B`, the output-overwriting version would be the three-argument `Ac_mul_B!(C, A, B)`.
I think that could be applied consistently to many of the other numerical linear algebra functions. Cc: @ViralBShah, @andreasnoackjensen
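Concretely, the convention under discussion looks like this (these three-argument names did end up in 0.x-era Base): the `!` marks in-place, and the overwritten output comes first.

```julia
A = randn(100, 100); x = randn(100); y = zeros(100)
A_mul_B!(y, A, x)      # y <- A*x, overwriting y; no temporary allocated
Ac_mul_B!(y, A, x)     # y <- A'x
```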
One small observation: when chatting with @stevengj yesterday, we discovered that OpenBLAS appears to use OpenMP's static scheduler in some core driver loop. Simply changing the scheduler directive to `schedule(auto)` gives a measurable speedup on small matrices:

```julia
julia> y=x=randn(10,10); @time begin for i=1:10^6; y=x*x; end; end; y[1] # with schedule(static), as in v0.2.8
elapsed time: 3.593052362 seconds (928000000 bytes allocated)

julia> y=x=randn(10,10); @time begin for i=1:10^6; y=x*x; end; end; y[1] # with schedule(auto) / v0.2.8
elapsed time: 2.638416476 seconds (928000000 bytes allocated)

julia> versioninfo()
Julia Version 0.3.0-prerelease+2145
Commit a9949c6* (2014-03-22 04:06 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.1.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
```
I recently found that complex matrix multiplication gives wrong results, even with a 4x4 matrix, on a machine with Windows 7 Pro running on an AMD FX(tm)-8320 eight-core processor. The error appears in both Julia 0.2.1 and 0.3-dev. See issue JuliaLang/LinearAlgebra.jl#117.
Can you give an example? I'm really disturbed by what seem to be a lot of cases of OpenBLAS giving incorrect answers lately.
Crazy idea. Anyone want to try incorporating https://github.com/flame/blis and https://github.com/flame/libflame as alternate BLAS/LAPACK backends?
That is a lot of work. Is it BLAS-compatible? How does the performance compare?
Read their FAQs. The major issue seems to be Windows support. It does have a BLAS interface.
I would not expect it to be a quick job, no. More like an up-for-grabs experiment if someone's feeling adventurous. According to their publications (http://www.cs.utexas.edu/users/flame/pubs/BLISTOMS.pdf), serial performance is comparable, but it's not clear to me whether they've implemented multithreading yet. Edit: if the FAQ entry about not supporting building as a shared library still applies, then it may be a non-starter for now.
In the thread I mentioned, I stressed that I have 2 machines with Windows 7 64-bit, and the error appears only on the AMD one. Also, this happens only with complex matrices, not with real matrices. If someone has a similar setup (Windows + AMD 8-core), it would be helpful to know whether they can reproduce it. Below is the code and a sample result (I generate random matrices, so the numbers vary from run to run):

```julia
######## "*" OF RANDOM MATRICES HIGHLIGHTING THE ERROR ##########
# Reconstruction of a garbled comment; the full original script is in
# JuliaLang/LinearAlgebra.jl#117. LU-factorize a random matrix, then
# compare the BLAS product L*U against an explicitly looped product.
N = 4

M = rand(N, N)                        # real case: no error observed
L, U, p = lu(M)
LoopLU = zeros(Float64, N, N)
# Product of L*U is looped
for i in 1:N, j in 1:N, k in 1:N
    LoopLU[i, j] += L[i, k] * U[k, j]
end
println(maximum(abs(L*U - LoopLU)))   # tiny everywhere

MC = complex(rand(N, N), rand(N, N))  # complex case: wrong on the AMD box
L, U, p = lu(MC)
LoopLUc = zeros(Complex{Float64}, N, N)
# Product of L*U is looped
for i in 1:N, j in 1:N, k in 1:N
    LoopLUc[i, j] += L[i, k] * U[k, j]
end
println(maximum(abs(L*U - LoopLUc)))  # large on the affected machine

########## RESULT in AMD 8320, julia 0.3-dev ####
# p = [3,4,2,1]
```
This issue is something I noticed while developing StaticArrays. For very small sizes, there is a lot of overhead coming from Julia, let alone BLAS. For instance, there is specialized 2x2 and 3x3 matrix multiplication code, but after a very quick look through the code, a call passes through several functions, repeatedly checking size matching, checking the character (!!) that describes whether it's transposed, and so on, before it gets to the evaluation. A lot of this run-time overhead could be eliminated. @StefanKarpinski wrote:
That sounds pretty cool, too!
Sorry to resurrect an old thread, but I am also seeing performance issues for small Mat*Vec operations. Is there a fix on the horizon for this? I just want to know the best way to solve the problem; I am happy to write my own optimized version if needed. I am on the Julia 0.6 release, using OpenBLAS.
Please ask questions at discourse.julialang.org.
@andreasnoack Understood. Generally I would, but it was related to this thread so I asked it here.
Use StaticArrays.jl (they also have mutable versions).
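A quick example (requires the StaticArrays.jl package):

```julia
using StaticArrays
A = @SMatrix randn(3, 3)
x = @SVector randn(3)
y = A * x              # unrolled at compile time; no BLAS call, no heap allocation
m = MMatrix{3,3}(A)    # the mutable variant, for in-place updates
```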
StaticArrays.jl is the only real solution for small matvecs, as @KristofferC points out above. I am closing this one, since it is unlikely OpenBLAS will be able to do very much about it, and even if it does, StaticArrays.jl will still be much better.
I will note that I don't quite agree with that, Viral; there are problems here we can feasibly fix on the Julia side.
My main reason for closing this issue was that things won't change in OpenBLAS. Perhaps we should rename this issue to focus on the Julia-side overhead.
BTW, while matrix-matrix multiply has a deep dispatch pattern, I believe matrix-vector is not so deep. |
Consider the two following tests. The first one uses Julia's `A_mul_B` (OpenBLAS's `dgemv`); the second one implements matrix-vector multiplication itself. The results (with `count = 100 000`) are really sad; a sketch of the comparison is below.

P.S. I ran the same test using the `dgemv` that comes with MKL, and the results are even sadder.
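A minimal sketch of the two tests being compared (illustrative sizes, using 0.x-era names; not the original benchmark code):

```julia
n = 10
A = randn(n, n); x = randn(n); y = zeros(n)
count = 100000

@time for k in 1:count          # first test: the OpenBLAS dgemv path
    y = A * x
end

@time for k in 1:count          # second test: hand-written matvec
    fill!(y, 0.0)
    for j in 1:n, i in 1:n
        y[i] += A[i, j] * x[j]
    end
end
```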
that comes with MKL and results are even sadder:The text was updated successfully, but these errors were encountered: