
Poor performance of OpenBLAS matrix-vector multiplication on small sizes #3239

Closed
vharavy opened this issue May 29, 2013 · 44 comments
Labels: performance (Must go faster), upstream (The issue is with an upstream dependency, e.g. LLVM)

Comments

@vharavy
Contributor

vharavy commented May 29, 2013

Consider the following two tests. The first one uses Julia's A_mul_B (OpenBLAS's dgemv):

function test1(n::Int, count::Int)
    A = rand(n, n)
    x = rand(n)
    y = zeros(n)
    tic()
    for i = 1 : count
        A_mul_B(y, A, x)
    end
    toq()
end

The second one implements matrix-vector multiplications itself:

function naive_A_mul_B(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    for i = 1 : m
        y[i] = 0.0
        for j = 1 : n
            y[i] += A[i, j] * x[j]
        end
    end
end

function test2(n::Int, count::Int)
    A = rand(n, n)
    x = rand(n)
    y = zeros(n)
    tic()
    for i = 1 : count
        naive_A_mul_B(y, A, x)
    end
    toq()
end

The results are (count = 100 000)

N          OpenBLAS    Naive
4          0.035218    0.005754
5          0.641212    0.008684
8          0.650700    0.019424
9          0.871438    0.024323
16         1.079199    0.077812
17         1.249294    0.085496
32         2.009289    0.322948
33         2.051556    0.339220
64         2.051730    1.342379
65         2.078451    1.416710

This is really sad.

P.S. I ran the same test using dgemv that comes with MKL and results are even sadder:

N          OpenBLAS    Naive       MKL
4          0.026248    0.005759    0.011873
5          0.682821    0.008710    0.012759
8          0.703205    0.022573    0.013530
9          0.918482    0.024350    0.015054
16         1.148282    0.075867    0.020143
17         1.307566    0.085483    0.023182
32         2.035381    0.318547    0.040268
33         2.050683    0.340855    0.047492
64         2.114853    1.342950    0.142106
65         2.101495    1.418431    0.160844
@JeffBezanson
Member

See also #2489. We could handle small matvecs on our own, but this should probably also be filed as an OpenBLAS bug.

@timholy
Member

timholy commented May 30, 2013

I'm quite surprised by how large an N this extends to. I can partially confirm this, but on my machine OpenBLAS does much better:

 N    OpenBLAS      Naive
 4    0.029808847   0.008809609
 5    0.123966276   0.012788681
 8    0.118868987   0.028228581
 9    0.121300932   0.035596733
16    0.12656238    0.111826848
17    0.125384132   0.126661398
32    0.175371494   0.452407687
33    0.189356229   0.52830851
64    0.316179272   1.889909086
65    0.274108522   2.000223288

OpenBLAS is very architecture-specific.

@johnmyleswhite
Member

On my machine, I get performance more like Tim's:

 N    OpenBLAS   Naive
 4    0.021467   0.004708
 5    0.065906   0.006636
 8    0.064675   0.015061
 9    0.064371   0.022904
16    0.071695   0.058209
17    0.066378   0.066214
32    0.079359   0.229914
33    0.079927   0.252575
64    0.148048   1.032274
65    0.116915   1.048045

@ViralBShah
Member

Is this with OPENBLAS_NUM_THREADS=1?

@ViralBShah
Member

Perhaps we should have some general infrastructure for such cases, to handle the changeover from native Julia implementations to library implementations.

Kind of like multiple dispatch but based on problem size.

@StefanKarpinski
Member

Yes! We can call it an "if statement".

@vharavy
Contributor Author

vharavy commented May 30, 2013

Setting OPENBLAS_NUM_THREADS=1 produces much more likeable results:

N          OpenBLAS    Naive       MKL
4          0.029357    0.005750    0.011478
5          0.029628    0.008686    0.012789
8          0.030499    0.019367    0.013001
9          0.022988    0.024703    0.015096
16         0.029898    0.075955    0.019873
17         0.032563    0.085504    0.022977
32         0.059854    0.318140    0.041384
33         0.073469    0.340058    0.047291
64         0.180811    1.343092    0.145430
65         0.190066    1.418690    0.161348

@ViralBShah We could, but this is really OpenBLAS's problem. MKL also uses threading, but it handles small matrices gracefully.

I suggest we analyze the performance of OpenBLAS on Windows a little further, and if this behavior is a pattern we can disable threading entirely (at least on Windows). Can anybody check whether this problem exists on Linux/macOS?

@bsxfan

bsxfan commented May 30, 2013

On my (oldish, non-hyperthreading) machines, running fresh 64-bit Julia builds on Windows and Linux, OpenBLAS gemv does not look too good either. Below is a slightly modified timing table, where I have replaced naive_A_mul_B with a cache-friendly version that iterates for j = 1:n, i = 1:m; y[i] += A[i,j] * x[j]; end. Also, I have normalized the timings by dividing by n^2, and I have added a gemm-based MV multiplier. The number of OpenBLAS threads is in brackets.

The code I used is here: https://github.com/bsxfan/ad4julia/blob/master/timegemv.jl. (Compared to @vharavy's test code, I think my normalized timing for n=4 suffers a bit from the overhead of deciding which of my MV functions to call, but for larger n, the division by n^2 makes this constant overhead insignificant.)
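For readers who don't follow the link, a minimal sketch of the cache-friendly kernel described above (the function name is mine, not from the linked file):

function colmajor_A_mul_B(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64})
    m, n = size(A)
    fill!(y, 0.0)
    # Outer loop over columns so A is traversed in column-major (memory) order.
    for j = 1:n
        for i = 1:m
            y[i] += A[i, j] * x[j]
        end
    end
    return y
end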

For my machines at least, I would conclude that you don't want to be using any OpenBLAS-based matrix-vector multiplication for n < 60 or so; for larger n, multithreading doesn't hurt, but it doesn't help very much either. (For big matrix-matrix multiplications, however, the multithreading does help.)

Windows (2 cores):

   n   loop        gemv(1)   gemm(1)   gemv(2)     gemm(2) 
   4   16.85375   21.42875   49.59875   22.39188   45.50625
   5   10.63240   23.42240   37.59880  456.27040   38.06080
   8    7.52422    7.34344   15.10844  203.75234   13.36281
   9    5.23160    6.08765   12.31802  170.88185   12.41309
  16    3.95770    2.25723    4.51445   45.70129    4.30375
  17    3.82567    2.13277    4.19889   23.11398    3.82567
  32    2.88925    0.89160    1.76063   12.90381    1.91864
  33    2.84769    0.74994    1.82534   11.23155    1.76522
  64    2.86197    0.60945    1.65530    3.25981    1.60263
  65    4.16052    0.63734    1.59108    3.24143    1.55005
 128    2.86832    0.53115    1.42252    1.18128    1.44509
 129    2.93028    0.52665    1.43019    1.06581    1.56538
 512    2.84102    0.44971    1.37967    0.32553    1.37985
 513    2.84649    0.46067    1.37898    0.33463    1.36106
1024    3.16857    1.47960    1.97020    1.49662    2.03573
1025    3.14996    1.49318    2.18105    1.44015    2.12832
2048    3.13348    1.62384    2.15700    1.54804    2.13653
2049    3.13667    1.61306    2.25749    1.58398    2.30797

Linux (8 cores):

   n   loop        gemv(1)   gemm(1)   gemv(8)     gemm(8) 
   4   18.57250   21.28250   50.72438   23.55313   45.28875
   5   10.79400   12.19760   32.91320  136.93360   30.87360
   8    5.85875    4.84844   12.38969   51.08484   12.40016
   9    5.24605    4.10593   10.21679   48.62457   10.34642
  16    3.50402    1.60164    3.79465   17.43223    4.03980
  17    3.42273    1.54747    3.62574   15.74512    3.74419
  32    2.91198    0.77287    1.77106    6.30886    1.72822
  33    2.92479    0.74156    1.70529    5.43544    1.69363
  64    2.83866    0.53758    1.60033    1.63550    1.58147
  65    3.00864    0.57960    1.52095    1.56001    1.53420
 128    2.88911    0.53451    1.42644    0.53462    1.39886
 129    2.88510    0.53506    1.45560    0.50755    1.46066
 512    2.89669    0.45824    1.42252    0.17840    1.40695
 513    2.88091    0.47314    1.40404    0.17443    1.39493
1024    3.32143    1.88073    2.53437    1.81940    2.49573
1025    3.37001    1.90204    2.68743    0.13025    2.66368
2048    3.48342    2.06975    2.83885    1.61386    2.84011
2049    3.40828    2.08051    2.88406    1.62771    2.88558

@ViralBShah
Member

Cc: @xianyi

@bsxfan

bsxfan commented May 30, 2013

For a moment I thought my timings here were in contrast with this post (https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/s3_azLP6oyY), where I compared Julia's sum(A,2) with GEMV. There, for A of size n x n with n=1000, I found GEMV with 8 threads gave a 20-fold performance gain. On closer inspection, multithreaded GEMV seems to have this sweet spot for medium-sized n, on my machine roughly 200 < n < 1500; after about 1500, the advantage deteriorates to just a factor of 2 (similar to single-threaded GEMV).

Curiously, exactly at n=1024 there is a bad spot. Using my test code referred to above, I compare multithreaded gemv time relative to loopMV time (smaller than one means blas is better):

julia> for n=1020:1:1026;println(n,": ",test(gemvMV,n,100,8)/test(loopMV,n,100,1)) end
1020: 0.04129974380460607
1021: 0.03499813589895558
1022: 0.040023128676005285
1023: 0.03524801930325523
1024: 0.48713279239389196
1025: 0.03455542414148307
1026: 0.037994575168908744

At 1024 it takes more than 10 times longer than at 1023 or 1025.

(From the table in my previous post, n=5 is also a particularly bad spot for multithreaded GEMV.)

@lindahua
Contributor

The performance hit at size 1024 doesn't seem to occur with MKL.

@lsorber

lsorber commented Jun 17, 2013

Blaze looks like a very fast alternative to OpenBLAS.

@JeffBezanson
Member

Blaze requires a separate BLAS to be installed.

@dmbates
Member

dmbates commented Jun 17, 2013

As @JeffBezanson notes, Blaze ends up using BLAS at a low level, as does Armadillo. Other C++ templated linear algebra systems like Eigen create linear algebra classes that do not use the BLAS. In all these cases, however, the objective is to organize vector types, matrix types and various factorizations in some kind of an object-oriented framework so the programmer can write expressions in a natural way and still have them evaluated efficiently.

We are steadily building such a framework in Julia, but with a big difference: Julia provides multiple dispatch, whereas systems based on C++ must build methods into distinct classes. Many linear algebra calculations, like products and solving linear systems, are poster children for multiple dispatch. The decision of how to evaluate the expression should depend on both the left and the right operands. In Julia this is possible.

And the core developers have built marvelous parsing rules to provide efficient evaluation of many expressions like A'B. Accomplishing this using delayed evaluation techniques in languages like C++ ends up with some truly horrible-looking code.

@lindahua
Contributor

Frankly speaking, C++ template libraries have their advantages over Julia (at least at this moment).

For example, C++ template libraries can achieve lazy evaluation with zero runtime overhead even for very complex expressions. A correctly implemented C++ template library can evaluate a complex vectorized expression like r = sum(x.^2 + y.^2 + sin(z).^2) without creating any intermediate arrays, and map the computation to tight SIMD loops.

The way they achieve this is template-based generic programming (OOP is not actually used much in such libraries), so they can achieve multiple dispatch in a static sense. Such libraries can be tricky to implement, and C++ meta-programming code can be quite difficult to follow. However, the client code that uses the library is usually concise and friendly.

Having said that, Julia does have many other advantages.
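(As a point of reference, the temporary-free evaluation described above corresponds to writing the expression as a plain loop in Julia; a minimal sketch, with an illustrative function name:)

# sum(x.^2 + y.^2 + sin(z).^2) written as an explicit loop,
# so no intermediate arrays are allocated.
function fused_sum(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    s = 0.0
    @inbounds for i = 1:length(x)
        s += x[i]^2 + y[i]^2 + sin(z[i])^2
    end
    return s
end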

@StefanKarpinski
Member

I would point out that all of that can also be done in Julia just as easily as in C++, but we've chosen not to follow that route with the core Julia types because of the complexity of implementation, and because in my experience it ends up being brittle and hard to understand, in the sense that everything is cool until it isn't – and then you're kind of screwed. We need to figure out how to deal with this better in Julia without giving up the simplicity and transparency that we've maintained so far.

@ViralBShah
Member

@lindahua I was banking on your Devectorize package to deal with complex vectorized expressions. This is becoming a bottleneck in some of the stuff I want to do as well. I really wish we were able to do such things in base Julia, and some of that, such as SIMD, is on the compiler optimization roadmap (#3440).

@xianyi

xianyi commented Jun 18, 2013

This is similar to OpenMathLib/OpenBLAS#103.

@timholy
Member

timholy commented Jun 18, 2013

As a concrete example of what Stefan was talking about (just came up in my own coding), lazy evaluation works great for simple cases but runs into trouble quickly. Compare something like r = A*x+b vs. r = A*(x+b) using preallocated output r. The first is pretty easy, but for the second you're best off introducing a temporary to store the result of x+b unless you're willing to destroy x. (Vector operations are non-aliasing, but matrix multiplication aliases, so in the second case you can't use r as a temporary).
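A minimal sketch of that point (the helper name is mine; the in-place multiply is mul! in current LinearAlgebra):

using LinearAlgebra

# r = A*(x + b) needs a temporary for x + b: the multiply reads its input
# while writing its output, so r cannot double as scratch space.
function a_times_x_plus_b!(r, tmp, A, x, b)
    @inbounds for i in eachindex(x)
        tmp[i] = x[i] + b[i]   # elementwise, no aliasing hazard
    end
    mul!(r, A, tmp)            # r = A * tmp
    return r
end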

@lindahua
Contributor

@timholy You are right that special care needs to be taken for such cases. A sophisticated matrix library typically comes with a large set of specialized template functions that dispatch the computation based on the type of the expression. If expressions are constructed recursively, A * x + b and A * (x + b) will be of different types (e.g. BinaryExpr<Add, BinaryExpr<Times, Matrix, Vector>, Vector> and BinaryExpr<Times, Matrix, BinaryExpr<Add, Vector, Vector>>).
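A toy Julia rendering of the expression-tree idea (types and names here are illustrative, not from any of the libraries mentioned):

# Each operation builds a typed node instead of evaluating, so A*x + b and
# A*(x + b) end up as different types that evaluation rules can dispatch on.
struct BinaryExpr{Op,L,R}
    lhs::L
    rhs::R
end
plus(a, b)  = BinaryExpr{:+, typeof(a), typeof(b)}(a, b)
times(a, b) = BinaryExpr{:*, typeof(a), typeof(b)}(a, b)

# plus(times(A, x), b) and times(A, plus(x, b)) now have distinct types,
# so specialized evaluation rules can be attached to each shape.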

The expressions are not always lazily evaluated; one can write specialized rules to evaluate part of the expression earlier than the rest.

Having said that, I agree that things can get quite complicated. C++ Boost has a library, Proto, which allows you to express complex mappings between expressions and computations in a relatively concise and systematic way.

The C++ library NT2 (much more comprehensive than Eigen, Armadillo, or Blaze) uses Proto to try to map every expression one might conceive of onto the most efficient computation routines. The cost is that the library becomes very complex, with several hundred thousand lines of code (probably 3x - 4x larger than the Julia codebase), and the compilation cost is high -- the compiler needs to work through all the rules (expressed in template meta-programming) to actually emit the code.

@lindahua
Contributor

I am not suggesting to take such approach in Julia. I agree with @StefanKarpinski that this will become brittle and overly complex.

Instead I think the compilation optimization stuff that Jeff is working on may achieve similar goals in a cleaner way. I am looking forward to that.

@StefanKarpinski
Member

My hypothesis is that most of the performance overhead can be eliminated without lazy evaluation via:

  1. Non-copying array slices: make array slices views instead of copies; requires fast linear indexing, which I've posted about elsewhere but haven't made much progress on since then (a small sketch follows after this list).
  2. Automatic free insertion in loops: if we can know when a temporary created in a loop can be immediately freed at the end of the loop body, then the next iteration can reuse that memory.
  3. Better syntax and support for in-place operations in the cases where you really can't afford to use a temporary.

I'm not positive about this, but I suspect that these three improvements together with other compiler work will get us to where we need to be without losing the nice clean eager, dynamic semantics that we currently have.
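A small sketch of point 1 (view exists in current Julia; at the time of this comment, slices always copied):

A = rand(1000, 1000)
col_copy = A[:, 1]        # allocates a fresh Vector
col_view = view(A, :, 1)  # no copy; reads and writes go straight through to A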

@vharavy
Contributor Author

vharavy commented Jun 18, 2013

My opinion is that you are all forgetting what this issue is about. We need to make sure that basic BLAS and LAPACK functions work as fast as possible (or at least not as slowly as some of them do now).

Julia does not need a very complicated linear algebra framework like the above-mentioned Blaze, Proto, NT2, or Armadillo. Small optimizations like those mentioned by @StefanKarpinski are what will make Julia's linear algebra fast and flexible.

@dmbates
Member

dmbates commented Jun 18, 2013

Sorry about the earlier mess; I responded by email rather than writing here.

@StefanKarpinski suggested "Better syntax and support for in-place operations in the cases where you really can't afford to use a temporary." and my response was

I was going to suggest this in a more general issue regarding further enhancements of linear algebra methods but, since you brought it up ...

I suggest adding methods for solve! to the factorization types and special matrix types. A solve! method is an in-place solve. That is, it overwrites the second argument (the right-hand side of the system) with the solution.

If this seems reasonable I will begin implementation.

@ViralBShah
Member

@vharavy I agree that the point of this issue is getting lost in the discussion. As @xianyi mentioned, the OpenBLAS issue with threading is known, and the cut-offs for using multiple threads need to be improved in OpenBLAS. If threads are disabled, you will get worse performance on larger problems.

For the time being, just as we have a Julia implementation for small matrix multiplication, we could extend that to small matvecs as well.
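A hedged sketch of what that could look like (the function name and cutoff are illustrative, not what Base does):

using LinearAlgebra

# Use a native loop for small matvecs and hand larger ones to BLAS.
function small_matvec!(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64};
                       cutoff::Int = 64)
    m, n = size(A)
    if n <= cutoff
        fill!(y, 0.0)
        @inbounds for j = 1:n, i = 1:m
            y[i] += A[i, j] * x[j]
        end
        return y
    end
    return BLAS.gemv!('N', 1.0, A, x, 0.0, y)   # y = A*x via the BLAS kernel
end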

@timholy
Member

timholy commented Jun 19, 2013

@dmbates, or you could write the three-argument version solve!(output, A, b).

@dmbates
Member

dmbates commented Jun 20, 2013

@timholy For the most common cases where A is a factorization of a square matrix the three-argument version would be equivalent to solve!(A, copy!(output, b)). I guess it is a matter of taste which version is preferable.

This issue is related to an issue or thread by @johnmyleswhite regarding the names of mutating function variants like the three argument version of Ac_mul_B, but I can't find the issue now. I also had the impression that @lindahua mentioned something like this in his announcement of the NumericFunctors package, but I can't find that either. (The google is not strong in you today, my young apprentice.)

For functions in base Julia, at least, there should be consistent naming conventions for the versions that mutate an input argument and those that overwrite an output argument in place. So for the case of solve:

  • solve(A,b) returns a freshly-allocated array
  • solve!(A,b) overwrites b with the solution, returning the mutated version
  • solve!(output, A, b) overwrites output with the solution

I think that could be applied consistently to many of the other numerical linear algebra functions.

cc: @ViralBShah, @andreasnoackjensen
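A sketch of how that convention could look for an LU factorization (solve is a hypothetical name, not in Base; the in-place backslash is ldiv! in current LinearAlgebra):

using LinearAlgebra

solve(F::LU, b::AbstractVector)  = ldiv!(F, copy(b))    # fresh result
solve!(F::LU, b::AbstractVector) = ldiv!(F, b)          # overwrite b with the solution
solve!(out::AbstractVector, F::LU, b::AbstractVector) = ldiv!(F, copyto!(out, b))  # overwrite out

# usage: F = lu(rand(4, 4)); b = rand(4); x = solve(F, b)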

@jiahao
Member

jiahao commented Mar 22, 2014

One small observation: when chatting with @stevengj yesterday, we discovered that OpenBLAS appears to use OpenMP's static scheduler in some core driver loop. Simply changing the scheduler directive to schedule(auto) and recompiling resulted in a 35% speedup for a very simple test problem:

julia> y=x=randn(10,10); @time begin for i=1:10^6; y=x*x; end; end; y[1] #With schedule(static) as in v0.2.8
elapsed time: 3.593052362 seconds (928000000 bytes allocated)

julia> y=x=randn(10,10); @time begin for i=1:10^6; y=x*x; end; end; y[1] #With schedule(auto) / v0.2.8
elapsed time: 2.638416476 seconds (928000000 bytes allocated)

julia> versioninfo()
Julia Version 0.3.0-prerelease+2145
Commit a9949c6* (2014-03-22 04:06 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.1.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

@jasax

jasax commented Jun 1, 2014

Recently found that complex matrix multiplication gives wrong results, even with a 4x4 matrix, on a machine with Windows 7 Pro running on an AMD FX(tm)-8320 eight-core processor. The error appears in both Julia 0.2.1 and 0.3-dev.

See issue JuliaLang/LinearAlgebra.jl#117.

@StefanKarpinski
Member

Can you give an example? I'm really disturbed by what seem to be a lot of cases of OpenBLAS giving incorrect answers lately.

@tkelman
Contributor

tkelman commented Jun 1, 2014

Crazy idea. Anyone want to try incorporating https://github.com/flame/blis and https://github.com/flame/libflame as alternate BLAS/LAPACK backends?

@ViralBShah
Member

That is a lot of work. Is it BLAS compatible? How does the performance compare?

@ViralBShah
Member

Read their FAQs. The major issue seems to be Windows support. It does have a BLAS interface.

@tkelman
Contributor

tkelman commented Jun 1, 2014

Would not expect it to be a quick job, no. More like an up-for-grabs experiment if someone's feeling adventurous. According to their publications (http://www.cs.utexas.edu/users/flame/pubs/BLISTOMS.pdf), serial performance is comparable but it's not clear to me whether they've implemented multithreading yet.

Edit: if the FAQ entry about not supporting building as a shared library still applies, then it may be a non-starter for now.

@jasax

jasax commented Jun 1, 2014

In the thread I mentioned (JuliaLang/LinearAlgebra.jl#117) there is some test code. I repeat it below, with a sample output. More information and examples are in that thread.

I shall stress that I have two machines with Windows 7 64-bit, and this only happens on the one that has an AMD 8320 8-core (the other one, with an AMD A8-3850 APU with 4 cores, runs the code OK). Errors happen both with Julia 0.2.1 and 0.3-dev (details in the thread).

Also, this happens only with complex matrices, not with real matrices. Another clue is that I have run the code on the AMD 8320, in Ubuntu 12.10 over VirtualBox, using Ubuntu's Julia 0.2.1, and the code runs OK. This seems to be related to Windows 7 + 8-core AMD, a chip with a not-so-common number of processors (I don't know if, or how, the '*' routine in OpenBLAS uses multi-threading and/or the multiple available cores...)

If someone has a similar setup (Windows + AMD 8-core), it would be interesting to know the results of the test...

Below are the code and a sample result (I generate random matrices; the error is always present). Note that the multiplication seems to work with reals, but fails with complex matrices. I also use lu() to get a couple of matrices, but the error is in the multiplication, not in the factoring. The permutation vectors p from lu() also don't matter. Finally, I compare native matrix multiplication with '*' against a simple triple loop doing the same job. The error always appears with the random matrices, but I spotted the misbehavior while developing a small program for simulating AC circuits (an example output is shown in the thread).

######## "*" OF RANDOM MATRICES HIGHLIGHTING THE ERROR ##########

N = 4
M = rand(Float64, N, N)
L, U, p = lu(M)
LU = L*U
println(p)

LoopLU = zeros(Float64, N, N)

# Product of L*U is looped
for i in 1:N
    for j in 1:N
        s = 0.0
        for k in 1:N
            s += L[i,k]*U[k,j]
        end
        LoopLU[i,j] = s
    end
end
println(LU - LoopLU)   # IT IS ZERO

MC = zeros(Complex{Float64}, N, N)
MC = rand(Float64, N, N) + (0.0+1.0im)*rand(Float64, N, N)   # complex random matrix

L, U, p = lu(MC)
LU = L*U
println(p)
LoopLU = zeros(Complex{Float64}, N, N)

# Product of L*U is looped
for i in 1:N
    for j in 1:N
        s = 0.0 + 0.0im
        for k in 1:N
            s += L[i,k]*U[k,j]
        end
        LoopLU[i,j] = s
    end
end
println(LU - LoopLU)   # GIVES != ZERO!!!

########## RESULT in AMD 8320, julia 0.3-dev ####

[3,4,2,1]
[0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0]
[4,2,3,1]
Complex{Float64}[
 0.0 + 0.0im   0.0 + 0.0im                                         0.010724319095534562 - 0.00436327616604229im      0.05626670997775374 - 0.04487609382080637im
 0.0 + 0.0im   0.0 + 1.1102230246251565e-16im                      0.00031766699802464327 - 0.0007146125364691867im  0.028237510396673327 - 0.02084902211094586im
 0.0 + 0.0im   0.0 + 0.0im                                         0.0719049982965248 - 0.036089242268516175im       0.06891390625263594 - 0.05992628099413577im
 0.0 + 0.0im   -5.551115123125783e-17 + 2.7755575615628914e-17im   0.042979910736967 - 0.016978401082962447im        0.04316866011620124 - 0.020865798633942556im]

@andyferris
Member

andyferris commented Sep 1, 2016

This issue is something I noticed while developing StaticArrays. For very small sizes, there is a lot of overhead coming from Julia, let alone BLAS. For instance, there is specialized 2x2 and 3x3 matrix multiplication code, but from a very quick look through the code, it passes through several functions repeatedly checking that sizes match, checking the character (!!) that describes whether it's transposed or not, etc., before it gets to the evaluation.

A lot of this run-time overhead could be eliminated with e.g. Val{'N'} instead of 'N' for the (lack of) transposition, and by taking care to check the dimensions only once (and possibly even elide the checks where possible via @inbounds, or something). Out of curiosity, is some of this on the roadmap? (e.g. lazy transposition, taking vector transposes slowly seriously, or whatever might force us to re-write the A_mul_Bc interface?)
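A hypothetical sketch of the Val idea (names are illustrative, not Base's): encode the transpose flag as a type parameter, so the branch is resolved by dispatch at compile time rather than by inspecting a Char at run time.

function matvec_impl!(y, ::Val{'N'}, A, x)
    fill!(y, zero(eltype(y)))
    @inbounds for j = 1:size(A, 2), i = 1:size(A, 1)
        y[i] += A[i, j] * x[j]        # y = A*x
    end
    return y
end

function matvec_impl!(y, ::Val{'T'}, A, x)
    fill!(y, zero(eltype(y)))
    @inbounds for j = 1:size(A, 2), i = 1:size(A, 1)
        y[j] += A[i, j] * x[i]        # y = A'*x
    end
    return y
end

# usage: matvec_impl!(y, Val('N'), A, x)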

@StefanKarpinski wrote:

2. Automatic free insertion in loops: if we can know when a temporary created in a loop can be immediately freed at the end of the loop body, then the next iteration can reuse that memory.

That sounds pretty cool, too!

@GravityAssisted

Sorry to resurrect an old thread, but I am also seeing performance issues for small Mat*Vec operations. Is there a fix on the horizon for this? I just want to know what's the best way to solve the problem. I am happy to write my own optimized matvec function, but then I wonder which other native functions I should re-write. Opens up a can of worms in my head :-)
Btw: we are talking 3x3 matrix and 3x1 vector operations, common when working with points in Cartesian space.

I am on the Julia 0.6 release, using OpenBLAS.

julia> versioninfo()
Julia Version 0.6.1-pre.0
Commit dcf39a1 (2017-06-19 13:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

@andreasnoack
Member

Please ask questions at discourse.julialang.org.

@GravityAssisted

@andreasnoack Understood. Generally I would, but it was related to this thread so I asked it here.

@KristofferC
Member

KristofferC commented Jul 11, 2017

Use StaticArrays.jl (they also have mutable versions).
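For the 3x3 case mentioned above, typical usage looks like this (a sketch; assumes the StaticArrays.jl package is installed):

using StaticArrays

A = @SMatrix rand(3, 3)   # size is part of the type, known to the compiler
x = @SVector rand(3)
y = A * x                 # fully unrolled multiply, no BLAS call, no heap allocation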

@ViralBShah
Member

ViralBShah commented May 25, 2018

StaticArrays.jl is the only real solution for small matvecs, as @KristofferC points out above. I am closing this one since it is unlikely OpenBLAS is going to be able to do very much about it, and even if it does, StaticArrays.jl will still be much better.

@andyferris
Member

I will note that I don't quite agree with that, Viral - there are problems here we can feasibly fix in LinearAlgebra, even if the result will be slightly slower than with StaticArrays. E.g. the last time I looked, the dispatch pattern for matrix multiplication was quite deep and resulted in multiple re-checks of sizes and so on. This would have been written differently if we had thought of a native multiplication algorithm for small matrices with an override to call BLAS only for the large cases (the code appeared to me to have the opposite philosophy and/or history - we aim to dispatch to BLAS, with a native/generic algorithm implemented later as an afterthought).

@ViralBShah
Member

My main reason for closing this issue was that things won't change in OpenBLAS. Perhaps we should rename this issue to be about the LinearAlgebra fix rather than the upstream issue, then. I agree that it can at least be made better.

@ViralBShah
Member

BTW, while matrix-matrix multiply has a deep dispatch pattern, I believe matrix-vector is not so deep.
