
WIP: cache oblivious linear algebra algorithms #6690

Merged: 1 commit, May 26, 2014

Conversation

Jutho
Contributor

@Jutho Jutho commented Apr 29, 2014

Following the suggestion of @stevengj in JuliaLang/LinearAlgebra.jl#108, the purpose of this PR is to rewrite some of the linear algebra routines for matrices and vectors (transpose, generic_matmatmul, ...) using cache-oblivious algorithms. The current commit is just a first start; none of the old methods have been replaced yet.

Nevertheless, transposenew! already outperforms transpose! in many cases (and I have not yet seen it being slower). gemmnew! is similar in speed to generic_matmatmul! (sometimes a few percent faster).

Questions that arose while creating this:

  1. I believe the best strategy is to have simple functions, carrying the names of the existing functions, which do some basic size checking and then pass the actual work to specialised recursive functions that should not clutter the namespace. What is the best way to hide them? Inside a new module? A let block is out of the question, as it does not allow recursion without overhead.
  2. For the recursive functions doing matrix-matrix or matrix-vector multiplication, it is in any case necessary to be able to add the result of the multiplication to the output instead of just overwriting it. I think this justifies introducing new methods gemm! and gemv! that follow the BLAS calling convention; generic_matmatmul! and generic_matvecmul! could then simply call these functions with specific values of the scalar coefficients (see the sketch after this list).
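
For concreteness, here is a minimal sketch of the calling convention discussed in point 2. This is not the PR's actual code: only the 'N','N' case is written out, and the recursive cache-oblivious kernel is left out.

# Sketch of the BLAS-style convention: C = alpha * op(A) * op(B) + beta * C.
function gemm!(tA::Char, tB::Char, alpha, A::AbstractMatrix, B::AbstractMatrix,
               beta, C::AbstractMatrix)
    (tA == 'N' && tB == 'N') || error("only the non-transposed case is sketched here")
    m, k = size(A, 1), size(A, 2)
    n = size(B, 2)
    size(B, 1) == k || throw(DimensionMismatch("inner dimensions do not match"))
    (size(C, 1) == m && size(C, 2) == n) || throw(DimensionMismatch("C has the wrong size"))
    if beta == 0
        fill!(C, zero(eltype(C)))        # beta == 0 means overwrite
    elseif beta != 1
        for j = 1:n, i = 1:m
            C[i, j] *= beta
        end
    end
    for j = 1:n, p = 1:k                 # naive accumulation; the PR would
        b = alpha * B[p, j]              # replace this by a recursive kernel
        for i = 1:m
            C[i, j] += A[i, p] * b
        end
    end
    return C
end

# generic_matmatmul! / generic_matvecmul! then reduce to thin wrappers
# with fixed scalar coefficients, e.g.:
generic_matmatmul!(C, tA, tB, A, B) = gemm!(tA, tB, one(eltype(C)), A, B, zero(eltype(C)), C)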

@jiahao
Member

jiahao commented Apr 29, 2014

Thank you so much for tackling this

@Jutho
Contributor Author

Jutho commented Apr 29, 2014

  1. The current transpose! with blocking only exists for arguments of type Matrix with the same element type; all other cases are handled with a naive algorithm and by explicitly creating a new matrix. ctranspose! does not exist at all. generic_matmatmul on the other hand works for all AbstractMatrix.

I think we could have a transpose! and ctranspose! for all AbstractMatrix, or at least StridedMatrix types, even with different element types in source and destination.
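
As a naive (non-blocked) illustration of that generality, something like the following would work for any pair of AbstractMatrix arguments; the name ctranspose_any! and the implementation are hypothetical, not part of this PR:

# Conjugate-transpose A into B; the element types of A and B may differ,
# conversion happens in setindex!. conj is a no-op for real element types.
function ctranspose_any!(B::AbstractMatrix, A::AbstractMatrix)
    (size(B, 1) == size(A, 2) && size(B, 2) == size(A, 1)) ||
        throw(DimensionMismatch("B must have the reversed dimensions of A"))
    for j = 1:size(A, 2), i = 1:size(A, 1)
        B[j, i] = conj(A[i, j])
    end
    return B
end

A cache-oblivious version would replace the double loop with recursive blocking, as sketched further down in this thread.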

@andreasnoack
Member

@Jutho Great work. I think it would be useful if all the matrix multiplication functions followed the BLAS convention.

@Jutho
Contributor Author

Jutho commented Apr 29, 2014

I fully agree (although I have apparently misplaced the tA and tB arguments in my gemmnew!). Note that these arguments are currently not yet being used anyway (I just wanted to check the proof of concept).

@ViralBShah
Member

This is really amazing. Will try it soon.

@mlubin
Member

mlubin commented Apr 30, 2014

Awesome!

@Jutho
Contributor Author

Jutho commented Apr 30, 2014

Well, don't get too enthusiastic too soon. The transpose! code is certainly working very well and is rather insensitive to the size of the base block: I am currently getting equally good results with baselength down to 64 elements, i.e. something like 8x8 blocks, which is much smaller than the maximal size that would fit in the level 1 cache.
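
For reference, the recursive scheme looks roughly like the following sketch (the names are hypothetical and the PR's actual kernel may differ; only the base-block length of 64 elements is taken from the discussion above):

const BASELENGTH = 64   # recurse until a block has at most this many elements

# Transpose the block A[r1:r2, c1:c2] into B by splitting the longer dimension
# in half; the recursion automatically adapts to every level of the cache hierarchy.
function transpose_block!(B::AbstractMatrix, A::AbstractMatrix, r1, r2, c1, c2)
    m, n = r2 - r1 + 1, c2 - c1 + 1
    if m * n <= BASELENGTH                      # base case: plain double loop
        for j = c1:c2, i = r1:r2
            B[j, i] = A[i, j]
        end
    elseif m >= n                               # split the row range
        rm = r1 + div(m, 2) - 1
        transpose_block!(B, A, r1, rm, c1, c2)
        transpose_block!(B, A, rm + 1, r2, c1, c2)
    else                                        # split the column range
        cm = c1 + div(n, 2) - 1
        transpose_block!(B, A, r1, r2, c1, cm)
        transpose_block!(B, A, r1, r2, cm + 1, c2)
    end
    return B
end

transpose_co!(B::AbstractMatrix, A::AbstractMatrix) =
    transpose_block!(B, A, 1, size(A, 1), 1, size(A, 2))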

Matrix multiplication is more difficult. It is only competitive if the blocks are sized so that the level 1 cache is used more or less completely; making them smaller decreases efficiency, so it is not really cache oblivious. For arbitrary strided matrices with all strides bigger than 1, you then still have to copy to preallocated storage, etc. It is apparently known that cache-oblivious matrix multiplication is harder and requires a really good microkernel for the base problem:
http://www.cs.utexas.edu/~pingali/CS395T/2013fa/papers/co.pdf

@stevengj
Member

Yes, matrix multiplication seems to be harder to do well. At the lowest level, the optimal base cases are actually non-square because of the difference in the number of loads and stores.

@matteo-frigo has some experience with optimizing cache-oblivious matrix multiplications and may be able to comment further.

@Jutho
Contributor Author

Jutho commented May 1, 2014

The cache-oblivious gemm! efficiency is starting to improve. It is properly cache oblivious in the sense that the base size is smaller than the L1 cache (16x24 times 24x16 seems like a pretty good choice) for typical sizes of the eltype (i.e. 64 bytes). But I do use some cache information, because I still have to preallocate some buffers for A, B and C that I copy to, in order to make this work efficiently for StridedMatrix.

The current implementation typically beats generic_matmatmul! for the normal Matrix type, although that is not so useful since BLAS.gemm! is of course 10 times faster even when single-threaded. For StridedMatrix the code in generic_matmatmul! still seems to win, although I need to test this further.

Using the following benchmark code:

function timetranspose(D::Int)
    D1=rand(div(D,2):D)
    D2=rand(div(D,2):D)
    A=randn(D1,D2)
    gc()
    B1=zeros(D2,D1)
    t1=@elapsed for i=1:iceil(100000/D);Base.transpose!(B1,A);end
    gc()
    B2=zeros(D2,D1)
    t2=@elapsed for i=1:iceil(100000/D);Base.transposeold!(B2,A);end
    return (t1,t2)
end
timetranspose(20)
for D=[4,10,100,1000,10000]
    println("$(timetranspose(D))")
end
println("-----------------")
blas_set_num_threads(1)
function timegemm(D::Int)
    D1=rand(div(D,2):D)
    D2=rand(div(D,2):D)
    D3=rand(div(D,2):D)
    A=randn(D1,D3)
    B=randn(D3,D2)
    gc()
    C1=zeros(D1,D2)
    t1=@elapsed for i=1:iceil(10000/D);Base.LinAlg.gemm!('N','N',1.0,A,B,0.0,C1);end
    gc()
    C2=zeros(D1,D2)
    t2=@elapsed for i=1:iceil(10000/D);Base.LinAlg.generic_matmatmul!(C2,'N','N',A,B);end
    gc()
    C3=zeros(D1,D2)
    t3=@elapsed for i=1:iceil(10000/D);Base.LinAlg.BLAS.gemm!('N','N',1.0,A,B,0.0,C3);end
    return (t1,t2,t3)
end
timegemm(20)
for D=[4,10,100,1000]
    println("$(timegemm(D))")
end

these are some results:

(0.004326721,0.005906504)
(0.002054528,0.003003527)
(0.006999849,0.006236851)
(0.139079119,0.136908195)
(1.872910761,2.008261756)
-----------------
(0.000640357,0.000687878,0.000569306)
(0.000556341,0.000717156,0.000423223)
(0.023904473,0.02656686,0.002876991)
(3.007262065,3.512027063,0.279214916)

although it is not always this pronounced, and sometimes there is no difference, or the old methods even win slightly.

@ViralBShah
Member

It would be worth checking if using @simd in the base case yields higher performance.
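
For reference, such a base case might look roughly like the following (the kernel name, loop order and use of @inbounds are illustrative, not taken from the PR):

# Base-case kernel computing C += alpha * A * B for small blocks, with
# @simd/@inbounds on the innermost loop over the contiguous (column) index.
function gemm_basecase!(C::Matrix, A::Matrix, B::Matrix, alpha)
    for j = 1:size(B, 2), p = 1:size(A, 2)
        b = alpha * B[p, j]
        @simd for i = 1:size(A, 1)
            @inbounds C[i, j] += A[i, p] * b
        end
    end
    return C
end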

@Jutho
Contributor Author

Jutho commented May 2, 2014

This actually decreases the performance.

I made some small further optimisations to the gemmkernel function. Compared to generic_matmatmul! for normal Matrix types (still with M=N=16, P=24), the run time varies between 80% and 100% (i.e. equal or faster), especially for Float64 elements; for the other element types the speed is typically identical. I do not understand where this difference comes from.

The big problem is that for StridedMatrix (obtained via sub) the run time goes up to about 130-150% of generic_matmatmul!'s (i.e. 30-50% slower), and even by tuning M, N and P so as to use the complete level 1 cache I am not able to improve this very much. I also tried to further optimise my copying step, but with little effect.

@Jutho
Contributor Author

Jutho commented May 2, 2014

After reading some papers on GotoBLAS yesterday, I realised that the best approach for the fallback gemm! for StridedMatrix is possibly to copy subblocks of the size of the level 2 cache (instead of the level 1 cache) to ordinary matrices and then use the BLAS kernels (the so-called GEBP and GEBB), at least if the eltype is a BlasFloat. However, I don't know whether these kernels are just abstractions of what OpenBLAS is actually doing, or whether there are actual corresponding C functions I can call. I was unable to decipher the OpenBLAS code and to find corresponding C functions. Does anybody have a better understanding of this?
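
To make the idea concrete, here is a rough pure-Julia sketch of that packing strategy; the block sizes MC/KC/NC and all names are illustrative guesses, not OpenBLAS's actual GEBP/GEBB kernels:

const MC = 128; const KC = 128; const NC = 64   # hypothetical panel sizes, to be tuned to the L2 cache

# Compute C += A * B for strided inputs by packing panels of A and B into
# contiguous buffers before multiplying them (GotoBLAS-style blocking).
function panel_gemm!(C::StridedMatrix, A::StridedMatrix, B::StridedMatrix)
    m, k = size(A, 1), size(A, 2)
    n = size(B, 2)
    Abuf = Array(eltype(A), MC, KC)             # contiguous packing buffers
    Bbuf = Array(eltype(B), KC, NC)
    for jc = 1:NC:n, pc = 1:KC:k
        jb, pb = min(NC, n - jc + 1), min(KC, k - pc + 1)
        for j = 1:jb, p = 1:pb                  # pack a panel of B
            Bbuf[p, j] = B[pc + p - 1, jc + j - 1]
        end
        for ic = 1:MC:m
            ib = min(MC, m - ic + 1)
            for p = 1:pb, i = 1:ib              # pack a block of A
                Abuf[i, p] = A[ic + i - 1, pc + p - 1]
            end
            for j = 1:jb, p = 1:pb, i = 1:ib    # multiply the packed panels
                C[ic + i - 1, jc + j - 1] += Abuf[i, p] * Bbuf[p, j]
            end
        end
    end
    return C
end

The real payoff comes from replacing the innermost triple loop with a highly tuned microkernel, which is exactly the part that is hard to reproduce in generic Julia code.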

@Jutho
Contributor Author

Jutho commented May 22, 2014

An update on the current status: the cache-oblivious approach works well for transpose / ctranspose. I don't manage to get it to work well for matrix multiplication, and I don't have the expertise or time to improve and experiment with it further. If there is interest in merging just the work on transpose! and leaving generic_matmatmul unaltered for now, I can push a final commit removing transposeold!.

@tkelman
Contributor

tkelman commented May 26, 2014

You may find http://www.eecs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf interesting as far as the state of the art in cache-oblivious matrix multiplication goes.

@ViralBShah
Member

Let's merge the transpose work.

@Jutho
Contributor Author

Jutho commented May 26, 2014

The paper on matrix multiplication seems to be about parallelized matrix multiplication (either on a multicore machine with shared memory or even on a multi-node setup with MPI). This could certainly be interesting for implementing a more efficient matrix multiplication algorithm for SharedMatrix, but it doesn't help my case very much (which corresponds to the case P=1 in algorithm 2).

I will clean up the transpose work.

@tkelman
Contributor

tkelman commented May 26, 2014

Whoops, good point. I thought they were cache-oblivious in the single-node case as well; evidently the base case of one processor just uses MKL there.

@Jutho
Contributor Author

Jutho commented May 26, 2014

I seem to have done something wrong trying to rebase this onto the current master. My git skills are still kind of underdeveloped. Any help with restoring this?

@kmsquire
Member

There may be better or other ways, but assuming you don't mind squashing the commits (which is probably a good idea), this should work:

# from your branch
# create a diff
git diff master > patch.diff
git co master
# move your branch out of the way
git branch -m jh/cacheoblivious jh/cacheoblivious.old
# create and check out a new branch with the same name
git co -b jh/cacheoblivious
# patch the branch
patch -p1 < patch.diff
# double check
git diff master
# commit and push
git commit -a
git push -f YOUR_REPO jh/cacheoblivious

@kmsquire
Member

For safety,
git diff master >> patch.diff
should be
git diff master > patch.diff
(Updated above.)

@kmsquire
Member

Whoops! And the < doesn't belong in the last line:
git push -f YOUR_REPO jh/cacheoblivious

@Jutho
Contributor Author

Jutho commented May 26, 2014

Thanks. I don't fully understand what happened before: according to the compare from jh/cacheoblivious in my repo to JuliaLang/julia master there was also only one file changed (array.jl), so I don't know how all these other commits got included in this pull request. But thanks to @kmsquire that's now fixed.

Now let's wait for Travis and then this should be ready to merge.

stevengj added a commit that referenced this pull request May 26, 2014
WIP: cache oblivious linear algebra algorithms
@stevengj stevengj merged commit 57b18ef into JuliaLang:master May 26, 2014
@ViralBShah
Member

Should this be included in NEWS?

@stevengj
Member

@ViralBShah, for the new ctranspose! function you mean?

@ViralBShah
Member

I was referring to the improved performance aspect.

@stevengj
Member

There are so many performance improvements and bug fixes in 0.3 that I doubt it is worth it to document them individually in the NEWS; it is better for the NEWS to focus on API changes.

@IainNZ
Member

IainNZ commented May 27, 2014

I believe this broke Grid (timholy/Grid.jl#26) and LIBSVM (JuliaML/LIBSVM.jl#5) due to the changes to transpose.

E.g. for Grid.jl we now get

ERROR: no method ctranspose(Array{Array{Float64,1},2})
 in include at boot.jl:244 (repeats 2 times)
 in include_from_node1 at loading.jl:128

and for LIBSVM.jl

ERROR: no method ctranspose(Array{Any,2})
 in include at boot.jl:244
 in include_from_node1 at loading.jl:128

Not sure if this was an intentional breaking change.

@stevengj
Member

I don't think this was intentional. Can you file an issue?

@IainNZ
Member

IainNZ commented May 27, 2014

Done
