
WIP: cache oblivious linear algebra algorithms #6690

Merged: 1 commit, May 26, 2014

Conversation

Jutho
Contributor

@Jutho Jutho commented Apr 29, 2014

Following the suggestion of @stevengj in JuliaLang/LinearAlgebra.jl#108, the purpose of this PR is to rewrite some of the linear algebra routines for matrices and vectors (transpose, generic_matmatmul, ...) using cache-oblivious algorithms. The current commit is just a first start; none of the old methods have been replaced yet.

Nevertheless, transposenew! already outperforms transpose! in many cases (and I have not yet seen it being slower). gemmnew! is similar in speed to generic_matmatmul! (sometimes a few percent faster).

Questions that arose while creating this:

  1. I believe the best strategy is to have simple functions, carrying the names of the existing functions, which do some basic size checking and then pass the actual work to specialised recursive functions that should not clutter the namespace. What is the best way to hide them? Inside a new module? A let block is out of the question, as it does not allow recursion without overhead.
  2. For the recursive functions doing matrix-matrix or matrix-vector multiplication, it is in any case necessary to be able to add the result of the multiplication to the output instead of just overwriting it. I think this justifies introducing new methods gemm! and gemv! that follow the BLAS calling convention; generic_matmatmul! and generic_matvecmul! could then simply call these functions with specific values of the scalar coefficients (see the sketch after this list).
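
For concreteness, here is a minimal sketch of the calling convention discussed in point 2. This is not the PR's actual code: only the 'N','N' case is written out, and the recursive cache-oblivious kernel is left out.

# Sketch of the BLAS-style convention: C = alpha * op(A) * op(B) + beta * C.
function gemm!(tA::Char, tB::Char, alpha, A::AbstractMatrix, B::AbstractMatrix,
               beta, C::AbstractMatrix)
    (tA == 'N' && tB == 'N') || error("only the non-transposed case is sketched here")
    m, k = size(A, 1), size(A, 2)
    n = size(B, 2)
    size(B, 1) == k || throw(DimensionMismatch("inner dimensions do not match"))
    (size(C, 1) == m && size(C, 2) == n) || throw(DimensionMismatch("C has the wrong size"))
    if beta == 0
        fill!(C, zero(eltype(C)))        # beta == 0 means overwrite
    elseif beta != 1
        for j = 1:n, i = 1:m
            C[i, j] *= beta
        end
    end
    for j = 1:n, p = 1:k                 # naive accumulation; the PR would
        b = alpha * B[p, j]              # replace this by a recursive kernel
        for i = 1:m
            C[i, j] += A[i, p] * b
        end
    end
    return C
end

# generic_matmatmul! / generic_matvecmul! then reduce to thin wrappers
# with fixed scalar coefficients, e.g.:
generic_matmatmul!(C, tA, tB, A, B) = gemm!(tA, tB, one(eltype(C)), A, B, zero(eltype(C)), C)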

@jiahao
Member

jiahao commented Apr 29, 2014

Thank you so much for tackling this

@Jutho
Contributor Author

Jutho commented Apr 29, 2014

  1. The current transpose! with blocking only exists for arguments of type Matrix with the same element type; all other cases are handled with a naive algorithm and by explicitly creating a new matrix. ctranspose! does not exist at all. generic_matmatmul on the other hand works for all AbstractMatrix.

I think we could have a transpose! and ctranspose! for all AbstractMatrix, or at least StridedMatrix types, even with different element types in source and destination.
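
As a naive (non-blocked) illustration of that generality, something like the following would work for any pair of AbstractMatrix arguments; the name ctranspose_any! and the implementation are hypothetical, not part of this PR:

# Conjugate-transpose A into B; the element types of A and B may differ,
# conversion happens in setindex!. conj is a no-op for real element types.
function ctranspose_any!(B::AbstractMatrix, A::AbstractMatrix)
    (size(B, 1) == size(A, 2) && size(B, 2) == size(A, 1)) ||
        throw(DimensionMismatch("B must have the reversed dimensions of A"))
    for j = 1:size(A, 2), i = 1:size(A, 1)
        B[j, i] = conj(A[i, j])
    end
    return B
end

A cache-oblivious version would replace the double loop with recursive blocking, as sketched further down in this thread.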

@andreasnoack
Member

@Jutho Great work. I think it would be useful if all the matrix multiplication functions followed the BLAS convention.

@Jutho
Contributor Author

Jutho commented Apr 29, 2014

I fully agree (although I have apparently misplaced the tA and tB arguments in my gemmnew!). Note that these arguments are currently not yet being used anyway (I just wanted to check the proof of concept).

@ViralBShah
Member

This is really amazing. Will try it soon.

@mlubin
Member

mlubin commented Apr 30, 2014

Awesome!

@Jutho
Contributor Author

Jutho commented Apr 30, 2014

Well, don't get too enthusiastic too soon. The transpose! code is certainly working very well and is rather insensitive to the size of the base block: I am currently getting equally good results with baselength down to 64 elements, i.e. something like 8x8 blocks, which is much smaller than the maximal size that would fit in the level 1 cache.
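
For reference, the recursive scheme looks roughly like the following sketch (the names are hypothetical and the PR's actual kernel may differ; only the base-block length of 64 elements is taken from the discussion above):

const BASELENGTH = 64   # recurse until a block has at most this many elements

# Transpose the block A[r1:r2, c1:c2] into B by splitting the longer dimension
# in half; the recursion automatically adapts to every level of the cache hierarchy.
function transpose_block!(B::AbstractMatrix, A::AbstractMatrix, r1, r2, c1, c2)
    m, n = r2 - r1 + 1, c2 - c1 + 1
    if m * n <= BASELENGTH                      # base case: plain double loop
        for j = c1:c2, i = r1:r2
            B[j, i] = A[i, j]
        end
    elseif m >= n                               # split the row range
        rm = r1 + div(m, 2) - 1
        transpose_block!(B, A, r1, rm, c1, c2)
        transpose_block!(B, A, rm + 1, r2, c1, c2)
    else                                        # split the column range
        cm = c1 + div(n, 2) - 1
        transpose_block!(B, A, r1, r2, c1, cm)
        transpose_block!(B, A, r1, r2, cm + 1, c2)
    end
    return B
end

transpose_co!(B::AbstractMatrix, A::AbstractMatrix) =
    transpose_block!(B, A, 1, size(A, 1), 1, size(A, 2))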

Matrix multiplication is more difficult. It is only competitive if the blocks are sized so that the level 1 cache is used more or less completely; making them smaller decreases efficiency, so it is not really cache oblivious. For arbitrary strided matrices with all strides bigger than 1, you then still have to copy to preallocated storage, etc. It is apparently known that cache-oblivious matrix multiplication is harder and requires a really good microkernel for the base problem:
http://www.cs.utexas.edu/~pingali/CS395T/2013fa/papers/co.pdf

@stevengj
Member

Yes, matrix multiplication seems to be harder to do well. At the lowest level, the optimal base cases are actually non-square because of the difference in the number of loads and stores.

@matteo-frigo has some experience with optimizing cache-oblivious matrix multiplications and may be able to comment further.

@Jutho
Contributor Author

Jutho commented May 1, 2014

The cache-oblivious gemm! efficiency is starting to improve. It is properly cache oblivious in the sense that the base size is smaller than the L1 cache (16x24 times 24x16 seems like a pretty good choice) for typical sizes of the eltype (i.e. 64 bytes). But I do use some cache information, because I still have to preallocate some buffers for A, B and C that I copy to, in order to make this work efficiently for StridedMatrix.

The current implementation typically beats generic_matmatmul! for the normal Matrix type, although that is not so useful since BLAS.gemm! is of course 10 times faster even when single-threaded. For StridedMatrix the code in generic_matmatmul! still seems to win, although I need to test this further.

Using the following benchmark code:

function timetranspose(D::Int)
    D1=rand(div(D,2):D)
    D2=rand(div(D,2):D)
    A=randn(D1,D2)
    gc()
    B1=zeros(D2,D1)
    t1=@elapsed for i=1:iceil(100000/D);Base.transpose!(B1,A);end
    gc()
    B2=zeros(D2,D1)
    t2=@elapsed for i=1:iceil(100000/D);Base.transposeold!(B2,A);end
    return (t1,t2)
end
timetranspose(20)
for D=[4,10,100,1000,10000]
    println("$(timetranspose(D))")
end
println("-----------------")
blas_set_num_threads(1)
function timegemm(D::Int)
    D1=rand(div(D,2):D)
    D2=rand(div(D,2):D)
    D3=rand(div(D,2):D)
    A=randn(D1,D3)
    B=randn(D3,D2)
    gc()
    C1=zeros(D1,D2)
    t1=@elapsed for i=1:iceil(10000/D);Base.LinAlg.gemm!('N','N',1.0,A,B,0.0,C1);end
    gc()
    C2=zeros(D1,D2)
    t2=@elapsed for i=1:iceil(10000/D);Base.LinAlg.generic_matmatmul!(C2,'N','N',A,B);end
    gc()
    C3=zeros(D1,D2)
    t3=@elapsed for i=1:iceil(10000/D);Base.LinAlg.BLAS.gemm!('N','N',1.0,A,B,0.0,C3);end
    return (t1,t2,t3)
end
timegemm(20)
for D=[4,10,100,1000]
    println("$(timegemm(D))")
end

these are some results:

(0.004326721,0.005906504)
(0.002054528,0.003003527)
(0.006999849,0.006236851)
(0.139079119,0.136908195)
(1.872910761,2.008261756)
-----------------
(0.000640357,0.000687878,0.000569306)
(0.000556341,0.000717156,0.000423223)
(0.023904473,0.02656686,0.002876991)
(3.007262065,3.512027063,0.279214916)

although it is not always this pronounced, and sometimes there is no difference, or the old methods even win slightly.

@ViralBShah
Member

It would be worth checking if using @simd in the base case yields higher performance.
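
For reference, such a base case might look roughly like the following (the kernel name, loop order and use of @inbounds are illustrative, not taken from the PR):

# Base-case kernel computing C += alpha * A * B for small blocks, with
# @simd/@inbounds on the innermost loop over the contiguous (column) index.
function gemm_basecase!(C::Matrix, A::Matrix, B::Matrix, alpha)
    for j = 1:size(B, 2), p = 1:size(A, 2)
        b = alpha * B[p, j]
        @simd for i = 1:size(A, 1)
            @inbounds C[i, j] += A[i, p] * b
        end
    end
    return C
end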

@Jutho
Contributor Author

Jutho commented May 2, 2014

This actually decreases the performance.

I made some small further optimisations to the gemmkernel function. Compared to generic_matmatmul! for normal Matrix types (still with M=N=16, P=24), the run time varies between 80% and 100% (i.e. equal or faster), especially for Float64 elements; for the other element types the speed is typically identical. I do not understand where this difference comes from.

The big problem is that for StridedMatrix (obtained via sub) the run time goes up to about 130-150% of generic_matmatmul!'s (i.e. 30-50% slower), and even by tuning M, N and P so as to use the complete level 1 cache I am not able to improve this very much. I also tried to further optimise my copying step, but with little effect.

@Jutho
Contributor Author

Jutho commented May 2, 2014

After reading some papers on GotoBLAS yesterday, I realised that the best approach for the fallback gemm! for StridedMatrix is possibly to copy subblocks of the size of the level 2 cache (instead of the level 1 cache) to ordinary matrices and then use the BLAS kernels (the so-called GEBP and GEBB), at least if the eltype is a BlasFloat. However, I don't know whether these kernels are just abstractions of what OpenBLAS is actually doing, or whether there are actual corresponding C functions I can call. I was unable to decipher the OpenBLAS code and to find corresponding C functions. Does anybody have a better understanding of this?
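
To make the idea concrete, here is a rough pure-Julia sketch of that packing strategy; the block sizes MC/KC/NC and all names are illustrative guesses, not OpenBLAS's actual GEBP/GEBB kernels:

const MC = 128; const KC = 128; const NC = 64   # hypothetical panel sizes, to be tuned to the L2 cache

# Compute C += A * B for strided inputs by packing panels of A and B into
# contiguous buffers before multiplying them (GotoBLAS-style blocking).
function panel_gemm!(C::StridedMatrix, A::StridedMatrix, B::StridedMatrix)
    m, k = size(A, 1), size(A, 2)
    n = size(B, 2)
    Abuf = Array(eltype(A), MC, KC)             # contiguous packing buffers
    Bbuf = Array(eltype(B), KC, NC)
    for jc = 1:NC:n, pc = 1:KC:k
        jb, pb = min(NC, n - jc + 1), min(KC, k - pc + 1)
        for j = 1:jb, p = 1:pb                  # pack a panel of B
            Bbuf[p, j] = B[pc + p - 1, jc + j - 1]
        end
        for ic = 1:MC:m
            ib = min(MC, m - ic + 1)
            for p = 1:pb, i = 1:ib              # pack a block of A
                Abuf[i, p] = A[ic + i - 1, pc + p - 1]
            end
            for j = 1:jb, p = 1:pb, i = 1:ib    # multiply the packed panels
                C[ic + i - 1, jc + j - 1] += Abuf[i, p] * Bbuf[p, j]
            end
        end
    end
    return C
end

The real payoff comes from replacing the innermost triple loop with a highly tuned microkernel, which is exactly the part that is hard to reproduce in generic Julia code.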

@Jutho
Contributor Author

Jutho commented May 22, 2014

An update on the current status: the cache-oblivious approach works well for transpose / ctranspose. I don't manage to get it to work well for matrix multiplication, and I don't have the expertise or time to improve and experiment with it further. If there is interest in merging just the work on transpose! and leaving generic_matmatmul unaltered for now, I can push a final commit removing transposeold!.

@tkelman
Contributor

tkelman commented May 26, 2014

You may find http://www.eecs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf interesting as far as the state of the art in cache-oblivious matrix multiplication goes.

@ViralBShah
Member

Let's merge the transpose work.

@Jutho
Contributor Author

Jutho commented May 26, 2014

The paper on matrix multiplication seems to be about parallelized matrix multiplication (either on a multicore machine with shared memory or even on a multi-node setup with MPI). This could certainly be interesting for implementing a more efficient matrix multiplication algorithm for SharedMatrix, but it doesn't help my case very much (which corresponds to the case P=1 in algorithm 2).

I will clean up the transpose work.

@tkelman
Contributor

tkelman commented May 26, 2014

Whoops, good point. I thought they were cache-oblivious in the single-node case as well; evidently the base case of one processor just uses MKL there.

@Jutho
Contributor Author

Jutho commented May 26, 2014

I seem to have done something wrong trying to rebase this onto the current master. My git skills are still kind of underdeveloped. Any help with restoring this?

@kmsquire
Member

There may be better or other ways, but assuming you don't mind squashing the commits (which is probably a good idea), this should work:

# from your branch
# create a diff
git diff master > patch.diff
git co master
# move your branch out of the way
git branch -m jh/cacheoblivious jh/cacheoblivious.old
# create and check out a new branch with the same name
git co -b jh/cacheoblivious
# patch the branch
patch -p1 < patch.diff
# double check
git diff master
# commit and push
git commit -a
git push -f YOUR_REPO jh/cacheoblivious

@kmsquire
Member

For safety,
git diff master >> patch.diff
should be
git diff master > patch.diff
(Updated above.)

@kmsquire
Member

Whoops! And the < doesn't belong in the last line:
git push -f YOUR_REPO jh/cacheoblivious

@Jutho
Contributor Author

Jutho commented May 26, 2014

Thanks. I don't fully understand what happened before: according to the compare from jh/cacheoblivious in my repo to JuliaLang/julia master there was also only one file changed (array.jl), so I don't know how all these other commits got included in this pull request. But thanks to @kmsquire that's now fixed.

Now let's wait for Travis and then this should be ready to merge.

stevengj added a commit that referenced this pull request May 26, 2014
WIP: cache oblivious linear algebra algorithms
@stevengj stevengj merged commit 57b18ef into JuliaLang:master May 26, 2014
@ViralBShah
Member

Should this be included in NEWS?

@stevengj
Member

@ViralBShah, for the new ctranspose! function you mean?

@ViralBShah
Member

I was referring to the improved performance aspect.

@stevengj
Member

There are so many performance improvements and bug fixes in 0.3 that I doubt it is worth it to document them individually in the NEWS; it is better for the NEWS to focus on API changes.

@IainNZ
Member

IainNZ commented May 27, 2014

I believe this broke Grid (timholy/Grid.jl#26) and LIBSVM (JuliaML/LIBSVM.jl#5) due to the changes to transpose.

E.g. for Grid.jl we now get

ERROR: no method ctranspose(Array{Array{Float64,1},2})
 in include at boot.jl:244 (repeats 2 times)
 in include_from_node1 at loading.jl:128

and for LIBSVM.jl

ERROR: no method ctranspose(Array{Any,2})
 in include at boot.jl:244
 in include_from_node1 at loading.jl:128

Not sure if this was an intentional breaking change.

@stevengj
Member

I don't think this was intentional. Can you file an issue?

@IainNZ
Member

IainNZ commented May 27, 2014

Done
