WIP: cache oblivious linear algebra algorithms #6690
Conversation
Thank you so much for tackling this |
I think we could have a |
@Jutho Great work. I think it would be useful if all the matrix multiplication functions followed the BLAS convention. |
I fully agree (although I have apparently misplaced the tA and tB arguments in my gemmnew!). Note that these arguments are currently not yet being used anyway (I just wanted to check the proof of concept). |
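For concreteness, here is a minimal sketch (not the code in this PR; the name gemmnew! is reused purely for illustration) of what following the BLAS argument convention would look like, i.e. C = alpha*op(A)*op(B) + beta*C with op determined by tA and tB. The naive triple loop only stands in for the actual cache oblivious kernel:
# Minimal sketch only; the transpose copies and the triple loop are for simplicity.
function gemmnew!(tA::Char, tB::Char, alpha, A::StridedMatrix, B::StridedMatrix,
                  beta, C::StridedMatrix)
    opA = tA == 'N' ? A : (tA == 'T' ? transpose(A) : ctranspose(A))
    opB = tB == 'N' ? B : (tB == 'T' ? transpose(B) : ctranspose(B))
    m, k = size(opA)
    k2, n = size(opB)
    (k == k2 && size(C) == (m, n)) || throw(DimensionMismatch("gemmnew!"))
    for j = 1:n, i = 1:m
        s = zero(eltype(C))
        for l = 1:k
            s += opA[i, l] * opB[l, j]
        end
        C[i, j] = alpha * s + beta * C[i, j]
    end
    return C
end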
This is really amazing. Will try it soon. |
Awesome! |
Well, don't become too enthusiastic too soon. The transpose! code is certainly working very well and is rather insensitive to the size of the base block, so I am currently getting equally good results with baselength down to 64, meaning something like 8x8 matrices, which is much smaller than the maximal size that would fit in level 1 cache. Matrix multiplication is more difficult. It is only competitive if you make the blocks so that you use level 1 cache more or less completely. There is a decrease in efficiency by making them smaller. So it is not really cache oblivious. For arbitrary strided matrices with all strides bigger than 1, you then still have to copy to preallocated storage etc. It is apparently known that cache oblivious matrix multiplication is harder and requires a really good microkernel for the base problem: |
Yes, matrix multiplication seems to be harder to do well. At the lowest level, the optimal base cases are actually non-square because of the difference in the number of loads and stores. @matteo-frigo has some experience with optimizing cache-oblivious matrix multiplications and may be able to comment further. |
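For readers following along, this is a minimal sketch (not the implementation in this PR) of the cache oblivious recursion for out-of-place transposition: keep halving the larger dimension until the block is small, then transpose the base block directly. The 8x8 base case and the name transpose_co! are illustrative choices only:
function transpose_co!(B::StridedMatrix, A::StridedMatrix,
                       ia::Int=1, ja::Int=1, m::Int=size(A,1), n::Int=size(A,2))
    if m <= 8 && n <= 8
        # base case: transpose a small block element by element
        for j = ja:ja+n-1, i = ia:ia+m-1
            B[j, i] = A[i, j]
        end
    elseif m >= n
        # split the rows of A
        m2 = div(m, 2)
        transpose_co!(B, A, ia, ja, m2, n)
        transpose_co!(B, A, ia + m2, ja, m - m2, n)
    else
        # split the columns of A
        n2 = div(n, 2)
        transpose_co!(B, A, ia, ja, m, n2)
        transpose_co!(B, A, ia, ja + n2, m, n - n2)
    end
    return B
end
A call like transpose_co!(zeros(size(A,2), size(A,1)), A) then transposes the whole matrix.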
The current implementation of the cache oblivious routines typically beats the old methods. Using the following benchmark code:
function timetranspose(D::Int)
D1=rand(div(D,2):D)
D2=rand(div(D,2):D)
A=randn(D1,D2)
gc()
B1=zeros(D2,D1)
t1=@elapsed for i=1:iceil(100000/D);Base.transpose!(B1,A);end
gc()
B2=zeros(D2,D1)
t2=@elapsed for i=1:iceil(100000/D);Base.transposeold!(B2,A);end
return (t1,t2)
end
timetranspose(20)
for D=[4,10,100,1000,10000]
println("$(timetranspose(D))")
end
println("-----------------")
blas_set_num_threads(1)
function timegemm(D::Int)
D1=rand(div(D,2):D)
D2=rand(div(D,2):D)
D3=rand(div(D,2):D)
A=randn(D1,D3)
B=randn(D3,D2)
gc()
C1=zeros(D1,D2)
t1=@elapsed for i=1:iceil(10000/D);Base.LinAlg.gemm!('N','N',1.0,A,B,0.0,C1);end
gc()
C2=zeros(D1,D2)
t2=@elapsed for i=1:iceil(10000/D);Base.LinAlg.generic_matmatmul!(C2,'N','N',A,B);end
gc()
C3=zeros(D1,D2)
t3=@elapsed for i=1:iceil(10000/D);Base.LinAlg.BLAS.gemm!('N','N',1.0,A,B,0.0,C3);end
return (t1,t2,t3)
end
timegemm(20)
for D=[4,10,100,1000]
println("$(timegemm(D))")
end
these are some results:
although it is not always this pronounced, and sometimes there is no difference, or the old methods even win slightly. |
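For comparison with the benchmark above, here is a minimal sketch (not the gemmnew! of this PR) of the cache oblivious recursion for C += A*B: always halve the largest of the three dimensions and recurse, falling back to a simple kernel on small subproblems. As discussed, performance stands or falls with that base kernel; the threshold 32, the sub-based views, and the name comul! are illustrative only:
# Accumulates into C (C += A*B); C must be initialised by the caller, e.g. with zeros.
function comul!(C::StridedMatrix, A::StridedMatrix, B::StridedMatrix)
    m, n = size(C)
    k = size(A, 2)
    if max(m, n, k) <= 32
        # base kernel: naive loops (a fast microkernel would go here)
        for j = 1:n, l = 1:k, i = 1:m
            C[i, j] += A[i, l] * B[l, j]
        end
    elseif m >= n && m >= k
        # split the rows of A and C
        m2 = div(m, 2)
        comul!(sub(C, 1:m2, 1:n), sub(A, 1:m2, 1:k), B)
        comul!(sub(C, m2+1:m, 1:n), sub(A, m2+1:m, 1:k), B)
    elseif n >= k
        # split the columns of B and C
        n2 = div(n, 2)
        comul!(sub(C, 1:m, 1:n2), A, sub(B, 1:k, 1:n2))
        comul!(sub(C, 1:m, n2+1:n), A, sub(B, 1:k, n2+1:n))
    else
        # split the contraction dimension
        k2 = div(k, 2)
        comul!(C, sub(A, 1:m, 1:k2), sub(B, 1:k2, 1:n))
        comul!(C, sub(A, 1:m, k2+1:k), sub(B, k2+1:k, 1:n))
    end
    return C
end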
It would be worth checking if using |
This actually decreases the performance. I made some small further optimisations to the gemmkernel function. The performance in comparison to The big problem is that for |
After reading some papers on GotoBLAS yesterday, I realised that the best approach for writing the fallback gemm! for StridedMatrix is possibly to copy subblocks of the size of the level 2 cache (instead of the level 1 cache) to normal matrices and then use the BLAS kernels (the so-called GEBP and GEBB), at least if the eltype is a BlasFloat. However, I don't know whether these kernels are just abstractions of what OpenBLAS is actually doing, or whether there are actual corresponding C functions I can call. I was unable to decipher the OpenBLAS code and find them. Does anybody have a better understanding of this? |
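To make the GotoBLAS-style idea concrete, here is a rough sketch under the same assumptions: pack level-2-cache-sized subblocks of A and panels of B into contiguous buffers and multiply those. Since the GEBP/GEBB kernels do not appear to be callable directly, a plain Base.LinAlg.BLAS.gemm! on the packed buffers stands in for the inner kernel; the name packedmul! and the block sizes mc and kc are illustrative and would need tuning:
# Sketch: computes C += A*B for Float64 matrices by packing blocks of A and
# panels of B into contiguous buffers before multiplying.
function packedmul!(C::Matrix{Float64}, A::StridedMatrix{Float64},
                    B::StridedMatrix{Float64}; mc=256, kc=256)
    m, k = size(A)
    n = size(B, 2)
    Abuf = Array(Float64, mc, kc)   # packed block of A
    Bbuf = Array(Float64, kc, n)    # packed panel of B (simplified: full width)
    for p = 1:kc:k
        pk = min(kc, k - p + 1)
        copy!(sub(Bbuf, 1:pk, 1:n), sub(B, p:p+pk-1, 1:n))
        for i = 1:mc:m
            im = min(mc, m - i + 1)
            copy!(sub(Abuf, 1:im, 1:pk), sub(A, i:i+im-1, p:p+pk-1))
            # contiguous multiplication on the packed buffers, accumulated into C
            Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, sub(Abuf, 1:im, 1:pk),
                                   sub(Bbuf, 1:pk, 1:n), 1.0, sub(C, i:i+im-1, 1:n))
        end
    end
    return C
end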
An update on the current status: the cache oblivious approach works well for transpose / ctranspose. I haven't managed to get it to work well for matrix multiplication, and I don't have the expertise or time to improve and experiment with it further. If there is interest in just merging the work on |
You may find http://www.eecs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf interesting as far as the state of the art in cache-oblivious matrix multiplication |
Let's merge the transpose work. |
The paper on matrix multiplication seems to be about parallelized matrix multiplication (either on a multicore machine with shared memory or even on a multinode setup with MPI). This could certainly be interesting for implementing a more efficient matrix multiplication algorithm for SharedMatrix, but it doesn't help my case very much (which corresponds to the case P=1 in algorithm 2). I will clean up the transpose work. |
Whoops, good point. Thought they were cache-oblivious in the single node case as well, evidently the base case of one processor is just using MKL there. |
I seem to have done something wrong trying to rebase this to the current master. My git skills are still kind of underdeveloped. Any help for restoring this? |
There may be better or other ways, but assuming you don't mind squashing the commits (which is probably a good idea), this should work:
# from your branch
# create a diff
git diff master > patch.diff
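# check out master ('co' assumes a checkout alias, e.g. git config --global alias.co checkout)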
git co master
# move your branch out of the way
git branch -m jh/cacheoblivious jh/cacheoblivious.old
# create and check out a new branch with the same name
git co -b jh/cacheoblivious
# patch the branch
patch -p1 < patch.diff
# double check
git diff master
# commit and push
git commit -a
git push -f YOUR_REPO jh/cacheoblivious |
For safety, |
Whoops! And the |
Thanks. I don't fully understand what happened before: according to the compare from jh/cacheoblivious in my repo to JuliaLang/julia master there was also only one file changed (array.jl), so I don't know how all these other commits got included in this pull request. But thanks to @kmsquire that's now fixed. Now let's wait for Travis and then this should be ready to merge. |
Should this be included in NEWS? |
@ViralBShah, for the new |
I was referring to the improved performance aspect. |
There are so many performance improvements and bug fixes in 0.3 that I doubt it is worth it to document them individually in the NEWS; it is better for the NEWS to focus on API changes. |
I believe this broke Grid (timholy/Grid.jl#26) and LIBSVM (JuliaML/LIBSVM.jl#5) due to the changes to transposes; e.g. for Grid.jl, we now get
and LIBSVM.jl
Not sure if this was an intentional breaking change. |
I don't think this was intentional. Can you file an issue? |
Done |
Following the suggestion of @stevengj in JuliaLang/LinearAlgebra.jl#108, the purpose of this PR is to rewrite some of the linear algebra routines for matrices and vectors (transpose, generic_matmatmul, ...) using cache oblivious algorithms. The current commit is just a first start; none of the old methods have been replaced yet.
Nevertheless, transposenew! already outperforms transpose! in many cases (while I have not yet seen it being slower). gemmnew! is similar in speed to (sometimes a few percent faster than) generic_matmatmul!. Questions that arose while creating this: