hrautila edited this page Jan 22, 2013 · 5 revisions

The code in this repository is the result of a diversion into the world of matrix multiplication. It all started when I noticed that the original matrix-matrix multiplication code in the matrix repository had to be quite bad. So I measured it, and it ran at a whopping 0.2 GFlops on my Lenovo W500 laptop. As a reference I used the ATLAS library GEMM function, which is available via the linalg/blas package and runs at around 5.0 GFlops.

So I started working on my code to improve its performance. The result of more than a few iterations is the set of quite low-level functions in the calgo directory. They are mostly written in C and interfaced to Go with cgo. The functions come in two flavors: aligned and unaligned access. Aligned means that the matrix data is aligned to 16 bytes, all matrix dimensions are even, and the implementation does not copy any data to a temporary buffer. Unaligned allows odd dimensions, which leave some data not aligned to 16 bytes, so those functions copy data to intermediate buffers.

Here are a couple of performance graphs from runs on my laptop, a Lenovo W500 (Core 2 Duo T9600 @ 2.8 GHz). The results look reasonably good.

https://github.com/hrautila/matops/blob/master/test/notrans.png

https://github.com/hrautila/matops/blob/master/test/transa.png

It looks better than it actually is. The ATLAS library is tuned at compile time for the hardware it is compiled on. I use the standard Ubuntu package, and I would guess that it was compiled on something other than a laptop. And in fact, if I run the tests on a Lenovo S20 workstation, the ATLAS Gemm version performs better than my code. As it should, I think.

One idea in writing the interface was that it should be easy to parallelize with goroutines. Even though goroutines are a concurrency mechanism, the runtime executes them in separate threads if GOMAXPROCS is set properly. Below is a graph of matrix multiplication on multiple cores compared to single-threaded Gemm. It was run on a Lenovo S20 with a 4-core Xeon CPU.

https://github.com/hrautila/matops/blob/master/test/parallel.png
