Comparing matmul as a ufunc, einsum and dot

Did a little benchmarking with three approaches to multiplying two small (3x3) matrices:

from __future__ import print_function
import perf  # the 'perf' benchmarking module from PyPI (since renamed pyperf)

runner = perf.Runner()
runner.timeit("dot   ", stmt="np.dot(a,a)",
              setup="import numpy as np; a = np.random.random((3,3))",
              )
runner.timeit("matmul", stmt="np.matmul(a,a)",
              setup="import numpy as np; a = np.random.random((3,3))",
              )
runner.timeit("einsum", stmt="np.einsum('ij,jk->ik', a, a)",
              setup="import numpy as np; a = np.random.random((3,3))",
              )

The results on my machine are:

.....................
dot: Mean +- std dev: 623 ns +- 42 ns
.....................
matmul: Mean +- std dev: 1.16 us +- 0.07 us
.....................
einsum: Mean +- std dev: 1.68 us +- 0.10 us
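
As a sanity check (not part of the original benchmark), the three expressions compute the same product, so the benchmark is comparing three code paths to the same answer:

import numpy as np

a = np.random.random((3, 3))

# dot, matmul and einsum all return the same 3x3 product
assert np.allclose(np.dot(a, a), np.matmul(a, a))
assert np.allclose(np.dot(a, a), np.einsum('ij,jk->ik', a, a))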

So dot is the fastest of the three. I dug into possible reasons using

valgrind --tool=callgrind python -c"import numpy as np; a = np.random.random((3,3)); b = [np.matmul(a, a) for i in range(1000)]; print(b[0])"
valgrind --tool=callgrind python -c"import numpy as np; a = np.random.random((3,3)); b = [np.dot(a, a) for i in range(1000)]; print(b[0])"

and came up with a kcachegrind analysis for each. (The original page embeds two call-graph screenshots, "matmul as a ufunc" and "kcachegrind_dot"; click through to read the fine print.)

This was after recompiling NumPy with CFLAGS='-O0 -g', but the CPU instruction counts reflect the timings: PyUFunc_GeneralizedFunction executes roughly twice as many instructions as cblas_matrixproduct, even though gemm (for this small 2d array) is called more times than DOUBLE_matmul.
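
If the extra instructions are mostly fixed per-call overhead in the ufunc machinery, the gap between matmul and dot should shrink as the matrices grow and gemm dominates the runtime. A sketch of a follow-up benchmark (the size of 100 is an arbitrary choice, and these numbers were not measured for this page):

import perf

runner = perf.Runner()
# Re-run the dot vs. matmul comparison at two sizes: overhead-dominated (3x3)
# and compute-dominated (100x100, where most of the time is spent inside gemm).
for n in (3, 100):
    setup = "import numpy as np; a = np.random.random((%d, %d))" % (n, n)
    runner.timeit("dot    n=%d" % n, stmt="np.dot(a, a)", setup=setup)
    runner.timeit("matmul n=%d" % n, stmt="np.matmul(a, a)", setup=setup)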
