# Comparing matmul as a ufunc, einsum and dot

*Matti Picus, Jun 6, 2018*
I did a little benchmarking of three approaches to multiplying small matrices:
```python
from __future__ import print_function
import perf

runner = perf.Runner()
runner.timeit("dot   ", stmt="np.dot(a, a)",
              setup="import numpy as np; a = np.random.random((3,3))",
              )
runner.timeit("matmul", stmt="np.matmul(a, a)",
              setup="import numpy as np; a = np.random.random((3,3))",
              )
runner.timeit("einsum", stmt="np.einsum('ij,jk->ik', a, a)",
              setup="import numpy as np; a = np.random.random((3,3))",
              )
```
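perf calibrates the loop count automatically and runs each benchmark in separate worker processes, so the script needs no arguments. (A note for anyone reproducing this today: the `perf` module has since been renamed `pyperf`; as far as I know the `Runner`/`timeit` API used here is otherwise unchanged.)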
The results on my machine are:

```
.....................
dot   : Mean +- std dev: 623 ns +- 42 ns
.....................
matmul: Mean +- std dev: 1.16 us +- 0.07 us
.....................
einsum: Mean +- std dev: 1.68 us +- 0.10 us
```
So `np.dot` is the fastest of the three. I dug into possible reasons using callgrind:

```sh
valgrind --tool=callgrind python -c "import numpy as np; a = np.random.random((3,3)); b = [np.matmul(a, a) for i in range(1000)]; print(b[0])"
valgrind --tool=callgrind python -c "import numpy as np; a = np.random.random((3,3)); b = [np.dot(a, a) for i in range(1000)]; print(b[0])"
```
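Each run writes a `callgrind.out.<pid>` file to the current directory, which can then be opened with, e.g., `kcachegrind callgrind.out.12345` (the pid here is just a placeholder) to browse the call graph.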
This produced the following kcachegrind analyses, embedded on the original page as images (click an image there to read the fine print):

[kcachegrind call graph for matmul]

[kcachegrind call graph for dot]
This was after recompiling with `CFLAGS='-O0 -g'`, but the CPU instruction counts reflect the timings: `PyUFunc_GeneralizedFunction` executes twice as many instructions as `cblas_matrixproduct`, even though `gemm` (for this small 2d array) is called more times than `DOUBLE_matmul`.
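To make the overhead story concrete, here is a minimal sketch (my addition, not part of the original analysis) built on the assumption that the gap is fixed per-call dispatch cost: if that holds, the `n=3` case should show roughly the 2x gap from the benchmark above, while at `n=300` the two should be nearly identical because the BLAS `gemm` call dominates the runtime.

```python
# Minimal sketch (an assumption-check, not from the original page):
# if matmul's extra cost is fixed per-call overhead in the ufunc
# machinery, the dot/matmul gap should shrink for larger matrices,
# where the BLAS gemm kernel dominates.
from __future__ import print_function
import timeit
import numpy as np

for n in (3, 300):
    a = np.random.random((n, n))
    for name, func in [("dot", np.dot), ("matmul", np.matmul)]:
        # best of 5 repeats of 1000 calls each
        t = min(timeit.repeat(lambda: func(a, a), number=1000, repeat=5))
        print("n=%3d %-6s %10.2f us/call" % (n, name, t * 1e6 / 1000))
```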
Thanks to the Moore and Sloan Foundations and BIDS for allowing me this opportunity.