The initial implementation of matrix multiplication was a simple triple loop that did not even consider cache-friendly array indexing. The code looked like this:
for j in 0..n { // every column of B
    for i in 0..m { // every row of A
        for p in 0..k { // p runs along the shared dimension k
            C[i][j] += A[i][p] * B[p][j] // walking down a column of B jumps to a new row each step, so it misses the cache a lot
        }
    }
}
When I set out to fix its outrageously slow speed, the first thing I did was change the loop order a little:
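for i in 0..m {
    for p in 0..k { // p runs along the shared dimension k
        for j in 0..n {
            C[i][j] += A[i][p] * B[p][j] // the inner loop now walks rows of B and C contiguously, so far fewer cache misses
        }
    }
}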
The reordering made it about 2x faster (test size is 512^3).
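For reference, here is a minimal runnable sketch of the two loop orders. It assumes the matrices are stored as flat row-major `f32` slices, which the snippets above do not specify, and the function names `matmul_jip` and `matmul_ipj` are made up for illustration. Any speedup you measure will depend on the machine and the compiler flags.

```rust
/// Naive j-i-p order. Assumes flat row-major storage:
/// `a` is m x k, `b` is k x n, and the result is m x n.
fn matmul_jip(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for j in 0..n {
        for i in 0..m {
            for p in 0..k {
                // b[p * n + j] moves n elements per step: a strided column walk.
                c[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    c
}

/// Reordered i-p-j version with the same storage assumptions.
fn matmul_ipj(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                // Both c and b are read and written contiguously along j.
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}
```

Both functions compute the same product; only the memory access pattern of the innermost loop differs.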
Then I found that the Rust compiler can often auto-vectorize iterator-based loops with SIMD, so I converted the indexing to iterators and got a speedup of about 40x.
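I am not reproducing the exact code here, but the idea can be sketched roughly as follows. This again assumes flat row-major `f32` slices and nonzero `k` and `n`. The name `matmul_iter` is hypothetical. Replacing index arithmetic with `chunks_exact` and `zip` lets the compiler drop most bounds checks in the inner loop, which tends to make auto-vectorization easier, though whether SIMD is actually emitted depends on the optimization level and target.

```rust
/// Iterator-based i-p-j matrix multiply (a sketch, not the original code).
/// Assumes flat row-major storage: `a` is m x k, `b` is k x n, result is m x n.
/// `chunks_exact` panics on a chunk size of 0, so k and n must be nonzero.
fn matmul_iter(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "a must hold m * k elements");
    assert_eq!(b.len(), k * n, "b must hold k * n elements");

    let mut c = vec![0.0f32; m * n];
    // Pair each row of C with the matching row of A.
    for (c_row, a_row) in c.chunks_exact_mut(n).zip(a.chunks_exact(k)) {
        // Pair each element A[i][p] with row p of B.
        for (&a_ip, b_row) in a_row.iter().zip(b.chunks_exact(n)) {
            // Contiguous multiply-add over one row of C and one row of B.
            for (c_ij, &b_pj) in c_row.iter_mut().zip(b_row) {
                *c_ij += a_ip * b_pj;
            }
        }
    }
    c
}
```

For example, `matmul_iter(&a, &b, 2, 3, 2)` with a 2x3 `a` and a 3x2 `b` returns the four elements of the 2x2 product in row-major order.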
The permute operation involves the reordering of the actual data, so at least you have to traverse all elements in the ndarray, which indicates a