Trivial Details...

Matrix Multiplication

The initial implementation of matrix multiplication was a simple triple loop that did not even consider cache-friendly array indexing. The code looked like this:

for j in 0..n { // every column of B
    for i in 0..m { // every row of A
        for k in 0..p { // p is the shared dimension (columns of A, rows of B)
            C[i][j] += A[i][k] * B[k][j]; // the inner loop walks down a column of B, so it misses the cache a lot
        }
    }
}

I changed the loop order slightly when I set out to fix its outrageously slow speed:

for i in 0..m {
    for k in 0..p {
        for j in 0..n {
            C[i][j] += A[i][k] * B[k][j]; // the inner loop now walks along rows of B and C, so accesses are contiguous
        }
    }
}

The reordering makes it about 2x faster (test size is 512^3).

Then I found that the Rust compiler can often auto-vectorize iterator-based loops with SIMD, so I converted the indexing to iterators and got about a 40x speedup.
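As a rough illustration of what "converting the indexing to iterating" can look like, here is a minimal sketch. It is not the exact code from the project: the function name `matmul_iter`, the flat row-major `&[f32]` layout, and the dimension name `p` are all assumptions made for the example. The innermost loop zips a row of C with a row of B, so there are no bounds checks inside it, which is usually what lets the compiler emit SIMD instructions.

```rust
/// Multiplies an m x p matrix `a` by a p x n matrix `b` and returns the
/// m x n product. All matrices are flat, row-major slices (an assumption
/// of this sketch, not necessarily the layout used in the project).
fn matmul_iter(a: &[f32], b: &[f32], m: usize, p: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * p);
    assert_eq!(b.len(), p * n);
    let mut c = vec![0.0f32; m * n];
    // chunks_exact panics on a chunk size of 0, so handle empty dimensions first.
    if p == 0 || n == 0 {
        return c;
    }
    // Pair every row of A with the matching row of C.
    for (a_row, c_row) in a.chunks_exact(p).zip(c.chunks_exact_mut(n)) {
        // Pair every element A[i][k] with row k of B.
        for (&a_ik, b_row) in a_row.iter().zip(b.chunks_exact(n)) {
            // Contiguous, bounds-check-free inner loop: C[i][j] += A[i][k] * B[k][j].
            for (c_ij, &b_kj) in c_row.iter_mut().zip(b_row) {
                *c_ij += a_ik * b_kj;
            }
        }
    }
    c
}
```

Calling `matmul_iter(&a, &b, m, p, n)` keeps the same i-k-j loop order as the reordered version above; only the way elements are reached changes. Whether the loop is actually vectorized depends on the optimization level and target, so it is worth checking with a release build.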

NdArray Permute

The permute operation reorders the actual data, so it has to visit every element of the ndarray at least once, which gives $O(N)$ time complexity. Initially I implemented it as a recursive DFS. I was shocked to find that permutation consumed even more resources than matrix multiplication (though that is another story, and permute should not take the blame for it). A simple optimization is to replace the recursion with a loop, which makes it 20x faster for high-dimensional ndarrays (like 512^3 with more than 4 dims).
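One way to replace the recursion with a loop is an "odometer" over the output index, sketched below. This is an assumed approach for illustration, not necessarily how the project does it: the function name `permute`, the flat row-major `&[f32]` storage, and the convention that output axis `d` takes its data from input axis `axes[d]` are all choices made for the example.

```rust
/// Returns the permuted data and its new shape. `src` is row-major with
/// dimensions `shape`; output axis d corresponds to input axis axes[d].
/// Assumes `axes` is a valid permutation of 0..shape.len().
fn permute(src: &[f32], shape: &[usize], axes: &[usize]) -> (Vec<f32>, Vec<usize>) {
    let nd = shape.len();
    assert_eq!(axes.len(), nd);
    assert_eq!(src.len(), shape.iter().product::<usize>());

    // Row-major strides of the source array.
    let mut strides = vec![1usize; nd];
    for d in (0..nd.saturating_sub(1)).rev() {
        strides[d] = strides[d + 1] * shape[d + 1];
    }
    // Shape of the output, and the source stride that each output axis steps by.
    let out_shape: Vec<usize> = axes.iter().map(|&ax| shape[ax]).collect();
    let out_strides: Vec<usize> = axes.iter().map(|&ax| strides[ax]).collect();

    let mut out = Vec::with_capacity(src.len());
    let mut index = vec![0usize; nd]; // current multi-index into the output
    let mut offset = 0usize; // matching flat offset into the source
    for _ in 0..src.len() {
        out.push(src[offset]);
        // Advance the odometer: bump the last axis and carry leftwards.
        for d in (0..nd).rev() {
            index[d] += 1;
            offset += out_strides[d];
            if index[d] < out_shape[d] {
                break;
            }
            // This axis wrapped around: reset it and carry into the next one.
            offset -= out_strides[d] * out_shape[d];
            index[d] = 0;
        }
    }
    (out, out_shape)
}
```

For example, `permute(&data, &[2, 3], &[1, 0])` transposes a 2x3 matrix into a 3x2 one. The output is written sequentially while the source offset is updated incrementally, so no per-element function call or full index-to-offset recomputation is needed.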
