Skip to content

Latest commit

 

History

History
35 lines (26 loc) · 1.45 KB

note.md

File metadata and controls

35 lines (26 loc) · 1.45 KB

Trivial Details...

Matrix Multiplication

The initial implementation of matrix multiplication is in a simple triple-loop way, even not considering cache-friendly array indexing. The code looks like

for j in 0..n { // every column of B
    for i in 0..m { // every row of A
        for k in 0..k {
            C[i][j] += A[i][k] * B[k][j] // indexing B by every row makes cache miss a lot
        }
    }
}

I modified the indexing order a little when I was going to optimize its outrageously slow speed...

for i in 0..m {
    for k in 0..k {
    	for j in 0..n {
            C[i][j] += A[i][k] * B[k][j] // no cache miss now
        }
    }
}

The reordering makes it about 2x faster (test size is 512^3).

Then, I found the rust compiler could usually apply auto-vectorizing optimization to iterations easily by SIMD, so I converted the indexing to iterating and got about 40 x faster.

NdArray Permute

The permute operation involves the reordering of the actual data, so at least you have to traverse all elements in the ndarray, which indicates a $O(N)$ time complexity. Initially I implement it by a DFS recursive calling. It is very shocking that I find the permutation even consumes more resources than multiplication operation (while this is another story, should not blame for permute). A simple optimization is replacing recursion with loop, and it gets 20x faster for high dimension ndarray (like 512^3 with more than 4 dims).