
column-major access and other perf improvements for generic triangular solves #14475

Merged (1 commit) on Jan 10, 2016

Conversation

@Sacha0 (Member) commented Dec 23, 2015

This pull request addresses JuliaLang/LinearAlgebra.jl#293 by replacing the naivesub! methods in base/linalg/triangular.jl with column-major-access versions. The new methods also benefit from @inbounds decoration and direct indexing of the object underlying the triangular object; each of these two modifications improves performance significantly on top of the change to column-major access. Benchmark code:

using Benchmarks

m = 5000;
bo = rand(m);
b = copy(bo);   # work on a copy so copy!(b, bo) below resets the right-hand side
A = eye(m) + randn(m, m)/(2m);
ltA = LowerTriangular(A);
utA = UpperTriangular(A);
ultA = Base.LinAlg.UnitLowerTriangular(A);
uutA = Base.LinAlg.UnitUpperTriangular(A);

# Benchmark
for (matstr, mat) in (
    ("LowerTriangular", ltA), ("UnitLowerTriangular", ultA),
    ("UpperTriangular", utA), ("UnitUpperTriangular", uutA) )
    copy!(b, bo)
    println(matstr)
    show(@benchmark Base.LinAlg.naivesub!(mat, b))
    println()
end

On master these benchmarks yield:

LowerTriangular
================ Benchmark Results ========================
     Time per evaluation: 143.57 ms [141.11 ms, 146.02 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 67
   Number of evaluations: 67
 Time spent benchmarking: 9.78 s

UnitLowerTriangular
================ Benchmark Results ========================
     Time per evaluation: 167.37 ms [164.84 ms, 169.91 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 57
   Number of evaluations: 57
 Time spent benchmarking: 9.73 s

UpperTriangular
================ Benchmark Results ========================
     Time per evaluation: 142.41 ms [140.09 ms, 144.72 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 69
   Number of evaluations: 69
 Time spent benchmarking: 10.00 s

UnitUpperTriangular
================ Benchmark Results ========================
     Time per evaluation: 166.16 ms [163.87 ms, 168.46 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 57
   Number of evaluations: 57
 Time spent benchmarking: 9.65 s

On this PR's branch these benchmarks yield:

LowerTriangular
================ Benchmark Results ========================
     Time per evaluation: 9.13 ms [8.57 ms, 9.68 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 1.00 s

UnitLowerTriangular
================ Benchmark Results ========================
     Time per evaluation: 9.19 ms [8.60 ms, 9.77 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 0.96 s

UpperTriangular
================ Benchmark Results ========================
     Time per evaluation: 10.64 ms [10.03 ms, 11.26 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 1.12 s

UnitUpperTriangular
================ Benchmark Results ========================
     Time per evaluation: 10.65 ms [10.05 ms, 11.24 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 1.11 s
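
For reference, the column-major pattern behind these numbers looks roughly like the following sketch for the LowerTriangular case (illustrative only; the merged methods additionally handle a separate output vector x, dimension checks, and the unit-diagonal variants):

# Illustrative sketch of column-major forward substitution for a LowerTriangular
# matrix; overwrites b with the solution of A*x = b. Not the exact merged code.
function naivesub_sketch!(A::LowerTriangular, b::AbstractVector)
    n = length(b)
    @inbounds for j in 1:n
        A.data[j,j] == zero(A.data[j,j]) && throw(Base.LinAlg.SingularException(j))
        xj = b[j] = A.data[j,j] \ b[j]
        for i in j+1:n   # march down column j: stride-1, cache-friendly access
            b[i] -= A.data[i,j] * xj
        end
    end
    return b
end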

Point of curiosity: In the methods for upper triangular matrices, I expected backward iteration within columns (i = j-1:-1:1, with j the column index and i the row index) to perform better than forward iteration within columns (i = 1:j-1), given the immediately preceding accesses of x[j], b[j], and A[j,j]. To my surprise, forward iteration is faster: forward iteration in the upper triangular methods yields performance parity with the lower triangular methods. The native code looks essentially identical. Something with cache behavior? Thoughts? I left the backward iteration in for now, suspecting that this performance observation is hardware-specific and that forward iteration might be surprising to readers.
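
For concreteness, the two inner-loop orderings under discussion look roughly like this in the upper triangular back-substitution (illustrative sketch, assuming A::UpperTriangular and b, x, and n as in the methods):

@inbounds for j in n:-1:1
    A.data[j,j] == zero(A.data[j,j]) && throw(Base.LinAlg.SingularException(j))
    xj = x[j] = A.data[j,j] \ b[j]
    for i in j-1:-1:1   # backward within column j, the ordering kept in this PR
        b[i] -= A.data[i,j] * xj
    end
    # The forward ordering that benchmarked faster here would instead be:
    #     for i in 1:j-1
    #         b[i] -= A.data[i,j] * xj
    #     end
end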

I was surprised by how much direct indexing (A.data) impacted performance, particularly in the unit triangular case. I wonder whether the overhead of indexing through the triangular wrapper could be reduced. Thoughts? I will play with this a bit and open an issue if I find anything interesting.
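
For illustration, the extra work in wrapper indexing comes from branching on the index pair and synthesizing the structural ones and zeros, roughly as below (a paraphrase, not the actual Base definition), whereas the new methods read the parent array directly:

# Rough paraphrase of what getindex on a unit triangular wrapper must do;
# not the actual Base.LinAlg definition.
wrapped_getindex(A::Base.LinAlg.UnitLowerTriangular, i::Integer, j::Integer) =
    i > j  ? A.data[i,j] :
    i == j ? one(eltype(A)) : zero(eltype(A))

# Direct access used in the new methods:
# A.data[i,j]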

Manually hoisting the A.data reference out of the loops did not impact performance significantly.

These and related methods in triangular.jl cry out for unification. I will throw together a concept PR.

@KristofferC (Member) commented:

The Triangular types are immutable and in those cases LLVM is usually able to hoist the load.

@Sacha0 (Member, Author) commented Dec 23, 2015

Ah, thanks! :)

@Sacha0 (Member, Author) commented Dec 23, 2015

The diff was unreadable due to accidental reordering of the method definitions; it is now more readable. I noticed I introduced some extra spaces (mapped x::AbstractVector=b to x::AbstractVector = b); if the former style is preferred let me know and I will remove them.

A Contributor commented on the following lines of the diff:

else
x[j] = Ajj\xj
@inbounds for j in n:-1:1
A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))

might help to do Ajj = A.data[j,j] once instead of 3x

@Sacha0 (Member, Author) replied:

I thought so as well, so I benchmarked: no significant performance impact. I suppose the compiler is intelligent enough to perform that optimization automagically?

For the benefit of future readers, I added sentences alongside the other optimization notes addressing this optimization and the manual hoisting of A.data out of the loops.
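
For the record, the hoisted form suggested above would look roughly like this (illustrative; it benchmarked on par with reloading A.data[j,j] at each use):

Ajj = A.data[j,j]   # load the diagonal entry once per column
Ajj == zero(Ajj) && throw(Base.LinAlg.SingularException(j))
xj = x[j] = Ajj \ b[j]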

Thanks!

Commit message excerpt: …Also decorate these methods with @inbounds and directly index the object underlying the triangular object. Closes #14471.
@andreasnoack (Member) commented:

This is great, although these methods shouldn't be called for most element types where memory access is the bottleneck (since those should use BLAS).

Usually we don't make comments in the source like the ones you have added, so I'd suggest that they be removed. You can put them in a comment on this issue; that way, we can find the conclusions from your benchmarking again when we need them.

@tkelman (Contributor) commented Dec 28, 2015

Lack of comments in the source is a bug, not a feature.

@nalimilan (Member) commented:

> Lack of comments in the source is a bug, not a feature.

+100. I think we should push for more comments from all developers.

@Sacha0 (Member, Author) commented Dec 28, 2015

> these methods shouldn't be called for most element types where memory access is the bottleneck (since those should use BLAS).

Absolutely. A more realistic test case for such methods would be great. A performant, light-weight extended-precision type like double-double came to mind. Can you think of others?
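
For instance, something like the following (a hypothetical sketch, with BigFloat standing in for a double-double type) would exercise the generic path rather than BLAS:

# Hypothetical sketch: a non-BLAS element type forces the generic naivesub! path.
m = 200
A = LowerTriangular(eye(BigFloat, m) + map(BigFloat, randn(m, m)/(2m)))
b = map(BigFloat, rand(m))
Base.LinAlg.naivesub!(A, b)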

@Sacha0 (Member, Author) commented Jan 10, 2016

Consensus on the comments? Should they stay or go? Is this otherwise in shape to merge? Thanks!

@andreasnoack (Member) commented:

Sorry that this stalled. The level of detail in the comments here is higher than I find suitable (e.g., I think the observations you made could easily change with compiler changes), but this is really a minor issue, and @tkelman and @nalimilan are in favor, so I'll merge it as is. Thanks for the contribution and patience.

@andreasnoack added a commit referencing this pull request on Jan 10, 2016:
column-major access and other perf improvements for generic triangular solves
@andreasnoack merged commit a6770b0 into JuliaLang:master on Jan 10, 2016
@Sacha0 (Member, Author) commented Jan 10, 2016

> Sorry that this stalled.

No need for apology! Only so many hours in the day, and it was a holiday besides :).

> Thanks for the contribution and patience.

Thanks for the review and merge! Best,
