
column-major access and other perf improvements for generic triangular solves #14475

Merged (1 commit) on Jan 10, 2016

Conversation

@Sacha0 (Member) commented Dec 23, 2015

This pull request addresses JuliaLang/LinearAlgebra.jl#293 by replacing the naivesub! methods in base/linalg/triangular.jl with column-major-access versions. The new methods also benefit from @inbounds decoration and direct indexing of the object underlying the triangular object; each of these two modifications improves performance significantly on top of the change to column-major access. Benchmark code:

using Benchmarks

m = 5000;
bo = rand(m);
b = copy(bo);   # work on a copy so copy!(b, bo) below resets the right-hand side
A = eye(m) + randn(m, m)/(2m);
ltA = LowerTriangular(A);
utA = UpperTriangular(A);
ultA = Base.LinAlg.UnitLowerTriangular(A);
uutA = Base.LinAlg.UnitUpperTriangular(A);

# Benchmark
for (matstr, mat) in (
    ("LowerTriangular", ltA), ("UnitLowerTriangular", ultA),
    ("UpperTriangular", utA), ("UnitUpperTriangular", uutA) )
    copy!(b, bo)
    println(matstr)
    show(@benchmark Base.LinAlg.naivesub!(mat, b))
    println()
end

On master these benchmarks yield:

LowerTriangular
================ Benchmark Results ========================
     Time per evaluation: 143.57 ms [141.11 ms, 146.02 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 67
   Number of evaluations: 67
 Time spent benchmarking: 9.78 s

UnitLowerTriangular
================ Benchmark Results ========================
     Time per evaluation: 167.37 ms [164.84 ms, 169.91 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 57
   Number of evaluations: 57
 Time spent benchmarking: 9.73 s

UpperTriangular
================ Benchmark Results ========================
     Time per evaluation: 142.41 ms [140.09 ms, 144.72 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 69
   Number of evaluations: 69
 Time spent benchmarking: 10.00 s

UnitUpperTriangular
================ Benchmark Results ========================
     Time per evaluation: 166.16 ms [163.87 ms, 168.46 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 57
   Number of evaluations: 57
 Time spent benchmarking: 9.65 s

On this PR's branch these benchmarks yield:

LowerTriangular
================ Benchmark Results ========================
     Time per evaluation: 9.13 ms [8.57 ms, 9.68 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 1.00 s

UnitLowerTriangular
================ Benchmark Results ========================
     Time per evaluation: 9.19 ms [8.60 ms, 9.77 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 0.96 s

UpperTriangular
================ Benchmark Results ========================
     Time per evaluation: 10.64 ms [10.03 ms, 11.26 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 1.12 s

UnitUpperTriangular
================ Benchmark Results ========================
     Time per evaluation: 10.65 ms [10.05 ms, 11.24 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 1.11 s
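
For reference, the column-major pattern behind these numbers looks roughly like the following sketch for the LowerTriangular case (illustrative only; the merged methods additionally handle a separate output vector x, dimension checks, and the unit-diagonal variants):

# Illustrative sketch of column-major forward substitution for a LowerTriangular
# matrix; overwrites b with the solution of A*x = b. Not the exact merged code.
function naivesub_sketch!(A::LowerTriangular, b::AbstractVector)
    n = length(b)
    @inbounds for j in 1:n
        A.data[j,j] == zero(A.data[j,j]) && throw(Base.LinAlg.SingularException(j))
        xj = b[j] = A.data[j,j] \ b[j]
        for i in j+1:n   # march down column j: stride-1, cache-friendly access
            b[i] -= A.data[i,j] * xj
        end
    end
    return b
end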

Point of curiosity: In the methods for upper triangular matrices, I expected backward iteration within columns (i = j-1:-1:1, with j the column index and i the row index) to perform better than forward iteration within columns (i = 1:j-1), given the immediately preceding accesses of x[j], b[j], and A[j,j]. To my surprise, forward iteration is faster: forward iteration in the upper triangular methods yields performance parity with the lower triangular methods. The native code looks essentially identical. Something with cache behavior? Thoughts? I left the backward iteration in for now, suspecting that this performance observation is hardware-specific and that forward iteration might be surprising to readers.
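
For concreteness, the two inner-loop orderings under discussion look roughly like this in the upper triangular back-substitution (illustrative sketch, assuming A::UpperTriangular and b, x, and n as in the methods):

@inbounds for j in n:-1:1
    A.data[j,j] == zero(A.data[j,j]) && throw(Base.LinAlg.SingularException(j))
    xj = x[j] = A.data[j,j] \ b[j]
    for i in j-1:-1:1   # backward within column j, the ordering kept in this PR
        b[i] -= A.data[i,j] * xj
    end
    # The forward ordering that benchmarked faster here would instead be:
    #     for i in 1:j-1
    #         b[i] -= A.data[i,j] * xj
    #     end
end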

I was surprised by how much direct indexing (A.data) impacted performance, particularly in the unit triangular case. I wonder whether the overhead of indexing through the triangular wrapper could be reduced. Thoughts? I will play with this a bit and open an issue if I find anything interesting.
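
For illustration, the extra work in wrapper indexing comes from branching on the index pair and synthesizing the structural ones and zeros, roughly as below (a paraphrase, not the actual Base definition), whereas the new methods read the parent array directly:

# Rough paraphrase of what getindex on a unit triangular wrapper must do;
# not the actual Base.LinAlg definition.
wrapped_getindex(A::Base.LinAlg.UnitLowerTriangular, i::Integer, j::Integer) =
    i > j  ? A.data[i,j] :
    i == j ? one(eltype(A)) : zero(eltype(A))

# Direct access used in the new methods:
# A.data[i,j]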

Manually hoisting the A.data reference out of the loops did not impact performance significantly.

These and related methods in triangular.jl cry out for unification. I will throw together a concept PR.

@KristofferC (Member) commented:

The Triangular types are immutable and in those cases LLVM is usually able to hoist the load.

@Sacha0 (Member, Author) commented Dec 23, 2015

Ah, thanks! :)

@Sacha0 (Member, Author) commented Dec 23, 2015

The diff was unreadable due to accidental reordering of the method definitions; it is now more readable. I noticed I introduced some extra spaces (mapped x::AbstractVector=b to x::AbstractVector = b); if the former style is preferred let me know and I will remove them.

A Contributor commented on the following lines of the diff:

else
x[j] = Ajj\xj
@inbounds for j in n:-1:1
A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))

might help to do Ajj = A.data[j,j] once instead of 3x

@Sacha0 (Member, Author) replied:

I thought so as well, so I benchmarked: no significant performance impact. I suppose the compiler is intelligent enough to perform that optimization automagically?

For the benefit of future readers, I added sentences alongside the other optimization notes addressing this optimization and the manual hoisting of A.data out of the loops.
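
For the record, the hoisted form suggested above would look roughly like this (illustrative; it benchmarked on par with reloading A.data[j,j] at each use):

Ajj = A.data[j,j]   # load the diagonal entry once per column
Ajj == zero(Ajj) && throw(Base.LinAlg.SingularException(j))
xj = x[j] = Ajj \ b[j]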

Thanks!

Commit message excerpt: …Also decorate these methods with @inbounds and directly index the object underlying the triangular object. Closes #14471.
@andreasnoack (Member) commented:

This is great, although these methods shouldn't be called for most element types where memory access is the bottleneck (since those should use BLAS).

Usually we don't make comments in the source like the ones you have added, so I'd suggest that they be removed. You can put them in a comment on this issue; that way, we can find the conclusions from your benchmarking again when we need them.

@tkelman (Contributor) commented Dec 28, 2015

Lack of comments in the source is a bug, not a feature.

@nalimilan (Member) commented:

> Lack of comments in the source is a bug, not a feature.

+100. I think we should push for more comments from all developers.

@Sacha0 (Member, Author) commented Dec 28, 2015

> these methods shouldn't be called for most element types where memory access is the bottleneck (since those should use BLAS).

Absolutely. A more realistic test case for such methods would be great. A performant, light-weight extended-precision type like double-double came to mind. Can you think of others?
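
For instance, something like the following (a hypothetical sketch, with BigFloat standing in for a double-double type) would exercise the generic path rather than BLAS:

# Hypothetical sketch: a non-BLAS element type forces the generic naivesub! path.
m = 200
A = LowerTriangular(eye(BigFloat, m) + map(BigFloat, randn(m, m)/(2m)))
b = map(BigFloat, rand(m))
Base.LinAlg.naivesub!(A, b)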

@Sacha0 (Member, Author) commented Jan 10, 2016

Consensus on the comments? Should they stay or go? Is this otherwise in shape to merge? Thanks!

@andreasnoack (Member) commented:

Sorry that this stalled. The level of detail in the comments here is higher than I find suitable (e.g., I think the observations you made could easily change with compiler changes), but this is really a minor issue, and @tkelman and @nalimilan are in favor, so I'll merge it as is. Thanks for the contribution and patience.

@andreasnoack added a commit referencing this pull request on Jan 10, 2016:
column-major access and other perf improvements for generic triangular solves
@andreasnoack merged commit a6770b0 into JuliaLang:master on Jan 10, 2016
@Sacha0 (Member, Author) commented Jan 10, 2016

> Sorry that this stalled.

No need for apology! Only so many hours in the day, and it was a holiday besides :).

> Thanks for the contribution and patience.

Thanks for the review and merge! Best,
