column-major access and other perf improvements for generic triangular solves #14475
Conversation
The Triangular types are immutable and in those cases LLVM is usually able to hoist the load.
Ah, thanks! :)
The diff was unreadable due to accidental reordering of the method definitions; it is now more readable. I noticed I introduced some extra spaces (mapped
```julia
else
    x[j] = Ajj\xj
@inbounds for j in n:-1:1
    A.data[j,j] == zero(A.data[j,j]) && throw(SingularException(j))
```
Might help to do `Ajj = A.data[j,j]` once instead of 3x.
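For concreteness, a minimal sketch of what that hoist could look like inside a column-major back-substitution; the function name is hypothetical and this is not the exact method from the PR:

```julia
using LinearAlgebra: UpperTriangular, SingularException

# Sketch only: hoist the diagonal load, as suggested above, in a column-major
# back-substitution that overwrites b with A \ b. Not the exact PR method.
function backsub_hoisted!(A::UpperTriangular, b::AbstractVector)
    n = size(A, 2)
    n == length(b) || throw(DimensionMismatch("matrix and vector sizes do not match"))
    @inbounds for j in n:-1:1
        Ajj = A.data[j,j]                     # one load instead of three
        Ajj == zero(Ajj) && throw(SingularException(j))
        xj = b[j] = Ajj \ b[j]                # reuse the hoisted value
        for i in j-1:-1:1                     # column-major update of the rows above
            b[i] -= A.data[i,j]*xj
        end
    end
    return b
end
```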
I thought so as well. So I benchmarked. No significant performance impact. I suppose the compiler is intelligent enough to perform that optimization automagically?
For the benefit of future readers, I added sentences alongside the other optimization notes addressing this optimization and the manual hoisting of `A.data` out of the loops.
Thanks!
…Also decorate these methods with @inbounds and directly index the object underlying the triangular object. Closes #14471.
This is great, although these methods shouldn't be called for most element types for which memory access is the bottleneck (since they should use BLAS). Usually, we don't make comments in the source like the ones you have added, so I'd suggest that they be removed. You can put them in a comment in this issue. In that way, we can find the conclusions from your benchmarking again when we need them.
Lack of comments in the source is a bug, not a feature.
+100. I think we should push for more comments from all developers.
Absolutely. A more realistic test case for such methods would be great. A performant, light-weight extended-precision type like double-double came to mind. Can you think of others?
Consensus on the comments? Should they stay or go? Is this otherwise in shape to merge? Thanks!
Sorry that this stalled. The level of detail in the comments here is higher than I find suitable, e.g. I think that the observations you made could easily change with compiler changes, but this is really a minor issue and @tkelman and @nalimilan are in favor so I'll merge as it is. Thanks for the contribution and patience.
No need for apology! Only so many hours in the day, and it was a holiday besides :).
Thanks for the review and merge! Best,
This pull request addresses JuliaLang/LinearAlgebra.jl#293 by replacing the `naivesub!` methods in `base/linalg/triangular.jl` with column-major-access versions. The new methods also benefit from `@inbounds` decoration and direct indexing of the object underlying the triangular object; each of these two modifications improves performance significantly on top of the change to column-major access.

Benchmark code:
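(A rough, self-contained sketch of the kind of row-major vs. column-major comparison the description refers to; the matrix size, element type, and timing approach below are assumptions, not the original script.)

```julia
using LinearAlgebra

# Row-major access pattern (dot-product formulation, pre-PR style).
function backsub_rowmajor!(A::AbstractMatrix, b::AbstractVector)
    n = length(b)
    @inbounds for i in n:-1:1
        s = b[i]
        for j in i+1:n
            s -= A[i,j]*b[j]      # strides across a row of a column-major array
        end
        b[i] = s / A[i,i]
    end
    return b
end

# Column-major access pattern (this PR's style).
function backsub_colmajor!(A::AbstractMatrix, b::AbstractVector)
    n = length(b)
    @inbounds for j in n:-1:1
        xj = b[j] = b[j] / A[j,j]
        for i in 1:j-1            # walks down a single column, contiguous in memory
            b[i] -= A[i,j]*xj
        end
    end
    return b
end

n = 2000
A = Matrix(UpperTriangular(rand(n, n) + n*I))   # well-conditioned upper-triangular matrix
b = rand(n)

# First calls include compilation; run twice (or use BenchmarkTools) for stable numbers.
@time backsub_rowmajor!(A, copy(b))
@time backsub_colmajor!(A, copy(b))
```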
On master these benchmarks yield:

On this PR's branch these benchmarks yield:
Point of curiosity: In the methods for upper triangular matrices, I expected backward iteration within columns (`i = j-1:-1:1`, with `j` the column index and `i` the row index) to perform better than forward iteration within columns (`i = 1:j-1`), given the immediately preceding accesses of `x[j]`, `b[j]`, and `A[j,j]`. But to my surprise the forward iteration is faster: forward iteration in the upper triangular methods yields performance parity with the lower triangular methods. The native code looks essentially identical. Something with cache behavior? Thoughts? I left the backward iteration in for now, suspecting that this performance observation is hardware-specific and that forward iteration might be surprising to readers.
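For reference, the two orderings discussed above differ only in the direction of the row sweep within each column. A small sketch, with hypothetical function names, that can be used to compare timings or generated code:

```julia
# Two otherwise identical inner-loop kernels that differ only in the direction
# of the row sweep within column j; xj is the just-computed solution entry.
# Function names are hypothetical.
function backsweep!(b, A, j, xj)       # backward within the column (kept in the PR)
    @inbounds for i in j-1:-1:1
        b[i] -= A[i,j]*xj
    end
    return b
end

function forwardsweep!(b, A, j, xj)    # forward within the column (measured slightly faster)
    @inbounds for i in 1:j-1
        b[i] -= A[i,j]*xj
    end
    return b
end

# In the REPL, the generated code can be compared with, e.g.:
# @code_native backsweep!(rand(8), rand(8, 8), 8, 1.0)
# @code_native forwardsweep!(rand(8), rand(8, 8), 8, 1.0)
```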
I was surprised by how much direct indexing (`A.data`) impacted performance, particularly in the unit triangular case. I wonder whether the overhead of indexing through the triangular wrapper could be reduced. Thoughts? I will play with this a bit and open an issue if I find anything interesting.
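The overhead presumably comes from the wrapper's `getindex`, which must map each access onto the triangular structure, whereas `A.data[i,j]` is a plain array load. A small sketch of the contrast, assuming the current LinearAlgebra wrapper behavior:

```julia
using LinearAlgebra: UpperTriangular

A = UpperTriangular(rand(4, 4))
i, j = 3, 2                      # a position below the diagonal

# Through the wrapper: getindex must map (i, j) onto the triangular structure,
# returning a structural zero below the diagonal (and, for the unit variants,
# an implicit one on the diagonal). That mapping happens on every wrapper access.
A[i, j]        # == 0.0

# Direct access to the underlying storage: a plain array load with no
# structural check. Only valid when the loop itself never reads the
# structurally zero (or unit-diagonal) part, as in the solves above.
A.data[i, j]   # whatever value happens to be stored there
```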
Manually hoisting the `A.data` reference out of the loops did not impact performance significantly.

These and related methods in `triangular.jl` cry out for unification. I will throw together a concept PR.