Don't force loop unrolling in reductions #494
base: master
Conversation
Tests fail because reductions now use SIMD and the tests check …
Happy to switch to …
Inlining is fixed by JuliaLang/julia#29258, so now there is a reason to finish this up :)
Force-pushed from 79b7283 to d233823.
Updated. Some benchmarks made on JuliaLang/julia#29258:

```julia
using BenchmarkTools
using StaticArrays
x = rand(MMatrix{8,8}); s = rand(SMatrix{8,8});
```

Before:

```julia
julia> @btime map!(x -> x*2, $x, $s);
  11.056 ns (0 allocations: 0 bytes)

julia> @btime sum($s);
  29.363 ns (0 allocations: 0 bytes)

julia> @btime sum(abs2, $s);
  124.234 ns (3 allocations: 1.08 KiB)

julia> @btime mapreduce(x->x^2, +, $s; init=0.5);
  134.850 ns (4 allocations: 1.09 KiB)
```

After:

```julia
julia> @btime map!(x -> x*2, $x, $s);
  12.783 ns (0 allocations: 0 bytes)

julia> @btime sum($s);
  12.906 ns (0 allocations: 0 bytes)

julia> @btime sum(abs2, $s);
  12.283 ns (0 allocations: 0 bytes)

julia> @btime mapreduce(x->x^2, +, $s; init=0.5);
  12.399 ns (0 allocations: 0 bytes)
```
Force-pushed from d233823 to 79e851e.
Arguably the old methods should be left and we should …
src/linalg.jl (Outdated)

```julia
return quote
    @_inline_meta
    @inbounds return $expr
s = zero(promote_op(*, eltype(a), eltype(b)))
```
Pity... I was just deleting code which relies on inference...
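For context (my addition, not part of the thread): `promote_op` asks type inference for the result type of the operation, which is the dependence on inference being referred to here. For example:

```julia
julia> Base.promote_op(*, Float64, Int)   # inferred result type of Float64 * Int
Float64
```

The `zero(Union{})` errors mentioned later in the thread come from exactly this path when the inferred element type collapses to `Union{}`.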
I can move this to only be used in the zero length case again.
Yes, that would be good, I think. Will the SIMD loop vectoriser still work if you peel off the first element in the loop, though? (Or maybe there is a better way?)
By the way, this was the one place where the `Union{}` eltype created errors (because of `zero(Union{})`).
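Not from the thread, but to make the peeling idea concrete, here is a minimal sketch (hypothetical helper name, not the actual StaticArrays internals) of seeding the accumulator from the first element so `zero` of a possibly-`Union{}` eltype is never needed; whether LLVM still vectorizes the remaining `2:length(a)` loop is exactly the open question above:

```julia
# Hypothetical sketch only -- not the StaticArrays implementation.
# Peel off the first element to seed the accumulator, so the reduction
# never has to call zero() on an inferred (possibly Union{}) eltype.
# The zero-length case would still need separate handling.
@inline function mapreduce_peeled(f, op, a)
    @inbounds s = f(a[1])            # peeled first iteration seeds the accumulator
    @inbounds for i in 2:length(a)   # remainder stays a plain loop for LLVM
        s = op(s, f(a[i]))
    end
    return s
end
```

Usage would look like `mapreduce_peeled(abs2, +, SVector(1.0, 2.0, 3.0))` with StaticArrays loaded.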
One interesting thing about this: this optimization goes with the assumption that …

Do you have benchmarks for small vectors and matrices? It's important (to me at least) that we don't slow down any sums or norms or dot products of 2-vectors or 3-vectors, or operations involving 2x2 or 3x3 matrices.
I was also wondering about this.

```julia
using BenchmarkTools
using StaticArrays
using LinearAlgebra

for siz in (1,2,3,4,8)
    println("size = $siz x $siz")
    # Refs to avoid inlining into benchmark loop
    s = Ref(rand(SMatrix{siz, siz}))
    @btime sum(abs2, $s[]);
    @btime sum($s[]);
    @btime mapreduce(x->x^2, +, $s[]; init=0.5);
    @btime dot($s[], $s[])
    @btime norm($s[], 1)
    @btime norm($s[], 2)
    @btime norm($s[], 5)
    @btime det($s[])
end
```
Doesn't seem to be an improvement in all cases; will look at it more closely.
Force-pushed from 5eb628a to dfdfd7c.
also do linalg dont force unroll loop in reductions
Force-pushed from dfdfd7c to e283ff5.
Removed the …
Could you try un-reverting that change and see if it is better now? I would like to see this get merged.
Instead of unrolling the whole loop in the reduction, we can just keep the loop and let LLVM do its job with loop unrolling (since the number of loop iterations is known).
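To make that concrete, here is a minimal sketch (my illustration, not this PR's actual inner function) of the difference between emitting a fully unrolled expression and keeping a plain loop whose trip count is known from the type:

```julia
using StaticArrays

# Illustrative sketch only -- not the code in this PR.
# Old approach: a @generated function splices out one op(...) call per
# element, i.e. a fully unrolled expression such as
#     op(op(op(f(a[1]), f(a[2])), f(a[3])), f(a[4]))
# Sketch of the new approach: keep an ordinary loop. length(a) is a
# compile-time constant for a StaticArray, so LLVM can decide for itself
# how much to unroll and whether to vectorize.
@inline function loop_mapreduce(f, op, init, a::StaticArray)
    s = init
    @inbounds @simd for i in 1:length(a)
        s = op(s, f(a[i]))
    end
    return s
end

loop_mapreduce(abs2, +, 0.0, @SVector rand(8))
```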
Calling this inner function directly, we can see that for larger matrices this can give a good speedup:
After PR:
Before PR:
The reason for this is that the loop vectorizer has a much easier time with the loop than the SLP vectorizer has with the unrolled code.
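One way to see this (my addition, assuming the `loop_mapreduce` sketch above) is to look for `<N x double>`-style vector instructions in the emitted IR:

```julia
# Assumes the loop_mapreduce sketch above; @code_llvm lives in
# InteractiveUtils, which the REPL loads by default.
using InteractiveUtils, StaticArrays

s = @SVector rand(8)
@code_llvm loop_mapreduce(abs2, +, 0.0, s)   # look for <4 x double> etc. in the output
```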
Now, why am I calling the inner function directly? Well, for some reason, Julia refuses to inline this new function. Instead, we get:
and the result is not pretty:
But if we can fix that inlining problem, this is probably worth doing.