
Don't force loop unrolling in reductions #494

Open

KristofferC wants to merge 2 commits into JuliaArrays:master from KristofferC:kc/loop

Conversation

@KristofferC (Contributor) commented Sep 13, 2018

Instead of unrolling the whole loop in the reduction, we can just keep the loop and let LLVM do its job with loop unrolling (since the number of loop iterations is known).
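
Roughly, the difference is between these two code-generation strategies; the following is only a sketch to illustrate the idea, not the actual _mapreduce code in the package:

using StaticArrays

# Fully unrolled (old approach): a generated function splices one `+` per element
# into the body, producing straight-line code.
@generated function unrolled_sum(a::StaticVector{N}) where {N}
    expr = :(a[1])
    for i in 2:N
        expr = :($expr + a[$i])
    end
    return :(@inbounds $expr)
end

# Loop-based (this PR's approach, sketched): keep an ordinary loop. The trip count
# is known from the type, so LLVM's loop unroller/vectorizer can handle it.
# (Assumes at least one element.)
function loop_sum(a::StaticVector)
    s = a[1]
    @inbounds for i in 2:length(a)
        s += a[i]
    end
    return s
end

Both versions still specialize on the static size; the difference is only in what shape of code the LLVM optimizer gets to see.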

Calling the inner _mapreduce function directly, we can see that for larger matrices this can give a good speedup:

f(s) =  StaticArrays._mapreduce(identity, +, Colon(), NamedTuple(), StaticArrays.same_size(s), s)
s = rand(SMatrix{8,8})

After PR:

julia> @btime f($s)
  10.179 ns (0 allocations: 0 bytes)

Before PR:

julia> @btime f($s)
  30.229 ns (0 allocations: 0 bytes)

The reason for this is that the loop vectorizer has a much easier time with the loop than the SLP vectorizer has with the unrolled code.
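
(One way to check this, with f and s defined as above, is to inspect the generated IR and look for <N x double> vector instructions:)

julia> @code_llvm f(s)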

Now, why am I calling the inner function directly? Well, for some reason, Julia refuses to inline this new function. Instead, we get:

julia> @code_typed sum(s)
CodeInfo(
181 1 ─ %1 = StaticArrays.:+::typeof(+)                                          │╻  #sum#255
    │   %2 = invoke #kw##mapreduce()($(QuoteNode((dims = Colon(),)))::NamedTuple{(:dims,),Tuple{Colon}}, StaticArrays.mapreduce::typeof(mapreduce), StaticArrays.identity::Function, %1::Function, _2::SArray{Tuple{8,8},Float64,2,64})::Float64
    └──      return %2                                                                │
) => Float64

and the result is not pretty:

julia> @btime sum($s)
  121.209 ns (3 allocations: 1.08 KiB)

But if we can fix that inlining problem, this is probably worth doing.

@KristofferC (Contributor, Author) commented Sep 13, 2018

Tests fail because reductions now use SIMD and the tests check ==.

@andyferris (Member)

> Tests fail because reductions now use SIMD and the tests check ==.

Happy to switch to isapprox with an appropriate tolerance.
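
For example, something along these lines (an illustrative sketch, not the actual test suite):

using Test, StaticArrays

sa = rand(SMatrix{8,8})
# A SIMD reduction may reassociate floating-point additions, so compare against
# the Base result with a tolerance rather than with exact equality.
@test sum(sa) ≈ sum(Matrix(sa))
@test sum(abs2, sa) ≈ sum(abs2, Matrix(sa))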

@KristofferC (Contributor, Author)

Inlining is fixed by JuliaLang/julia#29258 so now there is a reason to finish this up :)

@KristofferC (Contributor, Author)

Updated.

Some benchmarks, run on JuliaLang/julia#29258:

using BenchmarkTools
using StaticArrays
x = rand(MMatrix{8,8}); s = rand(SMatrix{8,8});

Before

julia> @btime map!(x -> x*2, $x, $s);
  11.056 ns (0 allocations: 0 bytes)

julia> @btime sum($s);
  29.363 ns (0 allocations: 0 bytes)

julia> @btime sum(abs2, $s);
  124.234 ns (3 allocations: 1.08 KiB)

julia> @btime mapreduce(x->x^2, +, $s; init=0.5);
  134.850 ns (4 allocations: 1.09 KiB)

After

julia> @btime map!(x -> x*2, $x, $s);
  12.783 ns (0 allocations: 0 bytes)

julia> @btime sum($s);
  12.906 ns (0 allocations: 0 bytes)

julia> @btime sum(abs2, $s);
  12.283 ns (0 allocations: 0 bytes)

julia> @btime mapreduce(x->x^2, +, $s; init=0.5);
  12.399 ns (0 allocations: 0 bytes)

@KristofferC (Contributor, Author)

Arguably the old methods should be kept, and we should choose between the implementations with @static if based on VERSION... Thoughts?
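
Something like the following, roughly (a sketch only; the version cutoff and helper names below are placeholders, not the real bound for the inlining fix):

using StaticArrays

# Old path: stand-in for the fully unrolled reduction.
_unrolled_sum(a::StaticArray) = foldl(+, Tuple(a))

# New path: plain loop, relying on LLVM to unroll/vectorize it.
function _loop_sum(a::StaticArray)
    s = zero(eltype(a))
    @inbounds for x in a
        s += x
    end
    return s
end

# Pick an implementation at load time based on the running Julia version.
@static if VERSION >= v"1.1.0-DEV"   # hypothetical cutoff
    const _sum_impl = _loop_sum
else
    const _sum_impl = _unrolled_sum
end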

@coveralls commented Sep 18, 2018

Coverage decreased (-0.07%) to 94.923% when pulling 5eb628a on KristofferC:kc/loop into 81dd6ca on JuliaArrays:master.

src/linalg.jl (Outdated)
return quote
@_inline_meta
@inbounds return $expr
s = zero(promote_op(*, eltype(a), eltype(b)))
Member:
Pity... I was just deleting code which relies on inference...

Contributor (Author):

I can move this to only be used in the zero length case again.

Member:

Yes, that would be good I think. Will the SIMD loop vectoriser still work if you peel off the first element in the loop, though? (Or maybe there is a better way?)

By the way, this was the one place where the Union{} eltype created errors (because of zero(Union{})).
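
The peeling pattern under discussion would look roughly like this (a sketch, not the PR's code):

using StaticArrays

function peeled_mapreduce(f, op, a::StaticArray)
    @inbounds begin
        # Peel the first element: no neutral element is needed, so zero(Union{})
        # never comes up.
        s = f(a[1])
        # The remaining trip count is still known from the type, so LLVM can
        # still unroll/vectorize this loop.
        for i in 2:length(a)
            s = op(s, f(a[i]))
        end
    end
    return s
end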

@andyferris (Member)

One interesting thing about this: the optimization assumes that StaticArrays have a dense (or at least SIMD-fetchable) layout. I wonder how this works on things like the computed Rotations matrices? (Not that you'd ever want to do any of these operations on those!) I really wish we had traits about layout...

Do you have benchmarks for small vectors and matrices? It's important (to me at least) that we don't slow down any sums or norms or dot products of 2-vectors or 3-vectors, or operations involving 2x2 or 3x3 matrices.

@tkoolen (Contributor) commented Sep 20, 2018

> Do you have benchmarks for small vectors and matrices? It's important (to me at least) that we don't slow down any sums or norms or dot products of 2-vectors or 3-vectors, or operations involving 2x2 or 3x3 matrices.

I was also wondering about this.

@KristofferC (Contributor, Author) commented Sep 20, 2018

using BenchmarkTools
using StaticArrays
using LinearAlgebra
for siz in (1,2,3,4,8)
    println("size = $siz x $siz")
    # Refs to avoid inlining into benchmark loop
    s = Ref(rand(SMatrix{siz, siz}))
    @btime sum(abs2, $s[]);
    @btime sum($s[]);
    @btime mapreduce(x->x^2, +, $s[]; init=0.5);
    @btime dot($s[], $s[])
    @btime norm($s[], 1)
    @btime norm($s[], 2)
    @btime norm($s[], 5)
    @btime det($s[])
end
PR | Master
size = 1 x 1 | size = 1 x 1
1.596 ns | 1.503 ns
1.503 ns | 1.503 ns
1.792 ns | 1.909 ns
1.596 ns | 1.504 ns
1.503 ns | 1.503 ns
4.155 ns | 4.155 ns
59.980 ns | 60.205 ns
1.503 ns | 1.596 ns
size = 2 x 2 | size = 2 x 2
1.707 ns | 1.795 ns
1.712 ns | 1.504 ns
2.037 ns | 1.803 ns
1.803 ns | 2.089 ns
2.047 ns | 2.087 ns
4.715 ns | 4.157 ns
125.824 ns | 124.740 ns
1.503 ns | 1.597 ns
size = 3 x 3 | size = 3 x 3
3.709 ns | 2.373 ns
2.375 ns | 2.693 ns
2.034 ns | 2.539 ns
2.710 ns | 2.979 ns
2.980 ns | 3.073 ns
3.055 ns | 2.988 ns
4.716 ns | 4.163 ns
233.326 ns | 231.020 ns
2.391 ns | 2.393 ns
size = 4 x 4 | size = 4 x 4
2.712 ns | 5.417 ns
2.044 ns | 4.458 ns
3.038 ns | 6.229 ns
6.893 ns | 6.899 ns
2.709 ns | 5.494 ns
4.718 ns | 6.965 ns
385.665 ns | 382.798 ns
14.572 ns | 14.579 ns
size = 8 x 8 | size = 8 x 8
41.797 ns | 132.093 ns (3 allocations: 1.08 KiB)
29.766 ns | 29.393 ns
14.865 ns | 144.587 ns (4 allocations: 1.09 KiB)
47.807 ns | 73.454 ns
30.111 ns | 33.949 ns
48.478 ns | 37.048 ns
1.651 μs | 1.421 μs
351.538 ns | 355.686 ns


Doesn't seem to be an improvement in all cases; I will look at it more closely.

Commits added: "also do linalg", "dont force unroll loop in reductions"
@KristofferC (Contributor, Author) commented Sep 21, 2018

Removed the map! loop since it seems LLVM decided not to unroll the loop, even though it was perhaps advantageous to do so.

@chriselrod (Contributor)

> Removed the map! loop since it seems LLVM decided not to unroll the loop, even though it was perhaps advantageous to do so.

Could you try un-reverting that change and see if it is better now?
Perhaps use @simd ivdep to indicate non-aliasing (via the ivdep)?
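
For reference, the kind of kernel being suggested might look like this (illustrative only; the function name is made up):

using StaticArrays

# Explicit loop with @simd ivdep asserting that dest and src do not alias, which
# lets LLVM vectorize without emitting runtime aliasing checks.
function map_into!(f, dest::MMatrix, src::SMatrix)
    @simd ivdep for i in eachindex(dest, src)
        @inbounds dest[i] = f(src[i])
    end
    return dest
end

x = rand(MMatrix{8,8}); s = rand(SMatrix{8,8});
map_into!(v -> 2v, x, s)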

I would like to see this get merged.

Labels: performance (runtime performance)
Projects: none yet
6 participants