
Don't force loop unrolling in reductions #494

Open

KristofferC wants to merge 2 commits into JuliaArrays:master from KristofferC:kc/loop

Conversation

@KristofferC (Contributor) commented Sep 13, 2018

Instead of unrolling the whole loop in the reduction, we can just keep the loop and let LLVM do its job with loop unrolling (since the number of loop iterations is known).
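
Roughly, the difference is between these two code-generation strategies; the following is only a sketch to illustrate the idea, not the actual _mapreduce code in the package:

using StaticArrays

# Fully unrolled (old approach): a generated function splices one `+` per element
# into the body, producing straight-line code.
@generated function unrolled_sum(a::StaticVector{N}) where {N}
    expr = :(a[1])
    for i in 2:N
        expr = :($expr + a[$i])
    end
    return :(@inbounds $expr)
end

# Loop-based (this PR's approach, sketched): keep an ordinary loop. The trip count
# is known from the type, so LLVM's loop unroller/vectorizer can handle it.
# (Assumes at least one element.)
function loop_sum(a::StaticVector)
    s = a[1]
    @inbounds for i in 2:length(a)
        s += a[i]
    end
    return s
end

Both versions still specialize on the static size; the difference is only in what shape of code the LLVM optimizer gets to see.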

Calling the inner _mapreduce function directly, we can see that for larger matrices this can give a good speedup:

f(s) =  StaticArrays._mapreduce(identity, +, Colon(), NamedTuple(), StaticArrays.same_size(s), s)
s = rand(SMatrix{8,8})

After PR:

julia> @btime f($s)
  10.179 ns (0 allocations: 0 bytes)

Before PR:

julia> @btime f($s)
  30.229 ns (0 allocations: 0 bytes)

The reason for this is that the loop vectorizer has a much easier time with the loop than the SLP vectorizer has with the unrolled code.
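
(One way to check this, with f and s defined as above, is to inspect the generated IR and look for <N x double> vector instructions:)

julia> @code_llvm f(s)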

Now, why am I calling the inner function directly? Well, for some reason, Julia refuses to inline this new function. Instead, we get:

julia> @code_typed sum(s)
CodeInfo(
181 1 ─ %1 = StaticArrays.:+::typeof(+)                                          │╻  #sum#255
    │   %2 = invoke #kw##mapreduce()($(QuoteNode((dims = Colon(),)))::NamedTuple{(:dims,),Tuple{Colon}}, StaticArrays.mapreduce::typeof(mapreduce), StaticArrays.identity::Function, %1::Function, _2::SArray{Tuple{8,8},Float64,2,64})::Float64
    └──      return %2                                                                │
) => Float64

and the result is not pretty:

julia> @btime sum($s)
  121.209 ns (3 allocations: 1.08 KiB)

But if we can fix that inlining problem, this is probably worth doing.

@KristofferC (Contributor, Author) commented Sep 13, 2018

Tests fail because reductions now use SIMD and the tests check ==.

@andyferris (Member)

> Tests fail because reductions now use SIMD and the tests check ==.

Happy to switch to isapprox with an appropriate tolerance.
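
For example, something along these lines (an illustrative sketch, not the actual test suite):

using Test, StaticArrays

sa = rand(SMatrix{8,8})
# A SIMD reduction may reassociate floating-point additions, so compare against
# the Base result with a tolerance rather than with exact equality.
@test sum(sa) ≈ sum(Matrix(sa))
@test sum(abs2, sa) ≈ sum(abs2, Matrix(sa))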

@KristofferC (Contributor, Author)

Inlining is fixed by JuliaLang/julia#29258 so now there is a reason to finish this up :)

@KristofferC (Contributor, Author)

Updated.

Some benchmarks, run on JuliaLang/julia#29258:

using BenchmarkTools
using StaticArrays
x = rand(MMatrix{8,8}); s = rand(SMatrix{8,8});

Before

julia> @btime map!(x -> x*2, $x, $s);
  11.056 ns (0 allocations: 0 bytes)

julia> @btime sum($s);
  29.363 ns (0 allocations: 0 bytes)

julia> @btime sum(abs2, $s);
  124.234 ns (3 allocations: 1.08 KiB)

julia> @btime mapreduce(x->x^2, +, $s; init=0.5);
  134.850 ns (4 allocations: 1.09 KiB)

After

julia> @btime map!(x -> x*2, $x, $s);
  12.783 ns (0 allocations: 0 bytes)

julia> @btime sum($s);
  12.906 ns (0 allocations: 0 bytes)

julia> @btime sum(abs2, $s);
  12.283 ns (0 allocations: 0 bytes)

julia> @btime mapreduce(x->x^2, +, $s; init=0.5);
  12.399 ns (0 allocations: 0 bytes)

@KristofferC (Contributor, Author)

Arguably the old methods should be kept, and we should choose between the implementations with @static if based on VERSION... Thoughts?
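
Something like the following, roughly (a sketch only; the version cutoff and helper names below are placeholders, not the real bound for the inlining fix):

using StaticArrays

# Old path: stand-in for the fully unrolled reduction.
_unrolled_sum(a::StaticArray) = foldl(+, Tuple(a))

# New path: plain loop, relying on LLVM to unroll/vectorize it.
function _loop_sum(a::StaticArray)
    s = zero(eltype(a))
    @inbounds for x in a
        s += x
    end
    return s
end

# Pick an implementation at load time based on the running Julia version.
@static if VERSION >= v"1.1.0-DEV"   # hypothetical cutoff
    const _sum_impl = _loop_sum
else
    const _sum_impl = _unrolled_sum
end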

@coveralls commented Sep 18, 2018

Coverage decreased (-0.07%) to 94.923% when pulling 5eb628a on KristofferC:kc/loop into 81dd6ca on JuliaArrays:master.

src/linalg.jl (Outdated)
return quote
@_inline_meta
@inbounds return $expr
s = zero(promote_op(*, eltype(a), eltype(b)))
Member:
Pity... I was just deleting code which relies on inference...

Contributor (Author):

I can move this to only be used in the zero length case again.

Member:

Yes, that would be good I think. Will the SIMD loop vectoriser still work if you peel off the first element in the loop, though? (Or maybe there is a better way?)

By the way, this was the one place where the Union{} eltype created errors (because of zero(Union{})).
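
The peeling pattern under discussion would look roughly like this (a sketch, not the PR's code):

using StaticArrays

function peeled_mapreduce(f, op, a::StaticArray)
    @inbounds begin
        # Peel the first element: no neutral element is needed, so zero(Union{})
        # never comes up.
        s = f(a[1])
        # The remaining trip count is still known from the type, so LLVM can
        # still unroll/vectorize this loop.
        for i in 2:length(a)
            s = op(s, f(a[i]))
        end
    end
    return s
end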

@andyferris (Member)

One interesting thing about this: the optimization assumes that StaticArrays have a dense (or at least SIMD-fetchable) layout. I wonder how this works on things like the computed Rotations matrices? (Not that you'd ever want to do any of these operations on those!) I really wish we had traits about layout...

Do you have benchmarks for small vectors and matrices? It's important (to me at least) that we don't slow down any sums or norms or dot products of 2-vectors or 3-vectors, or operations involving 2x2 or 3x3 matrices.

@tkoolen (Contributor) commented Sep 20, 2018

> Do you have benchmarks for small vectors and matrices? It's important (to me at least) that we don't slow down any sums or norms or dot products of 2-vectors or 3-vectors, or operations involving 2x2 or 3x3 matrices.

I was also wondering about this.

@KristofferC (Contributor, Author) commented Sep 20, 2018

using BenchmarkTools
using StaticArrays
using LinearAlgebra
for siz in (1,2,3,4,8)
    println("size = $siz x $siz")
    # Refs to avoid inlining into benchmark loop
    s = Ref(rand(SMatrix{siz, siz}))
    @btime sum(abs2, $s[]);
    @btime sum($s[]);
    @btime mapreduce(x->x^2, +, $s[]; init=0.5);
    @btime dot($s[], $s[])
    @btime norm($s[], 1)
    @btime norm($s[], 2)
    @btime norm($s[], 5)
    @btime det($s[])
end
PR | Master
size = 1 x 1 | size = 1 x 1
1.596 ns | 1.503 ns
1.503 ns | 1.503 ns
1.792 ns | 1.909 ns
1.596 ns | 1.504 ns
1.503 ns | 1.503 ns
4.155 ns | 4.155 ns
59.980 ns | 60.205 ns
1.503 ns | 1.596 ns
size = 2 x 2 | size = 2 x 2
1.707 ns | 1.795 ns
1.712 ns | 1.504 ns
2.037 ns | 1.803 ns
1.803 ns | 2.089 ns
2.047 ns | 2.087 ns
4.715 ns | 4.157 ns
125.824 ns | 124.740 ns
1.503 ns | 1.597 ns
size = 3 x 3 | size = 3 x 3
3.709 ns | 2.373 ns
2.375 ns | 2.693 ns
2.034 ns | 2.539 ns
2.710 ns | 2.979 ns
2.980 ns | 3.073 ns
3.055 ns | 2.988 ns
4.716 ns | 4.163 ns
233.326 ns | 231.020 ns
2.391 ns | 2.393 ns
size = 4 x 4 | size = 4 x 4
2.712 ns | 5.417 ns
2.044 ns | 4.458 ns
3.038 ns | 6.229 ns
6.893 ns | 6.899 ns
2.709 ns | 5.494 ns
4.718 ns | 6.965 ns
385.665 ns | 382.798 ns
14.572 ns | 14.579 ns
size = 8 x 8 | size = 8 x 8
41.797 ns | 132.093 ns (3 allocations: 1.08 KiB)
29.766 ns | 29.393 ns
14.865 ns | 144.587 ns (4 allocations: 1.09 KiB)
47.807 ns | 73.454 ns
30.111 ns | 33.949 ns
48.478 ns | 37.048 ns
1.651 μs | 1.421 μs
351.538 ns | 355.686 ns


Doesn't seem to be an improvement in all cases; I will look at it more closely.

Commits added: "also do linalg", "dont force unroll loop in reductions"
@KristofferC (Contributor, Author) commented Sep 21, 2018

Removed the map! loop since it seems LLVM decided not to unroll the loop, even though it was perhaps advantageous to do so.

@chriselrod (Contributor)

> Removed the map! loop since it seems LLVM decided not to unroll the loop, even though it was perhaps advantageous to do so.

Could you try un-reverting that change and see if it is better now?
Perhaps use @simd ivdep to indicate non-aliasing (via the ivdep)?
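
For reference, the kind of kernel being suggested might look like this (illustrative only; the function name is made up):

using StaticArrays

# Explicit loop with @simd ivdep asserting that dest and src do not alias, which
# lets LLVM vectorize without emitting runtime aliasing checks.
function map_into!(f, dest::MMatrix, src::SMatrix)
    @simd ivdep for i in eachindex(dest, src)
        @inbounds dest[i] = f(src[i])
    end
    return dest
end

x = rand(MMatrix{8,8}); s = rand(SMatrix{8,8});
map_into!(v -> 2v, x, s)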

I would like to see this get merged.

Labels: performance (runtime performance)
Projects: none yet
6 participants