Various functions don't have cutoffs for when to stop unrolling #439

Open
tkoolen opened this issue Jun 9, 2018 · 14 comments
Labels
compile-time-performance design speculative design related issue

Comments

@tkoolen
Contributor

tkoolen commented Jun 9, 2018

From #430 (comment), e.g.:

julia> using StaticArrays

julia> SA = rand(SMatrix{50,50,Complex{Float64}});

julia> @time vecnorm(SA);
 83.789766 seconds (137.97 M allocations: 3.949 GiB, 1.25% gc time)

There are many more, for example map, mapreduce and broadcast.

@andyferris
Member

For some historical context, the general rule of thumb has been that if the generated code is O(length(a)) then we always unroll (remember it's O(length(a)) code just to define the tuple, it takes up O(length(a)) space on the stack, etc.). For things like matrix multiplication we add cutoffs.

When/if we start experimenting with allocation elimination in v1.x, these methods that are O(length(a)) in theory can be replaced with O(1) code via loops and mutation, with unrolling only for small length(a).
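
For illustration, such a cutoff could look roughly like this (hypothetical function name, arbitrary threshold of 16, assumes N ≥ 1; not the package's actual code):

```julia
using StaticArrays

# Sketch of a generated function with an unrolling cutoff (not the package's code).
@generated function sumabs2_sketch(a::StaticVector{N}) where {N}
    if N <= 16
        # small case: fully unrolled expression
        expr = :(abs2(a[1]))
        for i in 2:N
            expr = :($expr + abs2(a[$i]))
        end
        return :(@inbounds $expr)
    else
        # large case: plain loop, let the compiler decide whether to unroll
        return quote
            s = abs2(a[1])
            @inbounds for i in 2:N
                s += abs2(a[i])
            end
            s
        end
    end
end
```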

@tkoolen
Contributor Author

tkoolen commented Jun 11, 2018

I think that with the way _vecnorm is currently implemented, compile time does not scale linearly, but of course it could.

@andyferris
Member

andyferris commented Jun 11, 2018

I didn't see it? Are you referring to the recursive way the expression is constructed here? (Expr is not an isbitstype, and AFAIK interpolation should just be putting a pointer into a new small expression at every iteration, and thus be O(N) though with O(N) allocations and a "linked-list" overall structure).

I guess the compiler itself might not be able to deal with the resulting method in linear time... but in theory at least I guess that it should be able to do so.

@tkoolen
Contributor Author

tkoolen commented Jun 11, 2018

Ah, I guess you're right. I thought that

for j = 2:prod(S)
    expr = :($expr + abs2(a[$j]))
end

would allocate a new Vector containing :+ and its (growing number of) arguments at each iteration (as the args field of an expression), but

expr = :(a + b)
:($expr + c)

becomes :((a + b) + c), not :(a + b + c) as I thought. Maybe the deep expression nesting is a problem for the compiler then?

@andyferris
Member

I’m not sure. The new lowered IR is linear (intermediate expressions are SSA assignments) - I wonder how fast the transformation is? OTOH I think +(1,2,3,4,5,6,7,8) might be slower on v0.6. Even on v0.7 +(...) might take a while to compile (and potentially execute).

@tkoolen
Contributor Author

tkoolen commented Jun 11, 2018

Yeah, changing things so that it actually does become +(1,2,3,4,5,6,7,8) doesn't look like a good idea.

In any case, it seems to me that this should just be a call to mapreduce, after which this discussion still applies to _mapreduce:

expr = :v0
for i = 1:prod(S)
    tmp = [:(a[$j][$i]) for j = 1:length(a)]
    expr = :(op($expr, f($(tmp...))))
end

So given the linear IR in 0.7, am I right in thinking that something more like Base.mapreduce, https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/base/reduce.jl#L192-L197, where a value is updated repeatedly, would be the same thing for the compiler? Anyway, I'll try something like that later.
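
Roughly what I have in mind, as a sketch for a single input array (hypothetical name and signature, not the package's _mapreduce):

```julia
using StaticArrays

# Sketch of the repeated-assignment style: reassign one accumulator
# instead of building a deeply nested expression.
@generated function _mapreduce_sketch(f, op, v0, a::StaticVector{N}) where {N}
    assignments = [:(acc = op(acc, f(a[$i]))) for i in 1:N]
    return quote
        acc = v0
        @inbounds begin
            $(assignments...)
        end
        acc
    end
end
```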

By the way, do you think the scaling done in LinearAlgebra.vecnorm2 (https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/stdlib/LinearAlgebra/src/generic.jl#L298-L322) is also needed in StaticArrays?

@andyferris
Member

So given the linear IR in 0.7, am I right in thinking that something more like Base.mapreduce, https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/base/reduce.jl#L192-L197, where a value is updated repeatedly, would be the same thing for the compiler?

Note that in SSA form, if we unroll that loop, the IR creates a new variable for every assignment. If the binding has the same name in your code, lowering will rename it (note that this has lots of advantages for the compiler - notably variables can now change type without causing type inference failure!). So yes it's the same after lowering.

Loops are different and tend to use something called phi nodes where bindings get "replaced" in every iteration of the loop. We'd have to experiment when a loop is preferable. (Some people would argue we should always use loops and let the compiler / LLVM choose when to unroll.)
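
For instance (toy functions, just for inspecting the IR):

```julia
using StaticArrays

# Same computation, unrolled vs. looped (toy functions for inspecting the IR).
function unrolled(a)
    s = a[1]
    s = s + a[2]
    s = s + a[3]
    return s
end

function looped(a)
    s = a[1]
    for i in 2:3
        s = s + a[i]
    end
    return s
end

# code_typed shows the optimized SSA-form IR: each reassignment of `s` in the
# unrolled version becomes a fresh SSA value, while the looped version typically
# carries `s` across iterations through a phi node.
code_typed(unrolled, Tuple{SVector{3,Float64}})
code_typed(looped, Tuple{SVector{3,Float64}})
```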

By the way, do you think the scaling done in LinearAlgebra.vecnorm2 (https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/stdlib/LinearAlgebra/src/generic.jl#L298-L322) is also needed in StaticArrays?

Possibly? For my geometry work, I'm not sure it would be worth the overhead for me, but it really depends on the use case. This is one of the trickiest parts of StaticArrays - knowing what precision to keep vs. speed. Generally we've let Base and LinearAlgebra be precise and StaticArrays be fast. We could switch to precise by default and start using the @fastmath macro for speed, if people felt strongly about it.
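
For reference, the scaling being discussed is roughly this (a sketch only, not LinearAlgebra's exact code):

```julia
using StaticArrays

# Sketch of the scaling idea: dividing by the largest magnitude first
# avoids overflow/underflow inside abs2.
function scaled_norm2_sketch(a::StaticVector)
    scale = maximum(abs, a)
    iszero(scale) && return float(scale)
    return scale * sqrt(sum(x -> abs2(x / scale), a))
end

scaled_norm2_sketch(SVector(1e300, 1e300))  # ≈ 1.414e300, whereas sqrt(sum(abs2, ...)) overflows to Inf
```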

@tkoolen
Contributor Author

tkoolen commented Jun 12, 2018

Generally we've let Base and LinearAlgebra be precise and StaticArrays be fast

Yep, sounds good.

if people felt strongly about it

I definitely don't.

@andreasnoack
Member

There have been various offline discussions of the latency issues with StaticArrays, and the conclusion is essentially this issue, i.e. that the unrolling makes the code so big that it can be very slow for LLVM to consume. As shown in #631, the latency issues might be significant with Float64 elements, but the (manual) unrolling strategy really falls short for larger element types. In our application we mix StaticArrays with ForwardDiff and that completely invalidates any heuristics used here for when to unroll. An example:

julia> tnt = map(1:6) do n
           (
               n = n,
               ctime = let t = time()
                   x = randn(n^2)
                   v = ForwardDiff.hessian(x) do _x
                       _n = isqrt(length(_x))
                       A = SMatrix{_n,_n}(_x...)
                       y = @SVector randn(_n)
                       return y'*(lu(A'*A)\y)
                   end
                   time() - t
               end
           )
           end
6-element Vector{NamedTuple{(:n, :ctime), Tuple{Int64, Float64}}}:
 (n = 1, ctime = 0.9155991077423096)
 (n = 2, ctime = 1.306809902191162)
 (n = 3, ctime = 5.09375)
 (n = 4, ctime = 8.366412162780762)
 (n = 5, ctime = 23.909090995788574)
 (n = 6, ctime = 130.0966489315033)

(I have a suspicion that lu is one of the functions particularly prone to this)

Our application doesn't stop at 6 or at second-order derivatives, so we have to spend a lot of time on reducing chunk sizes. That might be unavoidable, but it would certainly help if the compiler had a chance to decide when to unroll. Hopefully it would be possible to completely drop the manual unrolling here (or at least limit it to methods restricted to Float32/Float64 elements) and instead switch to a loop-based approach via MArray. The compiler might not yet be ready for this, but I think it would be good to start exploring this and give examples where the compiler produces inefficient code. We will probably be able to help with such rewriting of the library if the approach is acceptable.

@mateuszbaran
Collaborator

Improving that specific case wouldn't be too hard. Matrix multiplication already has fallbacks with different levels of unrolling:

@generated function _mul(Sa::Size{sa}, Sb::Size{sb}, a::StaticMatMulLike{<:Any, <:Any, Ta}, b::StaticMatMulLike{<:Any, <:Any, Tb}) where {sa, sb, Ta, Tb}
So very likely it would be enough to change sa[1] there to something like 8*sa[1]/sizeof(Ta) etc., and limit unrolling to isbits types.

Similarly lu and matrix division could just call the generic implementations for large matrices.
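
E.g. a helper along these lines (hypothetical, not an existing function in the package; the byte budget is arbitrary):

```julia
# Hypothetical helper: gate unrolling on the total data size rather than the
# dimension alone, and never unroll non-isbits element types.
function should_unroll(::Type{T}, n::Int; budget_bytes::Int = 64) where {T}
    isbitstype(T) || return false
    return n * sizeof(T) <= budget_bytes
end

should_unroll(Float64, 8)            # true:  64 bytes
should_unroll(Complex{Float64}, 8)   # false: 128 bytes
should_unroll(NTuple{10,Float64}, 4) # false: 320 bytes
```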

I could review such changes.

@andyferris
Member

andyferris commented May 24, 2022

Yes.

To be clear the current generated code was only ever meant to be the first cut, suitable for things like the 3-vectors and 3x3 matrices I was using at the time. All the functions should have fallbacks like matrix multiply does (which obviously scales worse so was more important to cover early on).

I haven’t looked into what’s possible now with mutating vs non-mutating approaches etc on the current compiler. Hopefully though you can keep say 10 to 100 values (especially Union{Missing, T}’s) on the stack and iterate over them (both reading and writing) and output an immutable vector with efficient codegen (small code, no extra allocations, usage of SIMD, etc). It’s been quite straightforward since forever to do such things with MVector but no one has put in the time. Personally I’m really hoping someone with the interest and time can dig in and find the best modern approach (I know it won’t be me).

@andyferris
Member

The one thing I wish Julia had to support this nicely is a mutable version of tuple. The escape analysis and codegen already work perfectly well, I'd just want conversions between mutable and immutable tuples to be no-ops when appropriate (as you want to store your data "flat" in an array without extra indirection, typically).

@KristofferC
Contributor

The escape analysis and codegen already work perfectly well, I’d just want conversions between mutable and immutable tuples to be no-ops when appropriate (as you want to store your data “flat” in an array without extra indirection, typically).

Converting to an MArray, mutating, and then converting back usually works well and does not allocate, as long as things get inlined properly.
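
E.g. (a toy cumulative sum, just to show the pattern; not a StaticArrays API):

```julia
using StaticArrays

# Convert-mutate-convert: copy into a mutable stack buffer, mutate in place,
# convert back to an SVector.
@inline function running_sum(v::SVector{N,T}) where {N,T}
    m = MVector{N,T}(v)       # mutable copy
    @inbounds for i in 2:N
        m[i] += m[i-1]        # in-place mutation
    end
    return SVector{N,T}(m)    # back to immutable; usually no heap allocation when inlined
end

running_sum(SVector(1.0, 2.0, 3.0))  # 3-element SVector: 1.0, 3.0, 6.0
```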

@chriselrod
Contributor

Except that there is a compiler issue currently where the conversions are not no-ops, but actually involve separate (stack) allocations and a fully unrolled load/store between them.
