Various functions don't have cutoffs for when to stop unrolling #439

Open
tkoolen opened this issue Jun 9, 2018 · 14 comments
Labels
compile-time-performance design speculative design related issue

Comments

@tkoolen
Contributor

tkoolen commented Jun 9, 2018

From #430 (comment), e.g.:

julia> using StaticArrays

julia> SA = rand(SMatrix{50,50,Complex{Float64}});

julia> @time vecnorm(SA);
 83.789766 seconds (137.97 M allocations: 3.949 GiB, 1.25% gc time)

There are many more, for example map, mapreduce and broadcast.

@andyferris
Member

For some historical context, the general rule of thumb has been that if the generated code is O(length(a)) then we always unroll (remember it's O(length(a)) code just to define the tuple, it takes up O(length(a)) space on the stack, etc.). For things like matrix multiplication we add cutoffs.

When/if we start experimenting with allocation elimination in v1.x, these methods that are O(length(a)) in theory can be replaced with O(1) code via loops and mutation, with unrolling only for small length(a).
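
For illustration, such a cutoff could look roughly like this (hypothetical function name, arbitrary threshold of 16, assumes N ≥ 1; not the package's actual code):

```julia
using StaticArrays

# Sketch of a generated function with an unrolling cutoff (not the package's code).
@generated function sumabs2_sketch(a::StaticVector{N}) where {N}
    if N <= 16
        # small case: fully unrolled expression
        expr = :(abs2(a[1]))
        for i in 2:N
            expr = :($expr + abs2(a[$i]))
        end
        return :(@inbounds $expr)
    else
        # large case: plain loop, let the compiler decide whether to unroll
        return quote
            s = abs2(a[1])
            @inbounds for i in 2:N
                s += abs2(a[i])
            end
            s
        end
    end
end
```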

@tkoolen
Contributor Author

tkoolen commented Jun 11, 2018

I think that with the way _vecnorm is currently implemented, compile time does not scale linearly, but of course it could.

@andyferris
Member

andyferris commented Jun 11, 2018

I didn't see it? Are you referring to the recursive way the expression is constructed here? (Expr is not an isbitstype, and AFAIK interpolation should just be putting a pointer into a new small expression at every iteration, and thus be O(N) though with O(N) allocations and a "linked-list" overall structure).

I guess the compiler itself might not be able to deal with the resulting method in linear time... but in theory at least I guess that it should be able to do so.

@tkoolen
Contributor Author

tkoolen commented Jun 11, 2018

Ah, I guess you're right. I thought that

for j = 2:prod(S)
    expr = :($expr + abs2(a[$j]))
end

would allocate a new Vector containing :+ and its (growing number of) arguments at each iteration (as the args field of an expression), but

expr = :(a + b)
:($expr + c)

becomes :((a + b) + c), not :(a + b + c) as I thought. Maybe the deep expression nesting is a problem for the compiler then?

@andyferris
Member

I’m not sure. The new lowered IR is linear (intermediate expressions are SSA assignments) - I wonder how fast the transformation is? OTOH I think +(1,2,3,4,5,6,7,8) might be slower on v0.6. Even on v0.7 +(...) might take a while to compile (and potentially execute).

@tkoolen
Contributor Author

tkoolen commented Jun 11, 2018

Yeah, changing things so that it actually does become +(1,2,3,4,5,6,7,8) doesn't look like a good idea.

In any case, it seems to me that this should just be a call to mapreduce, after which this discussion still applies to _mapreduce:

expr = :v0
for i = 1:prod(S)
    tmp = [:(a[$j][$i]) for j = 1:length(a)]
    expr = :(op($expr, f($(tmp...))))
end

So given the linear IR in 0.7, am I right in thinking that something more like Base.mapreduce, https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/base/reduce.jl#L192-L197, where a value is updated repeatedly, would be the same thing for the compiler? Anyway, I'll try something like that later.
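
Roughly what I have in mind, as a sketch for a single input array (hypothetical name and signature, not the package's _mapreduce):

```julia
using StaticArrays

# Sketch of the repeated-assignment style: reassign one accumulator
# instead of building a deeply nested expression.
@generated function _mapreduce_sketch(f, op, v0, a::StaticVector{N}) where {N}
    assignments = [:(acc = op(acc, f(a[$i]))) for i in 1:N]
    return quote
        acc = v0
        @inbounds begin
            $(assignments...)
        end
        acc
    end
end
```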

By the way, do you think the scaling done in LinearAlgebra.vecnorm2 (https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/stdlib/LinearAlgebra/src/generic.jl#L298-L322) is also needed in StaticArrays?

@andyferris
Member

So given the linear IR in 0.7, am I right in thinking that something more like Base.mapreduce, https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/base/reduce.jl#L192-L197, where a value is updated repeatedly, would be the same thing for the compiler?

Note that in SSA form, if we unroll that loop, the IR creates a new variable for every assignment. If the binding has the same name in your code, lowering will rename it (note that this has lots of advantages for the compiler - notably variables can now change type without causing type inference failure!). So yes it's the same after lowering.

Loops are different and tend to use something called phi nodes where bindings get "replaced" in every iteration of the loop. We'd have to experiment when a loop is preferable. (Some people would argue we should always use loops and let the compiler / LLVM choose when to unroll.)
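
For instance (toy functions, just for inspecting the IR):

```julia
using StaticArrays

# Same computation, unrolled vs. looped (toy functions for inspecting the IR).
function unrolled(a)
    s = a[1]
    s = s + a[2]
    s = s + a[3]
    return s
end

function looped(a)
    s = a[1]
    for i in 2:3
        s = s + a[i]
    end
    return s
end

# code_typed shows the optimized SSA-form IR: each reassignment of `s` in the
# unrolled version becomes a fresh SSA value, while the looped version typically
# carries `s` across iterations through a phi node.
code_typed(unrolled, Tuple{SVector{3,Float64}})
code_typed(looped, Tuple{SVector{3,Float64}})
```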

By the way, do you think the scaling done in LinearAlgebra.vecnorm2 (https://github.com/JuliaLang/julia/blob/d555a9a3874f4c47e16d1fdcbae4866cc0d3917f/stdlib/LinearAlgebra/src/generic.jl#L298-L322) is also needed in StaticArrays?

Possibly? For my geometry work, I'm not sure it would be worth the overhead for me, but it really depends on the use case. This is one of the trickiest parts of StaticArrays - knowing what precision to keep vs. speed. Generally we've let Base and LinearAlgebra be precise and StaticArrays be fast. We could switch to precise by default and start using the @fastmath macro for speed, if people felt strongly about it.
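
For reference, the scaling being discussed is roughly this (a sketch only, not LinearAlgebra's exact code):

```julia
using StaticArrays

# Sketch of the scaling idea: dividing by the largest magnitude first
# avoids overflow/underflow inside abs2.
function scaled_norm2_sketch(a::StaticVector)
    scale = maximum(abs, a)
    iszero(scale) && return float(scale)
    return scale * sqrt(sum(x -> abs2(x / scale), a))
end

scaled_norm2_sketch(SVector(1e300, 1e300))  # ≈ 1.414e300, whereas sqrt(sum(abs2, ...)) overflows to Inf
```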

@tkoolen
Contributor Author

tkoolen commented Jun 12, 2018

Generally we've let Base and LinearAlgebra be precise and StaticArrays be fast

Yep, sounds good.

if people felt strongly about it

I definitely don't.

@andreasnoack
Member

There have been various offline discussions of the latency issues with StaticArrays, and the conclusion is essentially this issue, i.e. that the unrolling makes the code so big that it can be very slow for LLVM to consume. As shown in #631, the latency issues might be significant with Float64 elements, but the (manual) unrolling strategy really falls short for larger element types. In our application we mix StaticArrays with ForwardDiff and that completely invalidates any heuristics used here for when to unroll. An example:

julia> tnt = map(1:6) do n
           (
               n = n,
               ctime = let t = time()
                   x = randn(n^2)
                   v = ForwardDiff.hessian(x) do _x
                       _n = isqrt(length(_x))
                       A = SMatrix{_n,_n}(_x...)
                       y = @SVector randn(_n)
                       return y'*(lu(A'*A)\y)
                   end
                   time() - t
               end
           )
           end
6-element Vector{NamedTuple{(:n, :ctime), Tuple{Int64, Float64}}}:
 (n = 1, ctime = 0.9155991077423096)
 (n = 2, ctime = 1.306809902191162)
 (n = 3, ctime = 5.09375)
 (n = 4, ctime = 8.366412162780762)
 (n = 5, ctime = 23.909090995788574)
 (n = 6, ctime = 130.0966489315033)

(I have a suspicion that lu is one of the functions particularly prone to this)

Our application doesn't stop at 6 or at second-order derivatives, so we have to spend a lot of time on reducing chunk sizes. That might be unavoidable, but it would certainly help if the compiler had a chance to decide when to unroll. Hopefully it would be possible to completely drop the manual unrolling here (or at least limit it to methods restricted to Float32/Float64 elements) and instead switch to a loop-based approach via MArray. The compiler might not yet be ready for this, but I think it would be good to start exploring this and give examples where the compiler produces inefficient code. We will probably be able to help with such rewriting of the library if the approach is acceptable.

@mateuszbaran
Collaborator

Improving that specific case wouldn't be too hard. Matrix multiplication already has fallbacks with different levels of unrolling:

@generated function _mul(Sa::Size{sa}, Sb::Size{sb}, a::StaticMatMulLike{<:Any, <:Any, Ta}, b::StaticMatMulLike{<:Any, <:Any, Tb}) where {sa, sb, Ta, Tb}
So very likely it would be enough to change sa[1] there to something like 8*sa[1]/sizeof(Ta) etc., and limit unrolling to isbits types.

Similarly lu and matrix division could just call the generic implementations for large matrices.
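
E.g. a helper along these lines (hypothetical, not an existing function in the package; the byte budget is arbitrary):

```julia
# Hypothetical helper: gate unrolling on the total data size rather than the
# dimension alone, and never unroll non-isbits element types.
function should_unroll(::Type{T}, n::Int; budget_bytes::Int = 64) where {T}
    isbitstype(T) || return false
    return n * sizeof(T) <= budget_bytes
end

should_unroll(Float64, 8)            # true:  64 bytes
should_unroll(Complex{Float64}, 8)   # false: 128 bytes
should_unroll(NTuple{10,Float64}, 4) # false: 320 bytes
```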

I could review such changes.

@andyferris
Member

andyferris commented May 24, 2022

Yes.

To be clear the current generated code was only ever meant to be the first cut, suitable for things like the 3-vectors and 3x3 matrices I was using at the time. All the functions should have fallbacks like matrix multiply does (which obviously scales worse so was more important to cover early on).

I haven’t looked into what’s possible now with mutating vs non-mutating approaches etc on the current compiler. Hopefully though you can keep say 10 to 100 values (especially Union{Missing, T}’s) on the stack and iterate over them (both reading and writing) and output an immutable vector with efficient codegen (small code, no extra allocations, usage of SIMD, etc). It’s been quite straightforward since forever to do such things with MVector but no one has put in the time. Personally I’m really hoping someone with the interest and time can dig in and find the best modern approach (I know it won’t be me).

@andyferris
Member

The one thing I wish Julia had to support this nicely is a mutable version of tuple. The escape analysis and codegen already work perfectly well, I'd just want conversions between mutable and immutable tuples to be no-ops when appropriate (as you want to store your data "flat" in an array without extra indirection, typically).

@KristofferC
Contributor

The escape analysis and codegen already work perfectly well, I’d just want conversions between mutable and immutable tuples to be no-ops when appropriate (as you want to store your data “flat” in an array without extra indirection, typically).

Converting to an MArray, mutating, and then converting back usually works well and does not allocate, as long as things get inlined properly.
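
E.g. (a toy cumulative sum, just to show the pattern; not a StaticArrays API):

```julia
using StaticArrays

# Convert-mutate-convert: copy into a mutable stack buffer, mutate in place,
# convert back to an SVector.
@inline function running_sum(v::SVector{N,T}) where {N,T}
    m = MVector{N,T}(v)       # mutable copy
    @inbounds for i in 2:N
        m[i] += m[i-1]        # in-place mutation
    end
    return SVector{N,T}(m)    # back to immutable; usually no heap allocation when inlined
end

running_sum(SVector(1.0, 2.0, 3.0))  # 3-element SVector: 1.0, 3.0, 6.0
```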

@chriselrod
Contributor

Except that there is a compiler issue currently where the conversions are not no-ops, but actually involve separate (stack) allocations and a fully unrolled load/store between them.
