Various functions don't have cutoffs for when to stop unrolling #439
For some historical context, the general rule of thumb has been: if the generated code is O(… When/if we start experimenting with allocation elimination in v1.x, these O(…
I think the way …
I didn't see it. Are you referring to the recursive way the expression is constructed here? I guess the compiler itself might not be able to deal with the resulting method in linear time, but in theory at least it should be able to do so.
Ah, I guess you're right. I thought that lines 287 to 289 (in 7ddb8e4) would allocate a new `Expr` at every step; e.g. with `expr = :(a + b)`, `:($expr + c)` becomes `:((a + b) + c)`.
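That recursive construction can be sketched in isolation (the variable names here are illustrative, not the actual generated-function internals):

```julia
# Minimal sketch: building a reduction expression by repeated splicing.
# Each step allocates a fresh Expr node wrapping the previous one, so the
# final expression is a left-nested tree of depth N.
expr = :(a[1])
for i in 2:4
    global expr = :($expr + a[$i])   # splice the previous expr into a new one
end
# expr is now :(((a[1] + a[2]) + a[3]) + a[4])
```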
I’m not sure. The new lowered IR is linear (intermediate expressions are SSA assignments); I wonder how fast the transformation is? OTOH I think …
Yeah, changing things so that it actually does become … In any case, it seems to me like this should just be a call to StaticArrays.jl/src/mapreduce.jl, lines 92 to 96 (in 7ddb8e4).
So given the linear IR in 0.7, am I right in thinking that something more like …

By the way, do you think the scaling done in …
Note that in SSA form, if we unroll that loop, the IR creates a new variable for every assignment. If the binding has the same name in your code, lowering will rename it (note that this has lots of advantages for the compiler - notably variables can now change type without causing type inference failure!). So yes it's the same after lowering. Loops are different and tend to use something called phi nodes where bindings get "replaced" in every iteration of the loop. We'd have to experiment when a loop is preferable. (Some people would argue we should always use loops and let the compiler / LLVM choose when to unroll.)
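As a concrete (hypothetical) comparison, here is a fully unrolled reduction built with a generated function next to a plain loop over a tuple; the function names are mine, not StaticArrays API. After lowering, the former is a straight chain of SSA assignments, while the latter lowers to a loop with phi nodes:

```julia
# Sketch: an unrolled sum built at compile time via a generated function.
# For NTuple{4}, the returned expression is :(((t[1] + t[2]) + t[3]) + t[4]),
# which lowers to a linear chain of SSA assignments.
@generated function unrolled_sum(t::NTuple{N,T}) where {N,T}
    ex = :(t[1])
    for i in 2:N
        ex = :($ex + t[$i])
    end
    return ex
end

# The loop version: lowering introduces phi nodes for `s` and `i`, and
# LLVM can still decide to unroll it for small, known N.
function loop_sum(t::NTuple{N,T}) where {N,T}
    s = t[1]
    for i in 2:N
        s += t[i]
    end
    return s
end
```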
Possibly? For my geometry work, I'm not sure it would be worth the overhead for me, but it really depends on the use case. This is one of the trickiest parts of StaticArrays: knowing what precision to keep vs. speed. Generally we've let …
Yep, sounds good.
I definitely don't.
There have been various off-line discussions of the latency issues with StaticArrays, and the conclusion is essentially this issue, i.e. that the unrolling makes the code so big that it can be very slow for LLVM to consume. As shown in #631, the latency issues might be significant:

```julia
julia> tnt = map(1:6) do n
           (
               n = n,
               ctime = let t = time()
                   x = randn(n^2)
                   v = ForwardDiff.hessian(x) do _x
                       _n = isqrt(length(_x))
                       A = SMatrix{_n,_n}(_x...)
                       y = @SVector randn(_n)
                       return y'*(lu(A'*A)\y)
                   end
                   time() - t
               end
           )
       end
6-element Vector{NamedTuple{(:n, :ctime), Tuple{Int64, Float64}}}:
 (n = 1, ctime = 0.9155991077423096)
 (n = 2, ctime = 1.306809902191162)
 (n = 3, ctime = 5.09375)
 (n = 4, ctime = 8.366412162780762)
 (n = 5, ctime = 23.909090995788574)
 (n = 6, ctime = 130.0966489315033)
```

(I have a suspicion that …) Our application doesn't stop at …
Improving that specific case wouldn't be too hard. Matrix multiplication already has fallbacks with different levels of unrolling (StaticArrays.jl/src/matrix_multiply.jl, line 130 in 8ca11f8). One could change the `sa[1]` cutoff there to something like `8*sa[1]/sizeof(Ta)` etc., and limit unrolling to isbits types.
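A cutoff along those lines might look like the following sketch; the function name and the 64-byte threshold are illustrative assumptions, not StaticArrays internals:

```julia
# Hypothetical sketch of a byte-size cutoff for unrolling: only unroll when
# the element type is isbits and the total payload is small. The 64-byte
# threshold is an illustrative assumption, not a StaticArrays constant.
function should_unroll(::Type{T}, n::Integer) where {T}
    isbitstype(T) || return false   # never unroll for non-isbits elements
    return n * sizeof(T) <= 64      # fall back to a loop for larger data
end
```

A dispatch layer could then select the `@generated` unrolled method or a loop-based fallback based on this predicate.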
Similarly, I could review such changes.
Yes. To be clear, the current generated code was only ever meant to be a first cut, suitable for things like the 3-vectors and 3x3 matrices I was using at the time. All the functions should have fallbacks like matrix multiply does (which obviously scales worse, so was more important to cover early on). I haven’t looked into what’s possible now with mutating vs. non-mutating approaches etc. on the current compiler. Hopefully, though, you can keep say 10 to 100 values (especially …
The one thing I wish Julia had to support this nicely is a mutable version of tuple. The escape analysis and codegen already work perfectly well; I’d just want conversions between mutable and immutable tuples to be no-ops when appropriate (since you typically want to store your data “flat” in an array without extra indirection).
Converting to an …
Except that there is currently a compiler issue where the conversions are not no-ops, but actually involve separate (stack) allocations and a fully unrolled load/store between them.
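The pattern being discussed can be sketched with a plain `Vector` standing in for the hoped-for mutable tuple (all names here are illustrative); with a true mutable tuple, both conversions could ideally compile to no-ops:

```julia
# Sketch of the "mutate, then freeze" pattern: stage results in a mutable
# buffer, then convert back to an immutable tuple. A Vector is used here as
# an illustrative stand-in for a mutable fixed-size buffer; it heap-allocates,
# which is exactly the cost a mutable tuple with no-op conversions would avoid.
function add_tuples(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T}
    buf = Vector{T}(undef, N)        # mutable staging buffer
    for i in 1:N
        buf[i] = a[i] + b[i]         # mutate in place
    end
    return ntuple(i -> buf[i], N)    # "freeze" back into an immutable tuple
end
```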
From #430 (comment), e.g.:
There are many more, for example `map`, `mapreduce`, and `broadcast`.