Performance issue (possible codegen bug?) #21966
Comments
For me, the timing becomes the same when I force the compiler not to inline. As for whether the compiler (both gcc and clang) makes the right default choice about the unrolling in the loop, I think that's an llvm/gcc issue.
It's not immediately clear to me why inlining makes such a difference here. I'd love to figure out how to reproduce the optimizations seen in the C++ code from Julia. Not to overstate the importance of a single benchmark, but I have seen a number of similar cases where a 2-3x performance difference in favor of C++ appears to result almost solely from decisions made on loop unrolling, and I don't think this is an isolated example. For me, it's currently a decision about whether to rewrite parts of my codebase in C++ for performance, so if there's anything I can do to further investigate this issue, I'm quite motivated to do so.
C++ can see the constants (after inlining). I think you'll probably see similar performance if you put the constant initialization and the computation into the same function (or use …).
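For illustration, a toy sketch of that suggestion (not code from this issue; the function names and the D2Q9-style weights below are placeholders): when the constant table is built in the same function that runs the loop, its values are visible to the optimizer, unlike when it arrives as an opaque argument.

function dot9_args(x, w)    # w arrives as a runtime argument; its values are unknown to the compiler
    s = 0.0
    @inbounds for q in 1:9
        s += w[q] * x[q]
    end
    return s
end

function dot9_local(x)      # constants constructed in the same function; values visible to the optimizer
    w = (4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36)
    s = 0.0
    @inbounds for q in 1:9
        s += w[q] * x[q]
    end
    return s
end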
I tried using …
As long as the C compiler can see the caller, and given that it inlines the function, it can certainly know that the arrays don't alias. You can help the compiler here by writing slightly better code:

function collide_kernel!(n, ρ, ux, uy, ex, ey, w, Q, NX, NY, ω)
    @inbounds for y in 1 : NY
        for x in 1 : NX
            ρxy, uxxy, uyxy = 0.0, 0.0, 0.0
            for q in 1 : Q
                nq = n[q,x,y]
                ρxy += nq
                uxxy = muladd(ex[q], nq, uxxy)
                uyxy = muladd(ey[q], nq, uyxy)
            end
            ρ_inv = 1. / ρxy
            uxxy *= ρ_inv
            uyxy *= ρ_inv
            usqr = uxxy * uxxy + uyxy * uyxy
            for q in 1 : Q
                eu = 3 * (ex[q] * uxxy + ey[q] * uyxy)
                neq = ρxy * w[q] * ( 1 + eu + 0.5*eu*eu - 1.5*usqr )
                n[q,x,y] = (1 - ω)*n[q,x,y] + ω*neq
            end
            ρ[x,y], ux[x,y], uy[x,y] = ρxy, uxxy, uyxy
        end
    end
end

The function above runs in 21 ms on my laptop, compared to 49 ms for the original Julia version and 22 ms for C.
Aliasing is pretty much the primary obstacle to loop analyses these days. Without aliasing guarantees, the compiler can do very little in the way of LICM (loop-invariant code motion); as a result the loop looks bigger, so the unrolling heuristics do different things. I have some changes in the pipeline that improve our alias information, at which point that should no longer be a problem.
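To make that concrete, a toy sketch (not code from this thread): if the compiler cannot rule out that two arrays alias, an apparently loop-invariant load cannot be hoisted out of the loop.

# If a and b might be the same array, b[1] cannot be moved out of the loop by
# LICM: the store to a[i] could change it on any iteration.
function scale_by_first!(a::Vector{Float64}, b::Vector{Float64})
    @inbounds for i in eachindex(a)
        a[i] = b[1] * i
    end
    return a
end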
(The C timing was a little wrong; it should be ~22 ms, so both C and Julia have the same speed.) I'm not sure the unrolling matters much here, but the main issue I see is that the first loop can only be vectorized if the compiler can prove that the store to … does not alias the arrays read inside the loop.
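One way to experiment with this by hand (a sketch, not something proposed in this thread): @simd ivdep lets the programmer assert, rather than the compiler prove, that there are no memory dependences in the innermost loop, so it is only safe when the arrays genuinely don't alias.

# Sketch: asserting independence of iterations with @simd ivdep (available in
# recent Julia versions). Using this on arrays that do alias gives wrong results.
function axpy_ivdep!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @inbounds @simd ivdep for i in eachindex(y)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end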
So, close as a dup of #19658?
Yes.
And a reminder to also teach the compiler that different allocations don't alias, so we'd see the same behavior as C++ when inlining is enabled.
Thank you both for the help; explicitly avoiding the aliasing issue does make most of the performance difference go away. On my machine they aren't quite equal yet, but they're within 30%, which is good enough for now. And I'm eagerly looking forward to noalias hints, whichever form they may ultimately take.
Since you're on Haswell, I should also mention that fma fusion is currently mostly disabled (though …).
I'm generally in the habit of writing explicit muladd calls anyway.
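For reference, a small sketch of what that looks like (illustrative only): muladd allows, but does not require, the compiler to emit a fused multiply-add where the hardware supports it, whereas a plain a*b + c stays a separate multiply and add unless fusion is enabled.

# Illustrative sketch: explicit muladd vs. separate multiply and add.
fma_style(a, b, c)   = muladd(a, b, c)   # may compile to a fused multiply-add (e.g. vfmadd on Haswell)
plain_style(a, b, c) = a * b + c         # multiply followed by add unless fusion is enabled

# Inspect the generated code with, e.g., @code_native fma_style(1.0, 2.0, 3.0)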
Consider the following benchmark code, taken from the collision kernel of a lattice Boltzmann simulation:
Benchmarking the code with

(data) = init_data(1000); @benchmark collide_kernel!(data...)

gives a very consistent 67 ms with Julia started with julia -O3 --check-bounds=no. The Julia installation in question has a system image built to take advantage of the target architecture (Haswell).

To evaluate if this can be optimized further, I ran a C++ version of the same code through clang v4.0.0 (using --std=c++14 -O3 -march=native). The C++ code gives a consistent 32 ms, a 2.1x advantage over the Julia code. Comparing the assembly produced by LLVM in both cases, the main differences that stand out involve the use of vector loads/stores and loop unrolling (which is far more extensive in the case of clang). I'm not sufficiently familiar with Julia internals to speculate as to whether this is the consequence of a known limitation or a new issue, so I'm reporting it here just in case.
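For completeness, a minimal harness sketch along those lines (it assumes the init_data and collide_kernel! definitions from the original benchmark; the harness itself is not part of the report):

# Run with: julia -O3 --check-bounds=no bench.jl
using BenchmarkTools

data = init_data(1000)                 # setup from the issue; returns the argument tuple for collide_kernel!
@benchmark collide_kernel!($data...)   # $-interpolation keeps global-variable overhead out of the measurement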
EDIT: versioninfo() output for completeness: