-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance regression in vectorized loops #40276
Comments
Also see the discussion here. |
I can repro. The analyzer says (using
while for the case where it works it says:
|
After digging into the generated LLVM IR, I find that this might be caused by a LLVM issue similar to #41637 The key here is So what happens here? The answer is: it is caused by the second simplifyCFG pass of the LLVM optimizer. m,n = size(dst)
if m == 4
if n == length(src)
for i in eachindex(src)
w1,w2,w3,w4 = ....
m,n = size(dst)
dst[1+m*i] = w1
m,n = size(dst)
dst[2+m*i] = w2
m,n = size(dst)
dst[3+m*i] = w3
m,n = size(dst)
dst[4+m*i] = w4
end
end
else
AssertionError()
end Note the repeat of In Julia 1.5, the branch condition In constract, In Julia 1.6, two branches m,n = size(dst)
b1 = (m == 4)
b2 = (n == length(src))
b = select(b2,b1,false)
if b
#loop
else
AssertionError()
end
To demonstrate this, we can remove the second branch, so two branches won't be combined together. And now it vectorizes! : julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia/static #one branch version:
function compute_weights2!(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
if size(dst)[1] == 4
@inbounds for i in eachindex(src)
w1, w2, w3, w4 = compute_weights(src[i])
dst[1,i] = w1
dst[2,i] = w2
dst[3,i] = w3
dst[4,i] = w4
end
end
return dst
end
Here #two branches version:
function compute_weights1!(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
if size(dst) == (4,length(src))
@inbounds for i in eachindex(src)
w1, w2, w3, w4 = compute_weights(src[i])
dst[1,i] = w1
dst[2,i] = w2
dst[3,i] = w3
dst[4,i] = w4
end
end
return dst
end
In summary, branchless code isn't always better than code with branches. |
I am really impressed by all these checks! Following the guidelines by @ChenNingCong, I have written the following code: module TestIssue40276
using BenchmarkTools
function compute_weights(t::T) where {T<:AbstractFloat}
u = 1 - t
v = (T(-1)/2)*t*u
w1 = v*u
w4 = v*t
w = w4 - w1
w2 = u - w1 + w
w3 = t - w4 - w
return (w1, w2, w3, w4)
end
function compute_weights_1(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
@assert size(dst) == (4,length(src))
@inbounds @simd for i in eachindex(src)
w1, w2, w3, w4 = compute_weights(src[i])
dst[1,i] = w1
dst[2,i] = w2
dst[3,i] = w3
dst[4,i] = w4
end
return dst
end
function compute_weights_2(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
if size(dst) == (4, length(src))
@inbounds @simd for i in eachindex(src)
w1, w2, w3, w4 = compute_weights(src[i])
dst[1,i] = w1
dst[2,i] = w2
dst[3,i] = w3
dst[4,i] = w4
end
else
throw(AssertionError())
end
return dst
end
function compute_weights_3(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
m,n = size(dst)
if m == 4 && n == length(src)
@inbounds @simd for i in eachindex(src)
w1, w2, w3, w4 = compute_weights(src[i])
dst[1,i] = w1
dst[2,i] = w2
dst[3,i] = w3
dst[4,i] = w4
end
else
throw(AssertionError())
end
return dst
end
function compute_weights_4(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
@assert size(dst) == (4,length(src))
if size(dst,1) == 4
@inbounds @simd for i in eachindex(src)
w1, w2, w3, w4 = compute_weights(src[i])
dst[1,i] = w1
dst[2,i] = w2
dst[3,i] = w3
dst[4,i] = w4
end
end
return dst
end
# With computed offset.
function compute_weights_5(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
@assert size(dst) == (4,length(src))
@inbounds @simd for i in eachindex(src)
w1, w2, w3, w4 = compute_weights(src[i])
j = 4*(i-1)
dst[j+1] = w1
dst[j+2] = w2
dst[j+3] = w3
dst[j+4] = w4
end
return dst
end
function runtests(; T::Type{<:AbstractFloat}=Float32, n::Int=1_000)
t = rand(T, n) .- one(T)/2
z = Array{T}(undef, 4, n)
print("Tests with Julia-", VERSION, ", T=", T, ", n=", n, "\n")
print("compute_weights_1"); @btime $(compute_weights_1)($z, $t)
print("compute_weights_2"); @btime $(compute_weights_2)($z, $t)
print("compute_weights_3"); @btime $(compute_weights_3)($z, $t)
print("compute_weights_4"); @btime $(compute_weights_4)($z, $t)
print("compute_weights_5"); @btime $(compute_weights_5)($z, $t)
nothing
end
runtests(T=Float32)
end # module and checked against the different Julia versions:
The issue only appears at Julia 1.6 and as pointed by @ChenNingCong, the two branch versions |
As a rule of thumb and to summarize my understanding:
The following variants appear to be equally fast: @assert size(dst,1) == 4 && size(dst,2) == length(src)
@inbounds @simd for i in eachindex(src)
... # same loop body
end or (better because if size(dst,1) == 4 && size(dst,2) == length(src)
@inbounds @simd for i in eachindex(src)
... # same loop body
end
else
throw(AssertionError())
end or m,n = size(dst)
if m == 4 && n == length(src)
@inbounds @simd for i in eachindex(src)
... # same loop body
end
else
throw(AssertionError())
end It is also possible to use the non short-circuit |
This wins a factor better than ~30 to rotate a 512×512 image in Float64 with a CatmulRom kernel (of size 4) (226ms with many small allocations -> 7ms with no allocations). This solves JuliaLang/julia#40276
Appears vectorized to me on master |
When switching from Julia-1.5.4 to Julia-1.6.0, I had a performance regression (by a factor between 2 and 3) in loops involving functions that return a tuple of values. My guess is that such loops stop being correctly vectorized (in spite of
@inbounds @simd
). Below is a short example to demonstrate the problem:Using
git bisect
I was able to figure out that the first Julia version showing the issue is fe1253ee258674844b8c035. For this commit, the above test yields:The timings for the previous version (commit 49b8e61a80b8108ca0a23f8 are:
So the fastest version runs at 19.3 Gflops while the other version only runs at 7.3 Gflops.
The text was updated successfully, but these errors were encountered: