
Performance regression in vectorized loops #40276

Closed
emmt opened this issue Mar 31, 2021 · 6 comments

Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@emmt (Contributor) commented Mar 31, 2021

When switching from Julia-1.5.4 to Julia-1.6.0, I hit a performance regression (by a factor of 2 to 3) in loops involving functions that return a tuple of values. My guess is that such loops are no longer correctly vectorized (in spite of @inbounds @simd). Below is a short example that demonstrates the problem:

using BenchmarkTools

function compute_weights(t::T) where {T<:AbstractFloat}
    u = 1 - t
    v = (T(-1)/2)*t*u
    w1 = v*u
    w4 = v*t
    w = w4 - w1
    w2 = u - w1 + w
    w3 = t - w4 - w
    return (w1, w2, w3, w4)
end

function compute_weights!(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    @assert size(dst) == (4,length(src))
    @inbounds @simd for i in eachindex(src)
        w1, w2, w3, w4 = compute_weights(src[i])
        dst[1,i] = w1
        dst[2,i] = w2
        dst[3,i] = w3
        dst[4,i] = w4
    end
    return dst
end

function runtests(; T::Type{<:AbstractFloat}=Float32, n::Int=1_000)
    t = rand(T, n) .- one(T)/2
    z = Array{T}(undef, 4, n)
    print("Tests with Julia-", VERSION, ", T=", T, ", n=", n, "\n")
    print("Compute_weights! "); @btime $(compute_weights!)($z, $t)
    nothing
end

runtests(T=Float32)

Using git bisect, I was able to figure out that the first Julia version showing the issue is commit fe1253ee258674844b8c035. For this commit, the above test yields:

Tests with Julia-1.6.0-DEV.1648, T=Float32, n=1000
Compute_weights!   1.370 μs (0 allocations: 0 bytes)

The timings for the previous version (commit 49b8e61a80b8108ca0a23f8) are:

Tests with Julia-1.6.0-DEV.1647, T=Float32, n=1000
Compute_weights!   518.611 ns (0 allocations: 0 bytes)

So the fastest version runs at 19.3 Gflops while the other version runs at only 7.3 Gflops.
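
These figures follow from a flop count that is my accounting rather than something stated above: compute_weights performs 10 floating-point operations per element and n = 1000, hence 10_000 flops per call to compute_weights!:

flops = 10 * 1000                   # 10 flops per compute_weights call
fast  = flops / 518.611e-9 / 1e9    # ≈ 19.3 Gflops
slow  = flops / 1.370e-6 / 1e9      # ≈ 7.3 Gflops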

@emmt (Contributor, Author) commented Mar 31, 2021

Also see the discussion here.

@KristofferC (Member) commented:

I can repro. The analyzer says (using export JULIA_LLVM_ARGS="--pass-remarks-analysis=loop-vectorize --pass-remarks-missed=loop-vectorize --pass-remarks=loop-vectorize"):

remark: simdloop.jl:75:0: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop

while for the case where it works it says:

remark: simdloop.jl:75:0: vectorized loop (vectorization width: 8, interleaved count: 1)

KristofferC added the performance (Must go faster) and regression (Regression in behavior compared to a previous version) labels on Mar 31, 2021
@ChenNingCong commented:

After digging into the generated LLVM IR, I find that this might be caused by an LLVM issue similar to #41637.

The key here is @assert size(dst) == (4,length(src)). It provides the shape information of the array, which is essential for vectorization. Without this information, LLVM can't prove that the writes to dst are disjoint across iterations (if the leading dimension were 1 instead of 4, for instance, iterations i and i+1 would write overlapping elements). That's why we get unsafe dependent memory operations in loop in the warning message reported by @KristofferC.

So what happens here? It is caused by the second simplifyCFG pass of the LLVM optimizer. compute_weights! is lowered into the following (pseudo) IR before any LLVM optimization runs:

m,n = size(dst)
if m == 4 
    if n == length(src)
        for i in eachindex(src)
            w1,w2,w3,w4 = ....
            m,n = size(dst)
            dst[1+m*i] = w1
            m,n = size(dst)
            dst[2+m*i] = w2
            m,n = size(dst)
            dst[3+m*i] = w3
            m,n = size(dst)
            dst[4+m*i] = w4
        end
    end
else
    AssertionError()
end

Note the repeated m,n = size(dst) inside the loop: the shape is reloaded at each access because, in general, an array's shape might change between instructions. Of course these loads are removed by a later optimization pass, since it is easy to prove that the shape stays constant during the iteration.
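
(As an illustration of why the front end must be conservative, and as general Julia behavior rather than code from this issue, a one-dimensional array's shape really can change under mutation:)

a = [1.0, 2.0, 3.0]
length(a)      # 3
push!(a, 4.0)  # mutation can change the shape...
length(a)      # 4: in general a size() result cannot be cached across a
               # mutating statement until the compiler proves it invariant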

In Julia 1.5, the branch condition m == 4 is successfully propagated into the loop, so the stride m is the constant 4 and LLVM easily concludes that all the writes are disjoint and performs the vectorization. The propagation happens during GVN (the global value numbering pass): GVN looks at the branch condition, and if it has the form A == B, it replaces occurrences of the more complex side with the simpler one (the replacement happens only in blocks dominated by that branch, of course).
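
(A minimal sketch of this propagation, in the same Julia-flavored pseudocode as above rather than real IR:)

m, n = size(dst)
if m == 4
    # Inside this block, which is dominated by the m == 4 branch, GVN may
    # rewrite every use of m to the constant 4:
    #   before GVN: dst[1 + m*i]   (stride unknown to the vectorizer)
    #   after GVN:  dst[1 + 4*i]   (stride is the constant 4, so stores from
    #                               consecutive iterations are provably disjoint)
end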

In contrast, in Julia 1.6 the two branches m == 4 and n == length(src) get merged into one branch by the second simplifyCFG pass, which runs before GVN, while in Julia 1.5 the branches are left untouched. The result looks like this (the original IR is too long to show here):

m,n = size(dst)
b1 = (m == 4)
b2 = (n == length(src))
b = select(b2,b1,false)
if b
    #loop
else
    AssertionError()
end

select(b2,b1,false) is just b2 && b1, but simplifyCFG generates a select from the branches instead of an and. This select is transformed into an and during the instcombine pass that runs after GVN. The condition is now b, a select instruction, and GVN doesn't recognize select (LLVM's newGVN pass does handle select specially, but I guess that pass is not used by Julia?). So m == 4 is not propagated into the loop during GVN, and later passes also fail to do so.
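
(A Julia-level analogue, my illustration rather than compiler output: like LLVM's select, ifelse evaluates both operands and takes no branch, so there is no branch condition left for GVN to propagate.)

for b1 in (false, true), b2 in (false, true)
    @assert ifelse(b2, b1, false) == (b2 && b1)   # identical truth tables, no branch
end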

To demonstrate this, we can remove the second branch so that the two branches can't be merged. And now it vectorizes!

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia/static
#one branch version:
function compute_weights2!(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    if size(dst)[1] == 4
        @inbounds for i in eachindex(src)
            w1, w2, w3, w4 = compute_weights(src[i])
            dst[1,i] = w1
            dst[2,i] = w2
            dst[3,i] = w3
            dst[4,i] = w4
        end
    end
    return dst
end

@code_llvm debuginfo=:none compute_weights2!(z,t) gives:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %34 = getelementptr inbounds float, float* %16, i64 %index
  %35 = bitcast float* %34 to <8 x float>*
  %wide.load = load <8 x float>, <8 x float>* %35, align 4
  %36 = fsub <8 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %wide.load
  %37 = fmul <8 x float> %wide.load, <float -5.000000e-01, float -5.000000e-01, float -5.000000e-01, float -5.000000e-01, float -5.000000e-01, float -5.000000e-01, float -5.000000e-01, float -5.000000e-01>
  %38 = fmul <8 x float> %37, %36
  %39 = fmul <8 x float> %36, %38
  %40 = fmul <8 x float> %wide.load, %38
  %41 = fsub <8 x float> %40, %39
  %42 = fsub <8 x float> %36, %39
  %43 = fadd <8 x float> %42, %41
  %44 = fsub <8 x float> %wide.load, %40
  %45 = fsub <8 x float> %44, %41
  %46 = shl i64 %index, 2
  %47 = or i64 %46, 3
  %48 = getelementptr inbounds float, float* %18, i64 -3
  %49 = getelementptr inbounds float, float* %48, i64 %47
  %50 = bitcast float* %49 to <32 x float>*
  %51 = shufflevector <8 x float> %39, <8 x float> %43, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %52 = shufflevector <8 x float> %45, <8 x float> %40, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %interleaved.vec = shufflevector <16 x float> %51, <16 x float> %52, <32 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27, i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
  store <32 x float> %interleaved.vec, <32 x float>* %50, align 4
  %index.next = add i64 %index, 8
  %53 = icmp eq i64 %index.next, %n.vec
  br i1 %53, label %middle.block, label %vector.body

Here z and t are just the data from the runtests function. The original code posted in this issue (with two branches) gives:

#two branches version:
function compute_weights1!(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    if size(dst) == (4,length(src))
        @inbounds for i in eachindex(src)
            w1, w2, w3, w4 = compute_weights(src[i])
            dst[1,i] = w1
            dst[2,i] = w2
            dst[3,i] = w3
            dst[4,i] = w4
        end
    end
    return dst
end

@code_llvm debuginfo=:none compute_weights1!(z,t)

define nonnull {}* @"japi1_compute_weights1!_216"({}* %0, {}** %1, i32 %2) #0 {
top:
  %3 = alloca {}**, align 8
  store volatile {}** %1, {}*** %3, align 8
  %4 = load {}*, {}** %1, align 8
  %5 = getelementptr inbounds {}*, {}** %1, i64 1
  %6 = load {}*, {}** %5, align 8
  %7 = bitcast {}* %4 to {}**
  %8 = getelementptr inbounds {}*, {}** %7, i64 3
  %9 = bitcast {}** %8 to i64*
  %10 = load i64, i64* %9, align 8
  %.not = icmp ne i64 %10, 4
  %11 = bitcast {}* %6 to { i8*, i64, i16, i16, i32 }*
  %12 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %11, i64 0, i32 1
  %13 = load i64, i64* %12, align 8
  %14 = getelementptr inbounds {}*, {}** %7, i64 4
  %15 = bitcast {}** %14 to i64*
  %16 = load i64, i64* %15, align 8
  %17 = icmp ne i64 %16, %13
  %value_phi = or i1 %.not, %17
  br i1 %value_phi, label %L61, label %L18

L18:                                              ; preds = %top
  %18 = bitcast {}* %6 to {}**
  %19 = getelementptr inbounds {}*, {}** %18, i64 3
  %20 = bitcast {}** %19 to i64*
  %21 = load i64, i64* %20, align 8
  %.not12.not = icmp eq i64 %21, 0
  br i1 %.not12.not, label %L61, label %L31.preheader

L31.preheader:                                    ; preds = %L18
  %22 = bitcast {}* %6 to float**
  %23 = load float*, float** %22, align 8
  %24 = bitcast {}* %4 to float**
  %25 = load float*, float** %24, align 8
  br label %L31

L31:                                              ; preds = %L31, %L31.preheader
  %value_phi4 = phi i64 [ %47, %L31 ], [ 1, %L31.preheader ]
  %26 = add nsw i64 %value_phi4, -1
  %27 = getelementptr inbounds float, float* %23, i64 %26
  %28 = load float, float* %27, align 4
  %29 = fsub float 1.000000e+00, %28
  %30 = fmul float %28, -5.000000e-01
  %31 = fmul float %30, %29
  %32 = fmul float %29, %31
  %33 = fmul float %28, %31
  %34 = fsub float %33, %32
  %35 = fsub float %29, %32
  %36 = fadd float %35, %34
  %37 = fsub float %28, %33
  %38 = fsub float %37, %34
  %39 = shl i64 %26, 2
  %40 = getelementptr inbounds float, float* %25, i64 %39
  store float %32, float* %40, align 4
  %41 = or i64 %39, 1
  %42 = getelementptr inbounds float, float* %25, i64 %41
  store float %36, float* %42, align 4
  %43 = or i64 %39, 2
  %44 = getelementptr inbounds float, float* %25, i64 %43
  store float %38, float* %44, align 4
  %45 = or i64 %39, 3
  %46 = getelementptr inbounds float, float* %25, i64 %45
  store float %33, float* %46, align 4
  %.not13.not = icmp eq i64 %value_phi4, %21
  %47 = add nuw nsw i64 %value_phi4, 1
  br i1 %.not13.not, label %L61, label %L31

L61:                                              ; preds = %L31, %L18, %top
  ret {}* %4
}

In summary, branchless code isn't always better than code with branches.

@emmt (Contributor, Author) commented Jul 23, 2021

I am really impressed by all these checks! Following the guidelines by @ChenNingCong, I have written the following code:

module TestIssue40276

using BenchmarkTools

function compute_weights(t::T) where {T<:AbstractFloat}
    u = 1 - t
    v = (T(-1)/2)*t*u
    w1 = v*u
    w4 = v*t
    w = w4 - w1
    w2 = u - w1 + w
    w3 = t - w4 - w
    return (w1, w2, w3, w4)
end

function compute_weights_1(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    @assert size(dst) == (4,length(src))
    @inbounds @simd for i in eachindex(src)
        w1, w2, w3, w4 = compute_weights(src[i])
        dst[1,i] = w1
        dst[2,i] = w2
        dst[3,i] = w3
        dst[4,i] = w4
    end
    return dst
end

function compute_weights_2(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    if size(dst) == (4, length(src))
        @inbounds @simd for i in eachindex(src)
            w1, w2, w3, w4 = compute_weights(src[i])
            dst[1,i] = w1
            dst[2,i] = w2
            dst[3,i] = w3
            dst[4,i] = w4
        end
    else
        throw(AssertionError())
    end
    return dst
end

function compute_weights_3(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    m,n = size(dst)
    if m == 4 && n == length(src)
        @inbounds @simd for i in eachindex(src)
            w1, w2, w3, w4 = compute_weights(src[i])
            dst[1,i] = w1
            dst[2,i] = w2
            dst[3,i] = w3
            dst[4,i] = w4
        end
    else
        throw(AssertionError())
    end
    return dst
end

function compute_weights_4(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    @assert size(dst) == (4,length(src))
    if size(dst,1) == 4
        @inbounds @simd for i in eachindex(src)
            w1, w2, w3, w4 = compute_weights(src[i])
            dst[1,i] = w1
            dst[2,i] = w2
            dst[3,i] = w3
            dst[4,i] = w4
        end
    end
    return dst
end

# With computed offset.
function compute_weights_5(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    @assert size(dst) == (4,length(src))
    @inbounds @simd for i in eachindex(src)
        w1, w2, w3, w4 = compute_weights(src[i])
        j = 4*(i-1)
        dst[j+1] = w1
        dst[j+2] = w2
        dst[j+3] = w3
        dst[j+4] = w4
    end
    return dst
end

function runtests(; T::Type{<:AbstractFloat}=Float32, n::Int=1_000)
    t = rand(T, n) .- one(T)/2
    z = Array{T}(undef, 4, n)
    print("Tests with Julia-", VERSION, ", T=", T, ", n=", n, "\n")
    print("compute_weights_1"); @btime $(compute_weights_1)($z, $t)
    print("compute_weights_2"); @btime $(compute_weights_2)($z, $t)
    print("compute_weights_3"); @btime $(compute_weights_3)($z, $t)
    print("compute_weights_4"); @btime $(compute_weights_4)($z, $t)
    print("compute_weights_5"); @btime $(compute_weights_5)($z, $t)
    nothing
end

runtests(T=Float32)

end # module

and checked it against different Julia versions:

Tests with Julia-1.0.5, T=Float32, n=1000
compute_weights_1  963.478 ns (0 allocations: 0 bytes)
compute_weights_2  958.000 ns (0 allocations: 0 bytes)
compute_weights_3  963.043 ns (0 allocations: 0 bytes)
compute_weights_4  971.000 ns (0 allocations: 0 bytes)
compute_weights_5  961.111 ns (0 allocations: 0 bytes)

Tests with Julia-1.1.1, T=Float32, n=1000
compute_weights_1  943.200 ns (0 allocations: 0 bytes)
compute_weights_2  949.583 ns (0 allocations: 0 bytes)
compute_weights_3  943.846 ns (0 allocations: 0 bytes)
compute_weights_4  948.800 ns (0 allocations: 0 bytes)
compute_weights_5  944.348 ns (0 allocations: 0 bytes)

Tests with Julia-1.2.0, T=Float32, n=1000
compute_weights_1  955.217 ns (0 allocations: 0 bytes)
compute_weights_2  954.762 ns (0 allocations: 0 bytes)
compute_weights_3  954.348 ns (0 allocations: 0 bytes)
compute_weights_4  960.952 ns (0 allocations: 0 bytes)
compute_weights_5  955.217 ns (0 allocations: 0 bytes)

Tests with Julia-1.3.1, T=Float32, n=1000
compute_weights_1  963.500 ns (0 allocations: 0 bytes)
compute_weights_2  961.053 ns (0 allocations: 0 bytes)
compute_weights_3  954.762 ns (0 allocations: 0 bytes)
compute_weights_4  957.273 ns (0 allocations: 0 bytes)
compute_weights_5  958.000 ns (0 allocations: 0 bytes)

Tests with Julia-1.4.2, T=Float32, n=1000
compute_weights_1  948.462 ns (0 allocations: 0 bytes)
compute_weights_2  948.571 ns (0 allocations: 0 bytes)
compute_weights_3  953.913 ns (0 allocations: 0 bytes)
compute_weights_4  948.261 ns (0 allocations: 0 bytes)
compute_weights_5  954.000 ns (0 allocations: 0 bytes)

Tests with Julia-1.5.4, T=Float32, n=1000
compute_weights_1  954.000 ns (0 allocations: 0 bytes)
compute_weights_2  948.400 ns (0 allocations: 0 bytes)
compute_weights_3  948.400 ns (0 allocations: 0 bytes)
compute_weights_4  948.800 ns (0 allocations: 0 bytes)
compute_weights_5  948.750 ns (0 allocations: 0 bytes)

Tests with Julia-1.6.2, T=Float32, n=1000
compute_weights_1  1.427 μs (0 allocations: 0 bytes)
compute_weights_2  1.427 μs (0 allocations: 0 bytes)
compute_weights_3  955.909 ns (0 allocations: 0 bytes)
compute_weights_4  945.417 ns (0 allocations: 0 bytes)
compute_weights_5  945.000 ns (0 allocations: 0 bytes)

Tests with Julia-1.7.0-beta3.0, T=Float32, n=1000
compute_weights_1  1.416 μs (0 allocations: 0 bytes)
compute_weights_2  1.448 μs (0 allocations: 0 bytes)
compute_weights_3  945.417 ns (0 allocations: 0 bytes)
compute_weights_4  950.000 ns (0 allocations: 0 bytes)
compute_weights_5  955.000 ns (0 allocations: 0 bytes)

The issue only appears from Julia 1.6 onward and, as pointed out by @ChenNingCong, the two versions that test the whole size tuple, compute_weights_1 and compute_weights_2, are slow while the others are fine. Choosing among the fast versions is a matter of taste...

@emmt (Contributor, Author) commented Jul 23, 2021

As a rule of thumb and to summarize my understanding:

  • For loop optimization, knowing that array strides have a specific value (here 4) is critical.
  • This knowledge can be acquired by an explicit test, e.g. size(dst,1) == 4 or even m,n = size(dst); if m == 4 .... Whether the test is done in an @assert or in an if statement is not relevant; an assertion with @assert may however be disabled depending on the optimization level (see the docs), so one should avoid relying on it.
  • The issue (Performance regression in vectorized loops #40276) is that, for Julia ≥ 1.6, a check on a whole tuple of dimensions is not sufficient to acquire this knowledge: checking that size(dst) == (4,length(src)) is not a hint for the optimizer that size(dst,1) == 4. My guess is that this issue affects other vectorizable loops over strided arrays as well.

The following variants appear to be equally fast:

@assert size(dst,1) == 4 && size(dst,2) == length(src)
@inbounds @simd for i in eachindex(src)
   ... # same loop body
end

or (better because @assert may be disabled):

if size(dst,1) == 4 && size(dst,2) == length(src)
    @inbounds @simd for i in eachindex(src)
       ... # same loop body
    end
else
    throw(AssertionError())
end

or

m,n = size(dst)
if m == 4 && n == length(src)
    @inbounds @simd for i in eachindex(src)
       ... # same loop body
    end
else
    throw(AssertionError())
end

It is also possible to use the non-short-circuit and, that is, (size(dst,1) == 4) & (size(dst,2) == length(src)).
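
A sketch of that variant (the name compute_weights_6 is mine), mirroring compute_weights_3 with a single non-short-circuit test:

function compute_weights_6(dst::Array{T,2}, src::Array{T,1}) where {T<:AbstractFloat}
    # & evaluates both comparisons and yields one Bool, so a single branch
    # is emitted; per the observation above, this is as fast as the others.
    if (size(dst,1) == 4) & (size(dst,2) == length(src))
        @inbounds @simd for i in eachindex(src)
            w1, w2, w3, w4 = compute_weights(src[i])
            dst[1,i] = w1
            dst[2,i] = w2
            dst[3,i] = w3
            dst[4,i] = w4
        end
    else
        throw(AssertionError())
    end
    return dst
end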

emmt added a commit to emmt/LinearInterpolators.jl that referenced this issue Jul 29, 2021
emmt added a commit to emmt/LinearInterpolators.jl that referenced this issue May 26, 2022
This wins a factor better than ~30 for rotating a 512×512 image in Float64
with a CatmullRom kernel (of size 4): 226 ms with many small allocations ->
7 ms with no allocations.

This solves JuliaLang/julia#40276
@vtjnash (Member) commented Aug 24, 2023

Appears vectorized to me on master

vtjnash closed this as completed on Aug 24, 2023