Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broadcasting is much slower than a for loop #28126

Open
YingboMa opened this issue Jul 15, 2018 · 39 comments
Open

Broadcasting is much slower than a for loop #28126

YingboMa opened this issue Jul 15, 2018 · 39 comments
Labels
broadcast Applying a function over a collection compiler:simd instruction-level vectorization performance Must go faster regression Regression in behavior compared to a previous version

Comments

@YingboMa
Copy link
Contributor

Here is a minimal working example.

julia> using BenchmarkTools

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e)
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.1 * (0.2c[i] + 0.3d[i] + 0.4e[i])
           end
           nothing
       end
goo (generic function with 1 method)

julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##9#10"))}(getfield(Main, Symbol("##9#10"))(), 1:5)

julia> @btime foo($a,$b,$c,$d,$e)
  1.277 μs (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  345.568 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 0.7.0-beta2.12
Commit a878341 (2018-07-15 15:57 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_PKG3_PRECOMPILE = 1
@fredrikekre fredrikekre added the performance Must go faster label Jul 15, 2018
@gasagna
Copy link
Contributor

gasagna commented Jul 15, 2018

Maybe related to #27988 ?

@KristofferC
Copy link
Member

There is no inference problem, the inbounds loop manages to simd while the broadcast doesn't. Removing the inbounds and they are equally fast.

@YingboMa
Copy link
Contributor Author

No, I don't think so. After applying this diff and rebuilding Julia, I got the same result.

julia> @btime foo($a,$b,$c,$d,$e)
  1.274 μs (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  348.897 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 0.7.0-beta2.12
Commit a878341* (2018-07-15 15:57 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_PKG3_PRECOMPILE = 1

@jebej
Copy link
Contributor

jebej commented Jul 15, 2018

Don't dot-ops do operations inbounds?

@YingboMa
Copy link
Contributor Author

Maybe related to #22255?

julia> using BenchmarkTools

julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##5#6"))}(getfield(Main, Symbol("##5#6"))(), 1:5)

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}) where T
           @. a = b + 0.2c + 0.3d
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.2c[i] + 0.3d[i]
           end
           nothing
       end
goo (generic function with 1 method)

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.2c + 0.3d + 0.4e
           nothing
       end
foo (generic function with 2 methods)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
           end
           nothing
       end
goo (generic function with 2 methods)

julia> @btime foo($a,$b,$c,$d)
  168.767 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d)
  166.587 ns (0 allocations: 0 bytes)

julia> @btime foo($a,$b,$c,$d,$e)
  1.135 μs (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  342.142 ns (0 allocations: 0 bytes)

@Keno
Copy link
Member

Keno commented Jul 15, 2018

Looks like LLVM is having some trouble with the array bounds in the broadcast example:

remark: <unknown>:0:0: loop not vectorized: cannot identify array bounds

@JeffBezanson JeffBezanson added the broadcast Applying a function over a collection label Jul 15, 2018
@Keno
Copy link
Member

Keno commented Jul 15, 2018

If I'm reading this correctly, the array pointer load is not getting LICM'ed. @vtjnash was looking at something related recently:

  %171 = load double*, double* addrspace(11)* %152, align 8
  %172 = getelementptr inbounds double, double* %171, i64 %value_phi1517.us.us
  store double %170, double* %172, align 8

@Keno
Copy link
Member

Keno commented Jul 15, 2018

Actually, my bad, I accidentally dropped TBAA.

@Keno
Copy link
Member

Keno commented Jul 16, 2018

Ok, the array bounds are still the issue though. Not sure why LLVM is unable to find them. We should probably fix that. However, there is a larger issue here. We go through all this trouble on the julia side to unalias the destination array from any of the arguments, but we don't actually tell LLVM that we did that, so it tries to insert a dynamic memory range check for which it then can't find the bounds. The better fix here is probably to figure out how to inform LLVM that the pointers don't alias.

@ChrisRackauckas
Copy link
Member

the inbounds loop manages to simd while the broadcast doesn't

If this is the case, why? Why not assert that the sizes work and then just @inbounds @simd (or let it auto-vectorize)?

@ChrisRackauckas
Copy link
Member

Maybe related to #22255?

It's not #22255 because if it was then the reduced size example would have no regression and the larger size example would have allocations.

@chriselrod
Copy link
Contributor

chriselrod commented Jul 16, 2018

If this is the case, why? Why not assert that the sizes work and then just @inbounds @simd (or let it auto-vectorize)?

I don't know the details of how broadcasting is implemented (and all the LLVM interfacing is way beyond me), but as Keno elaborated above, LLVM isn't aware that the array's aren't aliased while broadcasting, and therefore doesn't auto-vectorize (if b[i+1] refers to the same element as a[i] in @. a = b + 0.1 * (0.2c + 0.3d + 0.4e), you'd obviously have to calculate each a[i] in order, instead of in SIMD chunks).

@ChrisRackauckas
Copy link
Member

Would #19658 be a good solution?

@chriselrod
Copy link
Contributor

chriselrod commented Jul 18, 2018

I think having @assert_nolias would be pretty cool. I struggled with getting the compiler to vectorize while messing with matmul.
But, hopefully it'll be unnecessary for plain old array data types. In there Keno says:

Eventually we may want to introduce more first class support for non-aliasing arrays (which array almost are, were it not for sharing by reshaping etc).

I think that's why we now have reinterpret and reshaped arrays, to try and make it so whenever
isa(a, Array) && isa(b, Array) && !(a === b)
they'd be guaranteed not to alias, and no runtime checks are needed.

I thought we were already there, but I realized unsafe_wrap still creates a plain old array. To test usnafe_wrap, and whether alias checks are happening...

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.2c + 0.3d + 0.4e
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
           end
           nothing
       end
goo (generic function with 1 method)

julia> c = randn(80);  d = randn(80);  e = randn(80);

julia> x1 = randn(81);

julia> x2 = copy(x1); x3 = copy(x1); #x3 backup, if we want to double check answers later

julia> px1 = pointer(x1); px2 = pointer(x2);

julia> a1 = unsafe_wrap(Vector{Float64}, px1 + 8, 80); # a[i] = b[i+1]

julia> a2 = unsafe_wrap(Vector{Float64}, px2 + 8, 80);

julia> b1 = unsafe_wrap(Vector{Float64}, px1, 80);

julia> b2 = unsafe_wrap(Vector{Float64}, px2, 80);

julia> foo(a1, b1, c, d, e)

julia> goo(a2, b2, c, d, e)

julia> a1'
1×80 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.805164  -0.134631  -0.696007  -0.39507  0.139368  0.209027  0.992377    -0.288711  -0.417188  -0.330905  -0.518275  -0.0962622  0.0562025

julia> a2'
1×80 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.805164  -0.134631  -0.696007  -0.39507  0.139368  0.209027  0.992377    -0.288711  -0.417188  -0.330905  -0.518275  -0.0962622  0.0562025

but they agreed!

Benchmarking also supports that they got caught by a runtime check:

julia> using BenchmarkTools

julia> @btime foo($a1, $b1, $c, $d, $e)
  304.583 ns (0 allocations: 0 bytes)

julia> @btime goo($a2, $b2, $c, $d, $e)
  284.393 ns (0 allocations: 0 bytes)

julia> a3 = randn(80); b3 = randn(80);

julia> @btime foo($a3, $b3, $c, $d, $e)
  89.477 ns (0 allocations: 0 bytes)

julia> @btime goo($a3, $b3, $c, $d, $e)
  24.370 ns (0 allocations: 0 bytes)

julia> function goo2(a::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) - 1 == length(c) == length(d) == length(e)
           # Make aliasing obvious
           @inbounds for i in eachindex(c)
               a[i+1] = a[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
           end
           nothing
       end
goo2 (generic function with 1 method)

julia> a4 = randn(81);

julia> @btime goo2($a4, $c, $d, $e)
  192.453 ns (0 allocations: 0 bytes)

The clearly aliasing version was much faster (195ns vs 285ns) than the disguised alias. Both much slower of course than no aliasing.

What all is left, before the compiler can be promised different arrays never alias? Will that happen, and how much does it matter (ie, what is the cost of runtime checking)?

@vtjnash
Copy link
Member

vtjnash commented Jul 18, 2018

Would #19658 be a good solution?

I'm not sure – this used to roughly be the meaning of the @simd macro, but we recently removed it since we found that @no_alias also essentially ends up asserting that no data may be read from the destination as well ("self-aliasing"), and there didn't seem to be a good way to express that contract in general. This was OK for the basic Array type, but was breaking generic code when it happened to get passed a different AbstractArray, such as a BitArray.

@KristofferC
Copy link
Member

I think that's why we now have reinterpret and reshaped arrays, to try and make it so whenever
isa(a, Array) && isa(b, Array) && !(a === b)
they'd be guaranteed not to alias, and no runtime checks are needed.

julia> a = rand(2,2); b = vec(a);

julia> isa(a, Array) && isa(b, Array) && !(a === b)
true

julia> a[1,1] = 2
2

julia> b[1]
2.0

@chriselrod
Copy link
Contributor

That's a much less convoluted example than the unsafe_wraps.
I also realize now:

julia> x = rand(2,2,2);

julia> reshape(x,2,4)
2×4 Array{Float64,2}:
 0.575181  0.154845  0.845574  0.772895
 0.722177  0.329466  0.258449  0.718844

I realize now the ReshapedArray type only applies to AbstractArrays?

I was thinking of this comment: https://discourse.julialang.org/t/use-of-pointer/8849/9
so I thought there was an effort to disallow aliasing in that way.
Any good issues/prs that show some discussion?
I don't know all these low level / compiler details, but am trying to get some grasp of what's going on.

@giordano
Copy link
Contributor

giordano commented Oct 5, 2020

Interestingly enough, when the vectors are much larger (10k x those used by @YingboMa in the original benchmarks), broadcasting and for loop are equally fast (or slow?):

julia> a,b,c,d,e=(rand(10_000_000) for i in 1:5)
Base.Generator{UnitRange{Int64}, var"#28#29"}(var"#28#29"(), 1:5)

julia> @btime foo($a,$b,$c,$d,$e)
  32.563 ms (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  32.485 ms (0 allocations: 0 bytes)

@YingboMa
Copy link
Contributor Author

YingboMa commented Oct 6, 2020

Yeah, when the data is large, then the performance is bounded by memory not compute.

@chriselrod
Copy link
Contributor

You can see this by comparing with benchmarks of shorter vectors:

julia> a,b,c,d,e=(rand(10_000_000) for i in 1:5);

julia> @btime foo($a,$b,$c,$d,$e)
  28.240 ms (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  25.086 ms (0 allocations: 0 bytes)

julia> a,b,c,d,e=(rand(512) for i in 1:5);

julia> @btime foo($a,$b,$c,$d,$e)
  648.171 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  73.751 ns (0 allocations: 0 bytes)

julia> ((28.24e-3, 25.086e-3) ./ (648.171e-9, 73.751e-9)) ./ (10_000_000 / 512)
(2.230719979758428, 17.41540046914618)

So by shortening the vectors from 10 million to 512 elements, I get 2x and 17x faster results relative to the number of elements.

@giordano
Copy link
Contributor

giordano commented Oct 6, 2020

Yeah, after writing the message I realised that I was getting a very bad scaling compared the initial 1000-element example (10k x more elements but 22k x slower, so the same 2x factor you're seeing), but didn't think that performance was actually bound by memory. I should have probably have gone to sleep yesterday night instead of doing dumb benchmarks 😆

@ViralBShah
Copy link
Member

@YingboMa Close?

@YingboMa
Copy link
Contributor Author

This is not resolved.

@ViralBShah ViralBShah changed the title Broadcasting is much slower than a for loop on Julia 0.7 Broadcasting is much slower than a for loop Mar 11, 2022
@chriselrod
Copy link
Contributor

To resolve this issue for most cases, we'd need to take a FastBroadcast-like approach of generating at least two versions of the loop explicitly, so that fast code can be generated for at least 1 case.

@YingboMa
Copy link
Contributor Author

That's what #30973 does.

@vtjnash
Copy link
Member

vtjnash commented Aug 24, 2023

julia> @btime foo($a,$b,$c,$d,$e)
  313.348 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  305.598 ns (0 allocations: 0 bytes)

julia> @btime foo($a,$b,$c,$d)
  142.696 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d)
  141.994 ns (0 allocations: 0 bytes)

julia> @btime foo($a,$b,$c,$d,$e)
  337.626 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  330.529 ns (0 allocations: 0 bytes)

@vtjnash vtjnash closed this as completed Aug 24, 2023
@jishnub
Copy link
Contributor

jishnub commented Aug 25, 2023

Is this fixed on v1.10 as well, or is scheduled to be backported?

EDIT: it seems to be fixed.

julia> using BenchmarkTools

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e)
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.1 * (0.2c[i] + 0.3d[i] + 0.4e[i])
           end
           nothing
       end
goo (generic function with 1 method)

julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64}, var"#1#2"}(var"#1#2"(), 1:5)

julia> @btime foo($a,$b,$c,$d,$e)
  406.280 ns (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  404.220 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.10.0-beta2
Commit a468aa198d0 (2023-08-17 06:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
Environment:
  LD_LIBRARY_PATH = :/usr/lib/x86_64-linux-gnu/gtk-3.0/modules
  JULIA_EDITOR = subl

@chriselrod
Copy link
Contributor

chriselrod commented Aug 28, 2023

Looking at FastBroadcast.jl's vectorization tests (foo uses FastBroadcast, bar uses Base broadcast):

julia> using MuladdMacro, FastBroadcast

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @muladd @.. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> function foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
           @muladd @.. a =
               b +
               0.1 * (
                   0.2c +
                   0.3d +
                   0.4e +
                   0.5f +
                   0.6g +
                   0.6h +
                   0.6i +
                   0.6j +
                   0.6k +
                   0.6l +
                   0.6m +
                   0.6n +
                   0.6o +
                   0.6p +
                   0.6q +
                   0.6r +
                   0.6s +
                   0.6t +
                   0.6u +
                   0.6v +
                   0.6w +
                   0.6x +
                   0.6y +
                   0.6z
               )
           nothing
       end
foo26 (generic function with 1 method)

julia> function bar9(a, b, c, d, e, f, g, h, i)
           @muladd @. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
bar9 (generic function with 1 method)

julia> function bar26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
           @muladd @. a =
               b +
               0.1 * (
                   0.2c +
                   0.3d +
                   0.4e +
                   0.5f +
                   0.6g +
                   0.6h +
                   0.6i +
                   0.6j +
                   0.6k +
                   0.6l +
                   0.6m +
                   0.6n +
                   0.6o +
                   0.6p +
                   0.6q +
                   0.6r +
                   0.6s +
                   0.6t +
                   0.6u +
                   0.6v +
                   0.6w +
                   0.6x +
                   0.6y +
                   0.6z
               )
           nothing
       end
bar26 (generic function with 1 method)

julia> a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z =
           [rand(100, 100, 2) for i = 1:26];

julia> foo9(a, b, c, d, e, f, g, h, i)

julia> foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)

julia> bar9(a, b, c, d, e, f, g, h, i)

julia> bar26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)

julia> @benchmark foo9($a, $b, $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  44.698 μs  100.563 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     45.143 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.254 μs ± 727.554 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▄▆▇█▆▃▁                                                
  ▂▂▂▃▄▆████████▇▅▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  44.7 μs         Histogram: frequency by time         47.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bar9($a, $b, $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  88.260 μs  105.204 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     88.657 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.839 μs ± 651.841 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▄▅▇▇██▇▇▆▄▂                               ▁    ▁▁▂▂▂▂▂▂▁▁ ▃
  ▆██████████████▇▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▅▄▇▇▇▇████▇████████████ █
  88.3 μs       Histogram: log(frequency) by time      91.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 6248 samples with 1 evaluation.
 Range (min  max):  154.994 μs  175.292 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     156.067 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   156.352 μs ±   1.123 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▇█▆▂          ▁▃▃▃▃▂▁                           ▁
  ▃▃▄▆███▇▇▇▇▇██████▇▇▇▅▅▅▅▄▅▅█████████▇███▇▆▅▅▄▄▃▅▃▁▁▁▁▁▃▁▁▆█▇ █
  155 μs        Histogram: log(frequency) by time        159 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bar26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 1994 samples with 1 evaluation.
 Range (min  max):  494.945 μs  524.828 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     496.478 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   497.447 μs ±   1.812 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▄▄██▆▄▁                              ▁▃▃▂▅ ▂▁           
  ▂▂▃▃▆▇█████████▆▅▃▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▄▅▆▆█████████▇▄▅▄▃▃▂▂▂ ▄
  495 μs           Histogram: frequency by time          500 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

EDIT:
Just to confirm, specifying broadcast=true (so that it actually has the same semantics as base broadcasting) doesn't hurt performance:

julia> function foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
           @muladd @.. broadcast=true a =
               b +
               0.1 * (
                   0.2c +
                   0.3d +
                   0.4e +
                   0.5f +
                   0.6g +
                   0.6h +
                   0.6i +
                   0.6j +
                   0.6k +
                   0.6l +
                   0.6m +
                   0.6n +
                   0.6o +
                   0.6p +
                   0.6q +
                   0.6r +
                   0.6s +
                   0.6t +
                   0.6u +
                   0.6v +
                   0.6w +
                   0.6x +
                   0.6y +
                   0.6z
               )
           nothing
       end
foo26 (generic function with 1 method)

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @muladd @.. broadcast=true a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> @benchmark foo9($a, $b, $c, $d, $e, $f, $g, $h, $i)BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  44.795 μs   55.059 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     45.219 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.327 μs ± 479.755 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃▄▅▆▇█████▇▆▅▄▄▃▃▂▂▁▂▁▂▁▁                      ▁▁▁▁▁  ▁▁   ▃
  ▆███████████████████████████▇▇▅▆▁▃▁▃▁▁▃▁▄▁▃▅▅▆▇████████████▇ █
  44.8 μs       Histogram: log(frequency) by time      47.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark foo26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 6227 samples with 1 evaluation.
 Range (min  max):  154.880 μs  178.189 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     156.287 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   156.849 μs ±   1.892 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁ ▂█▄  ▁▄▃▁▁     ▁▂                                        ▁
  ▃▄███████▆██████▆▅▆▄██▃▃▃▆▇▅▄▄▄▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▄▅▆▅▆▅▅▅▅▇▆ █
  155 μs        Histogram: log(frequency) by time        168 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Meaning that this is an implementation deficiency in base broadcasting.

@chriselrod chriselrod reopened this Aug 28, 2023
@N5N3
Copy link
Member

N5N3 commented Aug 29, 2023

IIUC, FastBroadcast has a fast branch for non-broadcast cases (args with identity axes).
Once we try some "real" broadcasting, FastBroadcast might become (slightly) slower than Base.broadcast

function foo9(a, b, c, d, e, f, g, h, i)
    @. @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
    nothing
end

function bar9(a, b, c, d, e, f, g, h, i)
    @.. broadcast = true @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
    nothing
end

a, b, c, d, e, f, g, h, i = [rand(1024) for i = 1:26];

julia> @benchmark foo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.150 μs   18.840 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.210 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.348 μs ± 506.534 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅█▅▂▂▁ ▁    ▂▂▁▁▁ ▁   ▁  ▂ ▃  ▃  ▂  ▁                      ▂
  ██████████▇█████████▇████▅█▇██▅██▅█▆▅█▄▄▆▄▂▄▃▄▄▄▄▄▄▂▅▄▅▄▂▃▇ █
  1.15 μs      Histogram: log(frequency) by time      2.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bar9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min  max):  2.422 μs  340.733 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.600 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.969 μs ±   3.826 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▂█▃▃▂▅▃▂▁▂▁▁ ▁▁ ▁                ▁                         ▁
  ████████████████▇█▇▅▇▅█▅█▅█▅▅█▅█▅▃█▄▃▇▃▄▃▃▄▄▂▅▄▄▄▄▄▄▃▃▄▄▅▃▅ █
  2.42 μs      Histogram: log(frequency) by time      7.16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Compared with the result before the LLVM upgrading and pass tuning, now LLVM does generate vectorization code even there are more than 3 dynamic axes1.
But apparently, it still not vectorize all axes1 combinations to avoid possible code bloat. And the vectorization might fail for high dimension cases.

@vtjnash vtjnash closed this as completed Aug 29, 2023
@vtjnash
Copy link
Member

vtjnash commented Aug 29, 2023

Thanks N5N3 for checking

@giordano
Copy link
Contributor

Is there a way to make sure this doesn't regress again? An llvm test for example?

@chriselrod
Copy link
Contributor

IIUC, FastBroadcast has a fast branch for non-broadcast cases (args with identity axes).
Once we try some "real" broadcasting, FastBroadcast might become (slightly) slower than Base.broadcast

I am aware. Getting worse performance in "real broadcasting" was a deliberate choice that I introduced myself, because I wanted to avoid excessive code generation/loop multiversioning by LLVM for cases I and the SciML ecosystem do not prioritize.

If we wanted to rice benchmarks at the cost of code generation, we could revert that and have FB dominate base broadcast's runtime performance.

@chriselrod
Copy link
Contributor

chriselrod commented Aug 29, 2023

@N5N3 the original issue title was "Broadcasting is much slower than a for loop".

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @. @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> function bar9(a, b, c, d, e, f, g, h, i)
           @.. broadcast = true @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
bar9 (generic function with 1 method)

julia> function loo9(a, b, c, d, e, f, g, h, i)
           @fastmath for ii = eachindex(a) 
             a[ii] = b[1] + 0.1 * (0.2c[ii] + 0.3d[ii] + 0.4e[ii] + 0.5f[ii] + 0.6g[ii] + 0.6h[ii] + 0.6i[ii])
           end
       end
loo9 (generic function with 1 method)

julia> a, b, c, d, e, f, g, h, i = [rand(1024) for i = 1:26];

julia> @benchmark foo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min  max):  1.560 μs   1.846 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.568 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.570 μs ± 15.111 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁█▅                                                       
  ▂▃███▆▃▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▂ ▂
  1.56 μs        Histogram: frequency by time        1.67 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark bar9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min  max):  4.245 μs    9.063 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     4.247 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.334 μs ± 541.530 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▁▂                                                        ▁
  █▁██▆▃▄▃▃▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▄▃▁▃▄▄▄▁▃▄▄▄▄▅▄▄▄▄▄▄▅▅▅▆▆▇ █
  4.25 μs      Histogram: log(frequency) by time      8.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark loo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 195 evaluations.
 Range (min  max):  490.226 ns  710.538 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     492.646 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   498.221 ns ±  13.150 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▅▂▃▂ ▂▄▄▃▂▄▅▃▂▁▂▁ ▁▃▄▃▃▁▁                                  ▂
  ███████▇████████████▇███████▆██▇█▅▅▇▆▆▆▆█▆▄▅█▇▃▅█▅▃▁▅▄▃▁▅▃▃▁▆ █
  490 ns        Histogram: log(frequency) by time        544 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@ChrisRackauckas
Copy link
Member

Reopening for regression tests.

IIUC, FastBroadcast has a fast branch for non-broadcast cases (args with identity axes).

I'd argue that Base.broadcast should get this fast path, as using broadcast for map operations is quite common and so getting a major performance hit in the standard case isn't great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection compiler:simd instruction-level vectorization performance Must go faster regression Regression in behavior compared to a previous version
Projects
None yet
Development

No branches or pull requests