Broadcasting is much slower than a for loop #28126
Maybe related to #27988? |
There is no inference problem; the @inbounds loop manages to SIMD while the broadcast doesn't. Removing the @inbounds makes them equally fast. |
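One way to check a claim like this (a sketch, not from the thread) is to search the function's generated LLVM IR for packed vector types such as `<4 x double>`:

```julia
# Sketch (not from the thread): detect whether a loop vectorized by
# searching the generated LLVM IR for packed vector types.
using InteractiveUtils  # provides code_llvm

function simd_loop!(a, b)
    @inbounds @simd for i in eachindex(a, b)
        a[i] = 2b[i]
    end
    return a
end

ir = sprint(code_llvm, simd_loop!, Tuple{Vector{Float64},Vector{Float64}})
vectorized = occursin(r"<\d+ x double>", ir)
```

Whether `vectorized` ends up true depends on the CPU and the Julia/LLVM version, which is why no fixed output is shown here.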
No, I don't think so. After applying this diff and rebuilding Julia, I got the same result.
julia> @btime foo($a,$b,$c,$d,$e)
1.274 μs (0 allocations: 0 bytes)
julia> @btime goo($a,$b,$c,$d,$e)
348.897 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 0.7.0-beta2.12
Commit a878341* (2018-07-15 15:57 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
JULIA_PKG3_PRECOMPILE = 1 |
Don't dot-ops do operations inbounds? |
Maybe related to #22255?
julia> using BenchmarkTools
julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##5#6"))}(getfield(Main, Symbol("##5#6"))(), 1:5)
julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}) where T
@. a = b + 0.2c + 0.3d
nothing
end
foo (generic function with 1 method)
julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}) where T
@assert length(a) == length(b) == length(c) == length(d)
@inbounds for i in eachindex(a)
a[i] = b[i] + 0.2c[i] + 0.3d[i]
end
nothing
end
goo (generic function with 1 method)
julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
@. a = b + 0.2c + 0.3d + 0.4e
nothing
end
foo (generic function with 2 methods)
julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
@assert length(a) == length(b) == length(c) == length(d) == length(e)
@inbounds for i in eachindex(a)
a[i] = b[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
end
nothing
end
goo (generic function with 2 methods)
julia> @btime foo($a,$b,$c,$d)
168.767 ns (0 allocations: 0 bytes)
julia> @btime goo($a,$b,$c,$d)
166.587 ns (0 allocations: 0 bytes)
julia> @btime foo($a,$b,$c,$d,$e)
1.135 μs (0 allocations: 0 bytes)
julia> @btime goo($a,$b,$c,$d,$e)
342.142 ns (0 allocations: 0 bytes) |
Looks like LLVM is having some trouble with the array bounds in the broadcast example:
|
If I'm reading this correctly, the array pointer load is not getting LICM'ed. @vtjnash was looking at something related recently:
|
Actually, my bad, I accidentally dropped TBAA. |
Ok, the array bounds are still the issue though. Not sure why LLVM is unable to find them. We should probably fix that. However, there is a larger issue here. We go through all this trouble on the julia side to unalias the destination array from any of the arguments, but we don't actually tell LLVM that we did that, so it tries to insert a dynamic memory range check for which it then can't find the bounds. The better fix here is probably to figure out how to inform LLVM that the pointers don't alias. |
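The unaliasing described above is visible from the Julia side through `Base.mightalias` and `Base.unalias`; a minimal sketch of what broadcast does with its arguments before the loop runs (the point of the comment above being that this knowledge is currently not forwarded to LLVM):

```julia
# Sketch: broadcast conservatively copies any argument that may share
# memory with the destination, using Base's aliasing helpers.
a = rand(4)
v = view(a, 2:3)            # shares memory with a

Base.mightalias(a, v)       # true: v may overlap a
w = Base.unalias(a, v)      # returns a copy, disjoint from a
Base.mightalias(a, w)       # false
```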
If this is the case, why? Why not assert that the sizes match and then just use an @inbounds loop? |
I don't know the details of how broadcasting is implemented (and all the LLVM interfacing is way beyond me), but as Keno elaborated above, LLVM isn't aware that the arrays aren't aliased while broadcasting, and therefore doesn't auto-vectorize (if b[i+1] referred to the same element as a[i], vectorizing would change the result). |
Would #19658 be a good solution? |
I think having
I think that's why we now have reinterpret and reshaped arrays, to try and make it so whenever … I thought we were already there, but I realized:
julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
@. a = b + 0.2c + 0.3d + 0.4e
nothing
end
foo (generic function with 1 method)
julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
@assert length(a) == length(b) == length(c) == length(d) == length(e)
@inbounds for i in eachindex(a)
a[i] = b[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
end
nothing
end
goo (generic function with 1 method)
julia> c = randn(80); d = randn(80); e = randn(80);
julia> x1 = randn(81);
julia> x2 = copy(x1); x3 = copy(x1); #x3 backup, if we want to double check answers later
julia> px1 = pointer(x1); px2 = pointer(x2);
julia> a1 = unsafe_wrap(Vector{Float64}, px1 + 8, 80); # a[i] = b[i+1]
julia> a2 = unsafe_wrap(Vector{Float64}, px2 + 8, 80);
julia> b1 = unsafe_wrap(Vector{Float64}, px1, 80);
julia> b2 = unsafe_wrap(Vector{Float64}, px2, 80);
julia> foo(a1, b1, c, d, e)
julia> goo(a2, b2, c, d, e)
julia> a1'
1×80 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
0.805164 -0.134631 -0.696007 -0.39507 0.139368 0.209027 0.992377 … -0.288711 -0.417188 -0.330905 -0.518275 -0.0962622 0.0562025
julia> a2'
1×80 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
0.805164 -0.134631 -0.696007 -0.39507 0.139368 0.209027 0.992377 … -0.288711 -0.417188 -0.330905 -0.518275 -0.0962622 0.0562025
but they agreed! Benchmarking also supports that they got caught by a runtime check:
julia> using BenchmarkTools
julia> @btime foo($a1, $b1, $c, $d, $e)
304.583 ns (0 allocations: 0 bytes)
julia> @btime goo($a2, $b2, $c, $d, $e)
284.393 ns (0 allocations: 0 bytes)
julia> a3 = randn(80); b3 = randn(80);
julia> @btime foo($a3, $b3, $c, $d, $e)
89.477 ns (0 allocations: 0 bytes)
julia> @btime goo($a3, $b3, $c, $d, $e)
24.370 ns (0 allocations: 0 bytes)
julia> function goo2(a::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
@assert length(a) - 1 == length(c) == length(d) == length(e)
# Make aliasing obvious
@inbounds for i in eachindex(c)
a[i+1] = a[i] + 0.2c[i] + 0.3d[i] + 0.4e[i]
end
nothing
end
goo2 (generic function with 1 method)
julia> a4 = randn(81);
julia> @btime goo2($a4, $c, $d, $e)
192.453 ns (0 allocations: 0 bytes)
The clearly aliasing version was much faster (192 ns vs. 284 ns) than the disguised alias, though both are of course much slower than no aliasing at all. What is left before the compiler can be promised that different arrays never alias? Will that happen, and how much does it matter (i.e., what is the cost of the runtime checking)? |
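One likely reason the unsafe_wrap aliasing above is "disguised": for a plain Array, `Base.dataids` is derived from the data pointer, so two overlapping wrappers whose pointers differ by an offset look unrelated to `Base.mightalias`, and only LLVM's dynamic memory-range check catches the overlap. A sketch under that assumption:

```julia
# Sketch: overlapping unsafe_wrap'ed arrays defeat Julia's conservative
# aliasing check (Base.mightalias), despite sharing memory.
x = randn(81)   # must stay alive while the wrappers are in use
b = unsafe_wrap(Vector{Float64}, pointer(x), 80)
a = unsafe_wrap(Vector{Float64}, pointer(x) + 8, 80)  # a[i] overlaps b[i+1]

a[1] === b[2]               # true: same memory location
Base.mightalias(a, b)       # false, despite the overlap
```

In real code the wrappers should be used inside `GC.@preserve x` to keep `x` rooted.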
I'm not sure – this used to roughly be the meaning of the |
julia> a = rand(2,2); b = vec(a);
julia> isa(a, Array) && isa(b, Array) && !(a === b)
true
julia> a[1,1] = 2
2
julia> b[1]
2.0 |
That's a much less convoluted example than the unsafe_wraps.
julia> x = rand(2,2,2);
julia> reshape(x,2,4)
2×4 Array{Float64,2}:
0.575181 0.154845 0.845574 0.772895
0.722177 0.329466 0.258449 0.718844
I realize now the ReshapedArray type only applies to AbstractArrays? I was thinking of this comment: https://discourse.julialang.org/t/use-of-pointer/8849/9 |
Interestingly enough, when the vectors are much larger (10k× those used by @YingboMa in the original benchmarks), broadcasting and the for loop are equally fast (or slow?):
julia> a,b,c,d,e=(rand(10_000_000) for i in 1:5)
Base.Generator{UnitRange{Int64}, var"#28#29"}(var"#28#29"(), 1:5)
julia> @btime foo($a,$b,$c,$d,$e)
32.563 ms (0 allocations: 0 bytes)
julia> @btime goo($a,$b,$c,$d,$e)
32.485 ms (0 allocations: 0 bytes) |
Yeah, when the data is large, performance is bounded by memory bandwidth, not compute. |
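A back-of-the-envelope check of the memory-bound claim (my arithmetic, not from the thread): the five-argument kernel touches five Float64 vectors per call, so with 10^7 elements each call moves roughly 0.4 GB:

```julia
# Rough bandwidth implied by the 10-million-element benchmark above.
n        = 10_000_000        # elements per vector
nvectors = 5                 # one written (a) plus four read (b..e)
bytes    = n * nvectors * 8  # sizeof(Float64) == 8
seconds  = 32.5e-3           # measured time per call (from the thread)
gb_per_s = bytes / seconds / 1e9
```

At roughly 12 GB/s this sits in ordinary DRAM-bandwidth territory, consistent with the broadcast and loop versions timing the same.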
You can see this by comparing with benchmarks of shorter vectors:
julia> a,b,c,d,e=(rand(10_000_000) for i in 1:5);
julia> @btime foo($a,$b,$c,$d,$e)
28.240 ms (0 allocations: 0 bytes)
julia> @btime goo($a,$b,$c,$d,$e)
25.086 ms (0 allocations: 0 bytes)
julia> a,b,c,d,e=(rand(512) for i in 1:5);
julia> @btime foo($a,$b,$c,$d,$e)
648.171 ns (0 allocations: 0 bytes)
julia> @btime goo($a,$b,$c,$d,$e)
73.751 ns (0 allocations: 0 bytes)
julia> ((28.24e-3, 25.086e-3) ./ (648.171e-9, 73.751e-9)) ./ (10_000_000 / 512)
(2.230719979758428, 17.41540046914618)
So by shortening the vectors from 10 million to 512 elements, I get 2x and 17x faster results relative to the number of elements. |
Yeah, after writing the message I realised that I was getting very bad scaling compared to the initial 1000-element example (10k× more elements but 22k× slower, so the same 2x factor you're seeing), but I didn't think that performance was actually bound by memory. I should probably have gone to sleep last night instead of doing dumb benchmarks 😆 |
@YingboMa Close? |
This is not resolved. |
To resolve this issue for most cases, we'd need to take a FastBroadcast-like approach of generating at least two versions of the loop explicitly, so that fast code can be generated for at least 1 case. |
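The two-version idea can be sketched with a hypothetical helper (not FastBroadcast's actual implementation): check up front that every argument shares the destination's axes, run a tight @inbounds @simd loop if so, and otherwise fall back to generic broadcasting:

```julia
# Hypothetical sketch of a FastBroadcast-style fast path.
function add3!(dest, b, c, d)
    if axes(dest) == axes(b) == axes(c) == axes(d)
        # Fast path: shapes match, so indexing is provably in bounds.
        @inbounds @simd for i in eachindex(dest)
            dest[i] = b[i] + 0.2c[i] + 0.3d[i]
        end
    else
        # Slow path: defer to Base broadcasting for mismatched shapes
        # (scalars, different sizes, non-Array arguments, ...).
        dest .= b .+ 0.2 .* c .+ 0.3 .* d
    end
    return dest
end
```

The fast path avoids the per-element machinery that keeps LLVM from hoisting the bounds, at the cost of compiling two loop bodies.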
That's what #30973 does. |
|
Is this fixed on v1.10 as well, or is it scheduled to be backported? EDIT: it seems to be fixed.
julia> using BenchmarkTools
julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
@. a = b + 0.1 * (0.2c + 0.3d + 0.4e)
nothing
end
foo (generic function with 1 method)
julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
@assert length(a) == length(b) == length(c) == length(d) == length(e)
@inbounds for i in eachindex(a)
a[i] = b[i] + 0.1 * (0.2c[i] + 0.3d[i] + 0.4e[i])
end
nothing
end
goo (generic function with 1 method)
julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64}, var"#1#2"}(var"#1#2"(), 1:5)
julia> @btime foo($a,$b,$c,$d,$e)
406.280 ns (0 allocations: 0 bytes)
julia> @btime goo($a,$b,$c,$d,$e)
404.220 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.10.0-beta2
Commit a468aa198d0 (2023-08-17 06:27 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 1 on 8 virtual cores
Environment:
LD_LIBRARY_PATH = :/usr/lib/x86_64-linux-gnu/gtk-3.0/modules
JULIA_EDITOR = subl |
Looking at FastBroadcast.jl's vectorization tests (…):
julia> using MuladdMacro, FastBroadcast
julia> function foo9(a, b, c, d, e, f, g, h, i)
@muladd @.. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
foo9 (generic function with 1 method)
julia> function foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
@muladd @.. a =
b +
0.1 * (
0.2c +
0.3d +
0.4e +
0.5f +
0.6g +
0.6h +
0.6i +
0.6j +
0.6k +
0.6l +
0.6m +
0.6n +
0.6o +
0.6p +
0.6q +
0.6r +
0.6s +
0.6t +
0.6u +
0.6v +
0.6w +
0.6x +
0.6y +
0.6z
)
nothing
end
foo26 (generic function with 1 method)
julia> function bar9(a, b, c, d, e, f, g, h, i)
@muladd @. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
bar9 (generic function with 1 method)
julia> function bar26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
@muladd @. a =
b +
0.1 * (
0.2c +
0.3d +
0.4e +
0.5f +
0.6g +
0.6h +
0.6i +
0.6j +
0.6k +
0.6l +
0.6m +
0.6n +
0.6o +
0.6p +
0.6q +
0.6r +
0.6s +
0.6t +
0.6u +
0.6v +
0.6w +
0.6x +
0.6y +
0.6z
)
nothing
end
bar26 (generic function with 1 method)
julia> a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z =
[rand(100, 100, 2) for i = 1:26];
julia> foo9(a, b, c, d, e, f, g, h, i)
julia> foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
julia> bar9(a, b, c, d, e, f, g, h, i)
julia> bar26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
julia> @benchmark foo9($a, $b, $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 44.698 μs … 100.563 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 45.143 μs ┊ GC (median): 0.00%
Time (mean ± σ): 45.254 μs ± 727.554 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄▆▇█▆▃▁
▂▂▂▃▄▆████████▇▅▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
44.7 μs Histogram: frequency by time 47.2 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark bar9($a, $b, $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 88.260 μs … 105.204 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 88.657 μs ┊ GC (median): 0.00%
Time (mean ± σ): 88.839 μs ± 651.841 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▅▇▇██▇▇▆▄▂ ▁ ▁▁▂▂▂▂▂▂▁▁ ▃
▆██████████████▇▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▅▄▇▇▇▇████▇████████████ █
88.3 μs Histogram: log(frequency) by time 91.1 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark foo26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 6248 samples with 1 evaluation.
Range (min … max): 154.994 μs … 175.292 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 156.067 μs ┊ GC (median): 0.00%
Time (mean ± σ): 156.352 μs ± 1.123 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▇█▆▂ ▁▃▃▃▃▂▁ ▁
▃▃▄▆███▇▇▇▇▇██████▇▇▇▅▅▅▅▄▅▅█████████▇███▇▆▅▅▄▄▃▅▃▁▁▁▁▁▃▁▁▆█▇ █
155 μs Histogram: log(frequency) by time 159 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark bar26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 1994 samples with 1 evaluation.
Range (min … max): 494.945 μs … 524.828 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 496.478 μs ┊ GC (median): 0.00%
Time (mean ± σ): 497.447 μs ± 1.812 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄▄██▆▄▁ ▁▃▃▂▅ ▂▁
▂▂▃▃▆▇█████████▆▅▃▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▄▅▆▆█████████▇▄▅▄▃▃▂▂▂ ▄
495 μs Histogram: frequency by time 500 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
EDIT:
julia> function foo26(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z)
@muladd @.. broadcast=true a =
b +
0.1 * (
0.2c +
0.3d +
0.4e +
0.5f +
0.6g +
0.6h +
0.6i +
0.6j +
0.6k +
0.6l +
0.6m +
0.6n +
0.6o +
0.6p +
0.6q +
0.6r +
0.6s +
0.6t +
0.6u +
0.6v +
0.6w +
0.6x +
0.6y +
0.6z
)
nothing
end
foo26 (generic function with 1 method)
julia> function foo9(a, b, c, d, e, f, g, h, i)
@muladd @.. broadcast=true a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
foo9 (generic function with 1 method)
julia> @benchmark foo9($a, $b, $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 44.795 μs … 55.059 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 45.219 μs ┊ GC (median): 0.00%
Time (mean ± σ): 45.327 μs ± 479.755 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▄▅▆▇█████▇▆▅▄▄▃▃▂▂▁▂▁▂▁▁ ▁▁▁▁▁ ▁▁ ▃
▆███████████████████████████▇▇▅▆▁▃▁▃▁▁▃▁▄▁▃▅▅▆▇████████████▇ █
44.8 μs Histogram: log(frequency) by time 47.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark foo26($a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n, $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z)
BenchmarkTools.Trial: 6227 samples with 1 evaluation.
Range (min … max): 154.880 μs … 178.189 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 156.287 μs ┊ GC (median): 0.00%
Time (mean ± σ): 156.849 μs ± 1.892 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▂█▄ ▁▄▃▁▁ ▁▂ ▁
▃▄███████▆██████▆▅▆▄██▃▃▃▆▇▅▄▄▄▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▄▅▆▅▆▅▅▅▅▇▆ █
155 μs Histogram: log(frequency) by time 168 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Meaning that this is an implementation deficiency in base broadcasting. |
IIUC,
function foo9(a, b, c, d, e, f, g, h, i)
@. @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
function bar9(a, b, c, d, e, f, g, h, i)
@.. broadcast = true @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
a, b, c, d, e, f, g, h, i = [rand(1024) for i = 1:26];
julia> @benchmark foo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.150 μs … 18.840 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.210 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.348 μs ± 506.534 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅█▅▂▂▁ ▁ ▂▂▁▁▁ ▁ ▁ ▂ ▃ ▃ ▂ ▁ ▂
██████████▇█████████▇████▅█▇██▅██▅█▆▅█▄▄▆▄▂▄▃▄▄▄▄▄▄▂▅▄▅▄▂▃▇ █
1.15 μs Histogram: log(frequency) by time 2.69 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark bar9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.422 μs … 340.733 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.600 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.969 μs ± 3.826 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▂█▃▃▂▅▃▂▁▂▁▁ ▁▁ ▁ ▁ ▁
████████████████▇█▇▅▇▅█▅█▅█▅▅█▅█▅▃█▄▃▇▃▄▃▃▄▄▂▅▄▄▄▄▄▄▃▃▄▄▅▃▅ █
2.42 μs Histogram: log(frequency) by time 7.16 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Compared with the result before the LLVM upgrade and pass tuning, LLVM now generates vectorized code even when there are more than 3 dynamic … |
Thanks N5N3 for checking |
Is there a way to make sure this doesn't regress again? An llvm test for example? |
I am aware. Getting worse performance in "real broadcasting" was a deliberate choice that I introduced myself, because I wanted to avoid excessive code generation/loop multiversioning by LLVM for cases I and the SciML ecosystem do not prioritize. If we wanted to rice benchmarks at the cost of code generation, we could revert that and have FB dominate base broadcast's runtime performance. |
@N5N3 the original issue title was "Broadcasting is much slower than a for loop".
julia> function foo9(a, b, c, d, e, f, g, h, i)
@. @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
foo9 (generic function with 1 method)
julia> function bar9(a, b, c, d, e, f, g, h, i)
@.. broadcast = true @fastmath a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
bar9 (generic function with 1 method)
julia> function loo9(a, b, c, d, e, f, g, h, i)
@fastmath for ii = eachindex(a)
a[ii] = b[1] + 0.1 * (0.2c[ii] + 0.3d[ii] + 0.4e[ii] + 0.5f[ii] + 0.6g[ii] + 0.6h[ii] + 0.6i[ii])
end
end
loo9 (generic function with 1 method)
julia> a, b, c, d, e, f, g, h, i = [rand(1024) for i = 1:26];
julia> @benchmark foo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.560 μs … 1.846 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.568 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.570 μs ± 15.111 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁█▅
▂▃███▆▃▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▂ ▂
1.56 μs Histogram: frequency by time 1.67 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark bar9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.245 μs … 9.063 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.247 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.334 μs ± 541.530 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁▂ ▁
█▁██▆▃▄▃▃▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▄▃▁▃▄▄▄▁▃▄▄▄▄▅▄▄▄▄▄▄▅▅▅▆▆▇ █
4.25 μs Histogram: log(frequency) by time 8.03 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark loo9($a, $[1.], $c, $d, $e, $f, $g, $h, $i)
BenchmarkTools.Trial: 10000 samples with 195 evaluations.
Range (min … max): 490.226 ns … 710.538 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 492.646 ns ┊ GC (median): 0.00%
Time (mean ± σ): 498.221 ns ± 13.150 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▇▅▂▃▂ ▂▄▄▃▂▄▅▃▂▁▂▁ ▁▃▄▃▃▁▁ ▂
███████▇████████████▇███████▆██▇█▅▅▇▆▆▆▆█▆▄▅█▇▃▅█▅▃▁▅▄▃▁▅▃▃▁▆ █
490 ns Histogram: log(frequency) by time 544 ns <
Memory estimate: 0 bytes, allocs estimate: 0. |
Reopening for regression tests.
I'd argue that Base.broadcast should get this fast path, as using broadcast for map-like operations is quite common, and a major performance hit in the standard case isn't great. |
Here is a minimal working example.