Speed up copy!(dest, Rdest, src, Rsrc) by splitting out first dimension #20944
Conversation
The bottleneck in my view is the parsing of the indices in the `@generated` function:

```julia
@generated function copy!{T,N}(dest::AbstractArray{T,N}, Rdest::CartesianRange{CartesianIndex{N}}, src::AbstractArray{T,N}, Rsrc::CartesianRange{CartesianIndex{N}})
    quote
        isempty(Rdest) && return dest
        if size(Rdest) != size(Rsrc)
            throw(ArgumentError("source and destination must have same size (got $(size(Rsrc)) and $(size(Rdest)))"))
        end
        @boundscheck checkbounds(dest, Rdest.start)
        @boundscheck checkbounds(dest, Rdest.stop)
        @boundscheck checkbounds(src, Rsrc.start)
        @boundscheck checkbounds(src, Rsrc.stop)
        ΔI = Rdest.start - Rsrc.start
        @nloops $N i (n->Rsrc.start[n]-start(indices(Rsrc)[n])+indices(Rsrc)[n]) begin
            @inbounds @nref($N,dest,n->i_n+ΔI[n]) = @nref($N,src,i)
        end
        dest
    end
end
```

With the timings:

```julia
In[34]: precompile(copy!, map(typeof, (B, R, A, R)))
        @benchmark copy!($B, $R, $A, $R) seconds=1
Out[34]: BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     373.251 μs (0.00% GC)
  median time:      383.829 μs (0.00% GC)
  mean time:        402.241 μs (0.00% GC)
  maximum time:     730.292 μs (0.00% GC)
  --------------
  samples:          2370
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
```
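For readers unfamiliar with `Base.Cartesian`, here is a minimal self-contained illustration of the `@nloops`/`@nref` macros that version relies on (a hypothetical `sum2d`, fixed at N=2; not code from the PR):

```julia
using Base.Cartesian

# A minimal 2-d illustration of the @nloops/@nref macros:
# @nloops expands to ordinary nested loops, @nref to ordinary indexing.
function sum2d(A::AbstractMatrix)
    s = zero(eltype(A))
    @nloops 2 i A begin      # roughly: for i_2 in axes(A,2), i_1 in axes(A,1)
        s += @nref(2, A, i)  # roughly: A[i_1, i_2]
    end
    return s
end

sum2d([1 2; 3 4])  # 10
```

In the `@generated` function above, `$N` lets the same pattern be emitted for arbitrary dimensionality.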

```julia
In[48]: precompile(Base.copy!, map(typeof, (B, R, A, R)))
        @benchmark Base.copy!($B, $R, $A, $R) seconds=1
Out[48]: BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.665 ms (0.00% GC)
  median time:      2.768 ms (0.00% GC)
  mean time:        2.824 ms (0.00% GC)
  maximum time:     3.380 ms (0.00% GC)
  --------------
  samples:          351
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
```

There is no
Your version has the same performance as mine when I run yours on my machine. Are you comparing timings on your machine versus timings on my machine? Or are you genuinely faster on yours? There's a push to get rid of
While this is true, this seems like a perfect usage of a non-dangerous
If I correct for the difference in the

Overall I do think my solution is much more compact and straightforward, as it tackles the problem of the speed difference between

That fundamental problem could be fixed by defining
It's parsed at compile time.

```julia
julia> A = rand(5,5,5);

julia> I = CartesianIndex(5,5,5);

julia> foo1(A, I) = (@inbounds ret = A[I]; ret)
foo1 (generic function with 1 method)

julia> foo3(A, i, j, k) = (@inbounds ret = A[i,j,k]; ret)
foo3 (generic function with 1 method)

julia> @code_native foo1(A, I)
	.text
Filename: REPL[3]
	pushq	%rbp
	movq	%rsp, %rbp
	movq	(%rdi), %rax
Source line: 1
	movq	8(%rsi), %rcx
	movq	16(%rsi), %rdx
	addq	$-1, %rdx
	imulq	32(%rdi), %rdx
	leaq	-1(%rcx,%rdx), %rcx
	imulq	24(%rdi), %rcx
	addq	(%rsi), %rcx
	vmovsd	-8(%rax,%rcx,8), %xmm0  # xmm0 = mem[0],zero
	popq	%rbp
	retq
	nopl	(%rax)

julia> @code_native foo3(A, 5, 5, 5)
	.text
Filename: REPL[4]
	pushq	%rbp
	movq	%rsp, %rbp
	movq	(%rdi), %rax
Source line: 1
	addq	$-1, %rcx
	imulq	32(%rdi), %rcx
	leaq	-1(%rdx,%rcx), %rcx
	imulq	24(%rdi), %rcx
	addq	%rsi, %rcx
	vmovsd	-8(%rax,%rcx,8), %xmm0  # xmm0 = mem[0],zero
	popq	%rbp
	retq
	nopw	%cs:(%rax,%rax)
```

Are you sure it's faster? Maybe build this branch to be certain? That said, the Base.Cartesian solution does circumvent #9080 completely, not just for the first dimension. It would matter if the first dimension is tiny.
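As a quick semantic check of the equivalence the near-identical native code suggests (reusing the `foo1`/`foo3` definitions from the session above), indexing with a `CartesianIndex` and with the unpacked integers retrieves the same element:

```julia
# Same definitions as in the REPL session above; CartesianIndex indexing
# and plain integer indexing should agree element-for-element.
foo1(A, I) = (@inbounds ret = A[I]; ret)
foo3(A, i, j, k) = (@inbounds ret = A[i,j,k]; ret)

A = rand(5, 5, 5)
foo1(A, CartesianIndex(2, 3, 4)) == foo3(A, 2, 3, 4)  # true
```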
An oversight that's definitely issue-worthy (would be nicer than the
One problem, though:

```julia
julia> A = rand(3,5);

julia> R = CartesianRange(indices(A));

julia> indices(A)
(Base.OneTo(3), Base.OneTo(5))

julia> indices(R)
(1:3, 1:5)
```

So we lose the type information. Consequence:

```julia
julia> similar(A, indices(A))
3×5 Array{Float64,2}:
 6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310
 6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310  6.90334e-310
 6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310  0.0

julia> similar(A, indices(R))
ERROR: MethodError: no method matching similar(::Array{Float64,2}, ::Type{Float64}, ::Tuple{UnitRange{Int64},UnitRange{Int64}})
Closest candidates are:
  similar(::Array{T,2}, ::Type) where T at array.jl:179
  similar(::Array, ::Type, ::Tuple{Vararg{Int64,N}}) where N at array.jl:181
  similar(::AbstractArray, ::Type{T}) where T at abstractarray.jl:507
  ...
Stacktrace:
 [1] similar(::Array{Float64,2}, ::Tuple{UnitRange{Int64},UnitRange{Int64}}) at ./abstractarray.jl:508

julia> using OffsetArrays

julia> similar(A, indices(R))
OffsetArrays.OffsetArray{Float64,2,Array{Float64,2}} with indices 1:3×1:5:
 6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310
 6.90334e-310  6.90334e-310  6.90334e-310  6.90335e-310  6.90335e-310
 6.90335e-310  6.90334e-310  6.90335e-310  6.90335e-310  6.90335e-310
```

I think we should rewrite
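The representation change being suggested here — storing the axis ranges themselves instead of start/stop `CartesianIndex`es — could look something like the following sketch (hypothetical `RangeCartesian`/`rangeindices` names, purely illustrative):

```julia
# Hypothetical sketch: a range type that stores the axis ranges directly,
# so the indices accessor can return them unchanged (Base.OneTo stays
# Base.OneTo instead of decaying to UnitRange).
struct RangeCartesian{N,R<:Tuple{Vararg{AbstractUnitRange{Int},N}}}
    indices::R
end
RangeCartesian(inds::Tuple{Vararg{AbstractUnitRange{Int}}}) =
    RangeCartesian{length(inds),typeof(inds)}(inds)

rangeindices(R::RangeCartesian) = R.indices   # no type information lost

R = RangeCartesian((Base.OneTo(3), Base.OneTo(5)))
rangeindices(R)  # (Base.OneTo(3), Base.OneTo(5))
```

With this layout, `similar(A, rangeindices(R))` would dispatch on the preserved range types rather than on `UnitRange`.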
@timholy Since

On the original topic, maybe the speed of
Not sure I fully understand your point. I should clarify that I mean changing it to something like:
It would still have its own
That's what that #9080 issue I've linked to is about. As discussed in greater detail (#16035 (comment)), the fundamental problem is that
Maybe more precisely:
Yes, the second. A CartesianRange is not itself indexable.
Then
Agreed. It does work if you use
Yes
Agreed that in cases where
One more point: inlining `+` for `CartesianIndex` makes a huge difference:

```julia
In[3]: @benchmark for i in 1:640000; +($(CartesianIndex((1,2,3))), $(CartesianIndex((1,2,3)))); end
Out[3]: BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     984.923 μs (0.00% GC)
  median time:      984.943 μs (0.00% GC)
  mean time:        1.021 ms (0.00% GC)
  maximum time:     1.949 ms (0.00% GC)
  --------------
  samples:          4880
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In[8]: import Base.+
       @inline (+){N}(index1::CartesianIndex{N}, index2::CartesianIndex{N}) = CartesianIndex{N}(map(+, index1.I, index2.I))
WARNING: Method definition +(Base.IteratorsMD.CartesianIndex{#N<:Any}, Base.IteratorsMD.CartesianIndex{#N<:Any}) in module IteratorsMD at multidimensional.jl:52 overwritten at In[8]:2.
Out[8]: + (generic function with 163 methods)

In[9]: @benchmark for i in 1:640000; +($(CartesianIndex((1,2,3))), $(CartesianIndex((1,2,3)))); end
Out[9]: BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.297 ns (0.00% GC)
  median time:      1.546 ns (0.00% GC)
  mean time:        1.505 ns (0.00% GC)
  maximum time:     16.189 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
```
Note that the time here is equivalent to a couple of instructions, which is not possible if it's actually executing that loop. With inlining on, the compiler is smart enough to realize that you're not doing anything with the result of that loop, so it helpfully elides it for you 😄. Nevertheless, good catch that these operations should be inlined. This is a good day for multidimensional stuff. 😄
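The elision effect described above can be illustrated with a sketch (hypothetical `sum_offsets` name): once `+` inlines, a loop whose result is discarded can be deleted wholesale, so a meaningful microbenchmark has to feed each result into the next iteration:

```julia
# If the result of `a + b` is unused, an inlined `+` lets the compiler
# delete the whole loop. Threading the result through an accumulator
# forces the additions to actually happen.
function sum_offsets(start::CartesianIndex, step::CartesianIndex, n::Integer)
    acc = start
    for _ in 1:n
        acc += step        # each iteration depends on the previous one
    end
    return acc
end

sum_offsets(CartesianIndex(0,0,0), CartesianIndex(1,2,3), 4)  # CartesianIndex(4, 8, 12)
```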
Fuddlesticks, but 1ms is way too long for something which is a part of something that can be done in 400µs. At least that makes sense^^. What are we down to in the
@JKrehl, see what you think of this. I credited you with the two most important aspects of this.
That's great. Should there be a note near the commented out code that it is to be fixed when #9080 is fixed?
Note added. I'll wait a day or so for any other commentary before merging.
BTW, I no longer think that

I still think we need to change the internal representation, though.
@timholy Alright then. Defining a
don't know how well this is covered by existing benchmarks, but should really run @nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
Not covered, but there were also no regressions.
I noticed something also reported by @JKrehl: our "block copier"

```julia
copy!(dest, Rdest, src, Rsrc)
```

(where `Rdest` and `Rsrc` are `CartesianRange`s) is slow. Presumably #9080 has something to do with it.

Master:

This PR:

So approximately 4x faster for a 3d array.

I also decided to write a docstring for this method, and to explicitly test it. (I guess previously it had been tested only indirectly.)
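The idea named in the PR title — walk the first dimension with a plain integer loop and use `CartesianIndex` only for the trailing dimensions — can be sketched roughly as follows (a hypothetical `blockcopy!` written with post-0.7 names such as `CartesianIndices`; this is not the PR's actual code):

```julia
# Sketch of the "split out the first dimension" idea. The first dimension
# gets a fast integer loop; CartesianIndex arithmetic only runs once per
# column, sidestepping per-element index-splatting overhead (#9080).
function blockcopy!(dest::AbstractArray{<:Any,N}, Rdest::CartesianIndices{N},
                    src::AbstractArray{<:Any,N}, Rsrc::CartesianIndices{N}) where N
    isempty(Rdest) && return dest
    if size(Rdest) != size(Rsrc)
        throw(ArgumentError("source and destination must have same size (got $(size(Rsrc)) and $(size(Rdest)))"))
    end
    checkbounds(dest, first(Rdest)); checkbounds(dest, last(Rdest))
    checkbounds(src, first(Rsrc));   checkbounds(src, last(Rsrc))
    Δi1 = first(Rdest)[1] - first(Rsrc)[1]             # offset along dimension 1
    irange = first(Rsrc)[1]:last(Rsrc)[1]              # dimension 1, innermost loop
    ΔItail = CartesianIndex(Base.tail(first(Rdest).I)) -
             CartesianIndex(Base.tail(first(Rsrc).I))  # offset along trailing dims
    Rtail = CartesianIndices(map(:, Base.tail(first(Rsrc).I), Base.tail(last(Rsrc).I)))
    for Itail in Rtail             # CartesianIndex only over trailing dims
        for i in irange            # plain integer loop over dimension 1
            @inbounds dest[i + Δi1, Itail + ΔItail] = src[i, Itail]
        end
    end
    return dest
end
```

Mixed indexing like `dest[i, Itail]` (an integer followed by a `CartesianIndex`) is supported, which is what lets the first dimension be peeled off cleanly.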