
Speed up copy!(dest, Rdest, src, Rsrc) by splitting out first dimension #20944

Merged — 4 commits merged into master on Mar 10, 2017

Conversation

@timholy (Member) commented Mar 8, 2017

I noticed something also reported by @JKrehl: our "block copier" copy!(dest, Rdest, src, Rsrc) (where Rdest and Rsrc are CartesianRanges) is slow. Presumably #9080 has something to do with it.

Master:

julia> A = rand(200, 200, 16);

julia> B = similar(A);

julia> R = CartesianRange(indices(A))
CartesianRange{CartesianIndex{3}}(CartesianIndex{3}((1, 1, 1)), CartesianIndex{3}((200, 200, 16)))

julia> copy!(B, R, A, R);

julia> @benchmark copy!(B, R, A, R) seconds=1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.379 ms (0.00% GC)
  median time:      3.499 ms (0.00% GC)
  mean time:        3.591 ms (0.00% GC)
  maximum time:     6.628 ms (0.00% GC)
  --------------
  samples:          273
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This PR:

julia> @benchmark copy!(B, R, A, R) seconds=1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     688.652 μs (0.00% GC)
  median time:      838.804 μs (0.00% GC)
  mean time:        857.291 μs (0.00% GC)
  maximum time:     1.689 ms (0.00% GC)
  --------------
  samples:          1074
  evals/sample:     1
  time tolerance:   15.00%
  memory tolerance: 1.00%

So approximately 4x faster for a 3d array.

I also decided to write a docstring for this method, and to explicitly test it. (I guess previously it had been tested only indirectly.)
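The core idea — run the innermost (first) dimension as a plain integer loop under an outer loop over the remaining dimensions — can be sketched as follows. This is a simplified, hypothetical version written with the later `CartesianIndices` API (which replaced `CartesianRange`); it is not the PR's actual code, and the helper name `blockcopy!` is made up for illustration.

```julia
# Hypothetical sketch of the "split out the first dimension" strategy.
function blockcopy!(dest::AbstractArray{T,N}, Rdest::CartesianIndices{N},
                    src::AbstractArray{T,N}, Rsrc::CartesianIndices{N}) where {T,N}
    isempty(Rdest) && return dest
    size(Rdest) == size(Rsrc) ||
        throw(ArgumentError("source and destination must have same size"))
    checkbounds(dest, first(Rdest)); checkbounds(dest, last(Rdest))
    checkbounds(src, first(Rsrc));   checkbounds(src, last(Rsrc))
    δ = (first(Rdest) - first(Rsrc)).I          # per-dimension offset, as a tuple
    irange = Rsrc.indices[1]                    # first-dimension source range
    Rtail  = CartesianIndices(Base.tail(Rsrc.indices))  # remaining dimensions
    for Itail in Rtail
        Jtail = CartesianIndex(Itail.I .+ Base.tail(δ))
        for i in irange                         # plain integer loop: one branch per iteration
            @inbounds dest[i + δ[1], Jtail] = src[i, Itail]
        end
    end
    return dest
end
```

The inner loop over `irange` is an ordinary range iteration, so only the trailing dimensions pay the cost of Cartesian iteration.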

@ararslan added labels on Mar 8, 2017: collections (Data structures holding multiple items, e.g. sets), performance (Must go faster)
@JKrehl (Contributor) commented Mar 8, 2017

In my view the bottleneck is the parsing of the indices into the arrayref, where, I suspect, the index tuple generated by iterating the CartesianRange is resolved at runtime. By unpacking the tuple and the iteration loops explicitly (at compile time), the speedup is even more substantial, and the code is shorter.

@generated function copy!{T,N}(dest::AbstractArray{T,N}, Rdest::CartesianRange{CartesianIndex{N}}, src::AbstractArray{T,N}, Rsrc::CartesianRange{CartesianIndex{N}})
    quote
        isempty(Rdest) && return dest
        if size(Rdest) != size(Rsrc)
            throw(ArgumentError("source and destination must have same size (got $(size(Rsrc)) and $(size(Rdest)))"))
        end
        @boundscheck checkbounds(dest, Rdest.start)
        @boundscheck checkbounds(dest, Rdest.stop)
        @boundscheck checkbounds(src, Rsrc.start)
        @boundscheck checkbounds(src, Rsrc.stop)
        ΔI = Rdest.start - Rsrc.start
        @nloops $N i (n->Rsrc.start[n]-start(indices(Rsrc)[n])+indices(Rsrc)[n]) begin
            @inbounds @nref($N,dest,n->i_n+ΔI[n]) = @nref($N,src,i)
        end
        dest
    end
end

With the timings:

In[34]: precompile(copy!, map(typeof, (B, R, A, R)))
@benchmark copy!($B, $R, $A, $R) seconds=1
Out[34]: BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     373.251 μs (0.00% GC)
  median time:      383.829 μs (0.00% GC)
  mean time:        402.241 μs (0.00% GC)
  maximum time:     730.292 μs (0.00% GC)
  --------------
  samples:          2370
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In[48]: precompile(Base.copy!, map(typeof, (B, R, A, R)))
@benchmark Base.copy!($B, $R, $A, $R) seconds=1
Out[48]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.665 ms (0.00% GC)
  median time:      2.768 ms (0.00% GC)
  mean time:        2.824 ms (0.00% GC)
  maximum time:     3.380 ms (0.00% GC)
  --------------
  samples:          351
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

There is no indices(::CartesianRange, ::Int) defined — is that issue-worthy?
Wouldn't it be sensible to define indices(::CartesianRange) as spanning start to stop rather than 1 to size?

@timholy (Member, Author) commented Mar 8, 2017

Your version has the same performance as mine when I run yours on my machine. Are you comparing timings on your machine versus timings on my machine? Or are you genuinely faster on yours?

There's a push to get rid of @generated functions wherever possible, so I'm afraid that even if this is faster, depending on Base.Cartesian is probably not something we want to do here.

@KristofferC (Member) commented Mar 8, 2017

There's a push to get rid of @generated functions wherever possible

While this is true, this seems like a perfect usage of a non-dangerous @generated (pure loop generation) function. I haven't benchmarked myself so if the two methods are of the same speed then of course, go with the non @generated, but it feels a bit silly to leave performance on the table because of not wanting to use a language feature in the intended way because of "reasons".

@JKrehl (Contributor) commented Mar 8, 2017

If I correct for the difference in Base.copy! (I forgot to re-rename it in my pasted timings), the difference is still a factor of 0.7.

Overall I do think my solution is more compact and straightforward, as it directly tackles the speed difference between A[5,5,5] and A[(5,5,5)...] (about a factor of 3.5 in my mini-test). Your solution does that too, but only for the first dimension.

That fundamental problem could be fixed by defining getindex{T,N}(::AbstractArray{T,N}, ::NTuple{N,Int}) or something like it, so the index tuple does not need to be parsed at access time. That could allow an equally fast implementation without a @generated function (BTW, I do think that is exactly what @generated functions are for, when equivalent code cannot be produced by normal syntactic means).

@timholy (Member, Author) commented Mar 8, 2017

It's parsed at compile time.

julia> A = rand(5,5,5);

julia> I = CartesianIndex(5,5,5);

julia> foo1(A, I) = (@inbounds ret = A[I]; ret)
foo1 (generic function with 1 method)

julia> foo3(A, i, j, k) = (@inbounds ret = A[i,j,k]; ret)
foo3 (generic function with 1 method)

julia> @code_native foo1(A, I)
        .text
Filename: REPL[3]
        pushq   %rbp
        movq    %rsp, %rbp
        movq    (%rdi), %rax
Source line: 1
        movq    8(%rsi), %rcx
        movq    16(%rsi), %rdx
        addq    $-1, %rdx
        imulq   32(%rdi), %rdx
        leaq    -1(%rcx,%rdx), %rcx
        imulq   24(%rdi), %rcx
        addq    (%rsi), %rcx
        vmovsd  -8(%rax,%rcx,8), %xmm0  # xmm0 = mem[0],zero
        popq    %rbp
        retq
        nopl    (%rax)

julia> @code_native foo3(A, 5, 5, 5)
        .text
Filename: REPL[4]
        pushq   %rbp
        movq    %rsp, %rbp
        movq    (%rdi), %rax
Source line: 1
        addq    $-1, %rcx
        imulq   32(%rdi), %rcx
        leaq    -1(%rdx,%rcx), %rcx
        imulq   24(%rdi), %rcx
        addq    %rsi, %rcx
        vmovsd  -8(%rax,%rcx,8), %xmm0  # xmm0 = mem[0],zero
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

Are you sure it's faster? Maybe build this branch to be certain?

That said, the Base.Cartesian solution does circumvent #9080 completely, not just for the first dimension. It would matter if the first dimension is tiny.

@timholy (Member, Author) commented Mar 8, 2017

There is no indices(::CartesianRange, ::Int) defined, is that issue-worthy? Wouldn't it be sensible to define indices(::CartesianRange) as between start and 'stop' rather than 1 and 'size'?

An oversight that's definitely issue-worthy (would be nicer than the convert version). Thanks for noticing! I can just add it to this, though, no need to file an issue.

@timholy (Member, Author) commented Mar 8, 2017

One problem, though:

julia> A = rand(3,5);

julia> R = CartesianRange(indices(A));

julia> indices(A)
(Base.OneTo(3), Base.OneTo(5))

julia> indices(R)
(1:3, 1:5)

So we lose the type information. Consequence:

julia> similar(A, indices(A))
3×5 Array{Float64,2}:
 6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310
 6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310  6.90334e-310
 6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310  0.0         

julia> similar(A, indices(R))
ERROR: MethodError: no method matching similar(::Array{Float64,2}, ::Type{Float64}, ::Tuple{UnitRange{Int64},UnitRange{Int64}})
Closest candidates are:
  similar(::Array{T,2}, ::Type) where T at array.jl:179
  similar(::Array, ::Type, ::Tuple{Vararg{Int64,N}}) where N at array.jl:181
  similar(::AbstractArray, ::Type{T}) where T at abstractarray.jl:507
  ...
Stacktrace:
 [1] similar(::Array{Float64,2}, ::Tuple{UnitRange{Int64},UnitRange{Int64}}) at ./abstractarray.jl:508

julia> using OffsetArrays

julia> similar(A, indices(R))
OffsetArrays.OffsetArray{Float64,2,Array{Float64,2}} with indices 1:3×1:5:
 6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310
 6.90334e-310  6.90334e-310  6.90334e-310  6.90335e-310  6.90335e-310
 6.90335e-310  6.90334e-310  6.90335e-310  6.90335e-310  6.90335e-310

I think we should rewrite CartesianRange so it's just a tuple of AbstractUnitRange objects. But we can't do that for 0.6 at this point, since that would be a breaking change.

@JKrehl (Contributor) commented Mar 8, 2017

@timholy
Yeah, the "mini-test" was a bit faulty.

Since indices(X) now carries information about how the object X is indexed, is it really consistent for CartesianRanges? They cannot be accessed by index, and are more iterator-like than array-like.
Option 1: indices(::CartesianRange) returns canonical (OneTo) indices of the size of the n-cuboid defined by the CartesianRange
Option 2: indices(::CartesianRange) returns unit ranges of indices in the "parent" of the CartesianRange, but then this cannot be used to create non-OffsetArrays arrays

On the original topic, maybe the speed of CartesianRange-iteration is the problem?

@timholy (Member, Author) commented Mar 8, 2017

Not sure I fully understand your point. I should clarify that I mean changing it to something like:

struct CartesianRange{N,R<:NTuple{N,AbstractUnitRange}}
    inds::R
end

It would still have its own start, step, and done functions that give the same behavior it has now; it's just that it's capable of preserving the type of the AbstractUnitRanges from which it's constructed.
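For what it's worth, later Julia versions adopted essentially this design: CartesianRange's successor, CartesianIndices, wraps a tuple of ranges and preserves their types. A quick check on a recent Julia (the .indices field name is from the current implementation, not anything in this PR):

```julia
# CartesianIndices stores the ranges it was built from, so the
# Base.OneTo type information survives the round trip through similar:
A = rand(3, 5)
R = CartesianIndices(A)      # wraps axes(A) = (Base.OneTo(3), Base.OneTo(5))
B = similar(A, R.indices)    # works, unlike the 0.5-era similar(A, indices(R))
@assert axes(B) == axes(A)
```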

On the original topic, maybe the speed of CartesianRange-iteration is the problem?

That's what that #9080 issue I've linked to is about. As discussed in greater detail (#16035 (comment)), the fundamental problem is that for i = 1:5 ... end has only one branch per iteration (checking the result of done), whereas for i = CartesianRange((1:5,)) ... end has two (next also has an internal branch, since it has to check if it should "carry"). @Keno has a plan to fix this through compiler optimizations as part of #18823.
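The extra branch can be seen in a hand-written sketch of 2-d Cartesian iteration (hypothetical simplified code, not Base's implementation; `next2d` is a made-up name):

```julia
# Advancing a 2-d Cartesian state: the common case bumps the first
# dimension, but a second "carry" branch resets it and bumps the next.
function next2d(I::NTuple{2,Int}, stop::NTuple{2,Int})
    i, j = I
    if i < stop[1]            # branch 1: advance the fastest dimension
        return (i + 1, j)
    else                      # branch 2 ("carry"): reset dim 1, increment dim 2
        return (1, j + 1)
    end
end
# A plain `for i = 1:n` loop needs only the done-style comparison,
# which is why linear iteration is cheaper per element.
```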

@JKrehl (Contributor) commented Mar 8, 2017

Maybe more precisely: CartesianRange is an array of indices — should indices return the ranges that index the CartesianRange itself, or those that cover its values (which in turn are used to index other arrays)?
I for one advocate the second option, as it would allow a very neat implementation of nested loops over the CartesianRange, although that would only duplicate the iterator interface. I would be rather wary of making CartesianRange carry starting-index information from some array.

@timholy (Member, Author) commented Mar 8, 2017

Maybe more precisely: CartesianRange is an array of indices — should indices return the ranges that index the CartesianRange itself, or those that cover its values (which in turn are used to index other arrays)?

Yes, the second. A CartesianRange is not itself indexable.

@JKrehl (Contributor) commented Mar 8, 2017

Then indices(CartesianRange(CartesianIndex((a,b,c)), CartesianIndex((d,e,f)))) is right to return (a:d, b:e, c:f), because those are the values that were given. That this does not allow creating a normal array is not CartesianRange's purview.
In the special case of a starting index being 1, wouldn't it be more sensible to equate 1:x with OneTo(x) when creating an array from a tuple of unit ranges?

@timholy (Member, Author) commented Mar 8, 2017

Agreed. It does work if you use CartesianRange(indices(A)), however. The fact that it's possible to not lose the information seems good enough to me.

@JKrehl (Contributor) commented Mar 8, 2017

Yes.
If indices did indeed cover the values of the CartesianRange, the @nloops solution above would simplify to a drop-in replacement (until direct iteration is equally fast).

@timholy (Member, Author) commented Mar 8, 2017

Agreed that in cases where Base.Cartesian is closer to the way we actually want to write the code, it's probably the better choice.

@JKrehl (Contributor) commented Mar 8, 2017

One more point: inlining the CartesianIndex arithmetic speeds it up by several orders of magnitude:

In[3]: @benchmark for i in 1:640000; +($(CartesianIndex((1,2,3))), $(CartesianIndex((1,2,3)))); end
Out[3]: BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     984.923 μs (0.00% GC)
  median time:      984.943 μs (0.00% GC)
  mean time:        1.021 ms (0.00% GC)
  maximum time:     1.949 ms (0.00% GC)
  --------------
  samples:          4880
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
In[8]: import Base.+
@inline (+){N}(index1::CartesianIndex{N}, index2::CartesianIndex{N}) = CartesianIndex{N}(map(+, index1.I, index2.I))
WARNING: Method definition +(Base.IteratorsMD.CartesianIndex{#N<:Any}, Base.IteratorsMD.CartesianIndex{#N<:Any}) in module IteratorsMD at multidimensional.jl:52 overwritten at In[8]:2.
Out[8]: + (generic function with 163 methods)
In[9]: @benchmark for i in 1:640000; +($(CartesianIndex((1,2,3))), $(CartesianIndex((1,2,3)))); end
Out[9]: BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.297 ns (0.00% GC)
  median time:      1.546 ns (0.00% GC)
  mean time:        1.505 ns (0.00% GC)
  maximum time:     16.189 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%

@timholy (Member, Author) commented Mar 8, 2017

Note that the time here is equivalent to a couple of instructions, which is not possible if it's actually executing that loop. With inlining on, the compiler is smart enough to realize that you're not doing anything with the result of that loop, so it helpfully elides it for you 😄.
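One general way to defeat that elision (a standard benchmarking pattern, not anything specific to this PR; `sum_first` is a made-up name) is to consume each iteration's result so the additions stay live:

```julia
# Accumulate a component of every sum so the compiler cannot drop
# the loop body as dead code:
function sum_first(I1::CartesianIndex, I2::CartesianIndex, n::Int)
    s = 0
    for _ in 1:n
        s += (I1 + I2)[1]     # use the result of each addition
    end
    return s
end
```

Now the per-iteration time reflects at least the addition and the component extraction, rather than an empty loop.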

Nevertheless, good catch that these operations should be inlined. This is a good day for multidimensional stuff. 😄

@JKrehl (Contributor) commented Mar 8, 2017

Fuddlesticks — but 1 ms is way too long for a component of something that completes in 400 µs. At least it makes sense now ^^.

What are we down to in the copy! thingy?

@timholy timholy force-pushed the teh/faster_copyR branch from c30b900 to 9ceb52b Compare March 8, 2017 19:25
@timholy (Member, Author) commented Mar 8, 2017

@JKrehl, see what you think of this. I credited you with the two most important aspects of this.

@JKrehl (Contributor) commented Mar 8, 2017

That's great.
The copy! benchmark takes 400 µs on this branch versus 2.700 ms on v0.5.

Should there be a note near the commented-out code saying it should be restored once #9080 is fixed?

@timholy (Member, Author) commented Mar 9, 2017

Note added. I'll wait a day or so for any other commentary before merging.

@timholy (Member, Author) commented Mar 9, 2017

BTW, I no longer think that indices(R) is wrong. Think of A[R] as equivalent to A[convert(Tuple{Vararg{UnitRange}}, R)...]: the key point is that the individual ranges have indices starting at 1. Consequently by the composition property the current result is correct.
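That composition property is easy to check (shown here with the modern CartesianIndices name; the 0.6-era CartesianRange behaves analogously):

```julia
A = rand(3, 5)
R = CartesianIndices((2:3, 1:4))
# Indexing with R matches splatting its component ranges, and the
# result has 1-based axes regardless of where R starts:
@assert A[R] == A[2:3, 1:4]
@assert axes(A[R]) == (Base.OneTo(2), Base.OneTo(4))
```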

I still think we need to change the internal representation, though.

@JKrehl (Contributor) commented Mar 9, 2017

@timholy Alright then.

Defining a CartesianRange as a cuboid of unit ranges does seem more natural to me. It would also allow differentiating between the dimensions, e.g. indexing some axes with smaller or larger Int types (if somebody wants to do that).

@tkelman (Contributor) commented Mar 9, 2017

Don't know how well this is covered by existing benchmarks, but we should really run @nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@timholy (Member, Author) commented Mar 10, 2017

Not covered, but there were also no regressions.

@timholy timholy merged commit e91065c into master Mar 10, 2017
@timholy timholy deleted the teh/faster_copyR branch March 10, 2017 09:47
@KristofferC added the label on Mar 10, 2017: potential benchmark (Could make a good benchmark in BaseBenchmarks)