
Speed up copy!(dest, Rdest, src, Rsrc) by splitting out first dimension #20944

Merged — 4 commits merged into master on Mar 10, 2017

Conversation

@timholy (Member) commented Mar 8, 2017

I noticed something also reported by @JKrehl: our "block copier" copy!(dest, Rdest, src, Rsrc) (where Rdest and Rsrc are CartesianRanges) is slow. Presumably #9080 has something to do with it.

Master:

julia> A = rand(200, 200, 16);

julia> B = similar(A);

julia> R = CartesianRange(indices(A))
CartesianRange{CartesianIndex{3}}(CartesianIndex{3}((1, 1, 1)), CartesianIndex{3}((200, 200, 16)))

julia> copy!(B, R, A, R);

julia> @benchmark copy!(B, R, A, R) seconds=1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.379 ms (0.00% GC)
  median time:      3.499 ms (0.00% GC)
  mean time:        3.591 ms (0.00% GC)
  maximum time:     6.628 ms (0.00% GC)
  --------------
  samples:          273
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This PR:

julia> @benchmark copy!(B, R, A, R) seconds=1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     688.652 μs (0.00% GC)
  median time:      838.804 μs (0.00% GC)
  mean time:        857.291 μs (0.00% GC)
  maximum time:     1.689 ms (0.00% GC)
  --------------
  samples:          1074
  evals/sample:     1
  time tolerance:   15.00%
  memory tolerance: 1.00%

So approximately 4x faster for a 3d array.

I also decided to write a docstring for this method, and to explicitly test it. (I guess previously it had been tested only indirectly.)
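The core idea — run the innermost (first) dimension as a plain integer loop under an outer loop over the remaining dimensions — can be sketched as follows. This is a simplified, hypothetical version written with the later `CartesianIndices` API (which replaced `CartesianRange`); it is not the PR's actual code, and the helper name `blockcopy!` is made up for illustration.

```julia
# Hypothetical sketch of the "split out the first dimension" strategy.
function blockcopy!(dest::AbstractArray{T,N}, Rdest::CartesianIndices{N},
                    src::AbstractArray{T,N}, Rsrc::CartesianIndices{N}) where {T,N}
    isempty(Rdest) && return dest
    size(Rdest) == size(Rsrc) ||
        throw(ArgumentError("source and destination must have same size"))
    checkbounds(dest, first(Rdest)); checkbounds(dest, last(Rdest))
    checkbounds(src, first(Rsrc));   checkbounds(src, last(Rsrc))
    δ = (first(Rdest) - first(Rsrc)).I          # per-dimension offset, as a tuple
    irange = Rsrc.indices[1]                    # first-dimension source range
    Rtail  = CartesianIndices(Base.tail(Rsrc.indices))  # remaining dimensions
    for Itail in Rtail
        Jtail = CartesianIndex(Itail.I .+ Base.tail(δ))
        for i in irange                         # plain integer loop: one branch per iteration
            @inbounds dest[i + δ[1], Jtail] = src[i, Itail]
        end
    end
    return dest
end
```

The inner loop over `irange` is an ordinary range iteration, so only the trailing dimensions pay the cost of Cartesian iteration.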

@ararslan added labels on Mar 8, 2017: collections (Data structures holding multiple items, e.g. sets), performance (Must go faster)
@JKrehl (Contributor) commented Mar 8, 2017

In my view the bottleneck is the parsing of the indices into the arrayref, where, I suspect, the index tuple generated by iterating the CartesianRange is resolved at runtime. By unpacking the tuple and the iteration loops explicitly (at compile time), the speedup is even more substantial, and the code is shorter.

@generated function copy!{T,N}(dest::AbstractArray{T,N}, Rdest::CartesianRange{CartesianIndex{N}}, src::AbstractArray{T,N}, Rsrc::CartesianRange{CartesianIndex{N}})
    quote
        isempty(Rdest) && return dest
        if size(Rdest) != size(Rsrc)
            throw(ArgumentError("source and destination must have same size (got $(size(Rsrc)) and $(size(Rdest)))"))
        end
        @boundscheck checkbounds(dest, Rdest.start)
        @boundscheck checkbounds(dest, Rdest.stop)
        @boundscheck checkbounds(src, Rsrc.start)
        @boundscheck checkbounds(src, Rsrc.stop)
        ΔI = Rdest.start - Rsrc.start
        @nloops $N i (n->Rsrc.start[n]-start(indices(Rsrc)[n])+indices(Rsrc)[n]) begin
            @inbounds @nref($N,dest,n->i_n+ΔI[n]) = @nref($N,src,i)
        end
        dest
    end
end

With the timings:

In[34]: precompile(copy!, map(typeof, (B, R, A, R)))
@benchmark copy!($B, $R, $A, $R) seconds=1
Out[34]: BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     373.251 μs (0.00% GC)
  median time:      383.829 μs (0.00% GC)
  mean time:        402.241 μs (0.00% GC)
  maximum time:     730.292 μs (0.00% GC)
  --------------
  samples:          2370
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In[48]: precompile(Base.copy!, map(typeof, (B, R, A, R)))
@benchmark Base.copy!($B, $R, $A, $R) seconds=1
Out[48]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.665 ms (0.00% GC)
  median time:      2.768 ms (0.00% GC)
  mean time:        2.824 ms (0.00% GC)
  maximum time:     3.380 ms (0.00% GC)
  --------------
  samples:          351
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

There is no indices(::CartesianRange, ::Int) defined — is that issue-worthy?
Wouldn't it be sensible to define indices(::CartesianRange) as spanning start to stop rather than 1 to size?

@timholy (Member, Author) commented Mar 8, 2017

Your version has the same performance as mine when I run yours on my machine. Are you comparing timings on your machine versus timings on my machine? Or are you genuinely faster on yours?

There's a push to get rid of @generated functions wherever possible, so I'm afraid that even if this is faster, depending on Base.Cartesian is probably not something we want to do here.

@KristofferC (Member) commented Mar 8, 2017

There's a push to get rid of @generated functions wherever possible

While this is true, this seems like a perfect usage of a non-dangerous @generated (pure loop generation) function. I haven't benchmarked myself so if the two methods are of the same speed then of course, go with the non @generated, but it feels a bit silly to leave performance on the table because of not wanting to use a language feature in the intended way because of "reasons".

@JKrehl (Contributor) commented Mar 8, 2017

If I correct for the difference in Base.copy! (I forgot to re-rename it in my pasted timings), the difference is still a factor of 0.7.

Overall I do think my solution is more compact and straightforward, as it directly tackles the speed difference between A[5,5,5] and A[(5,5,5)...] (about a factor of 3.5 in my mini-test). Your solution does that too, but only for the first dimension.

That fundamental problem could be fixed by defining getindex{T,N}(::AbstractArray{T,N}, ::NTuple{N,Int}) or something like it, so the index tuple does not need to be parsed at access time. That could allow an equally fast implementation without a @generated function (BTW, I do think that is exactly what @generated functions are for, when equivalent code cannot be produced by normal syntactic means).

@timholy (Member, Author) commented Mar 8, 2017

It's parsed at compile time.

julia> A = rand(5,5,5);

julia> I = CartesianIndex(5,5,5);

julia> foo1(A, I) = (@inbounds ret = A[I]; ret)
foo1 (generic function with 1 method)

julia> foo3(A, i, j, k) = (@inbounds ret = A[i,j,k]; ret)
foo3 (generic function with 1 method)

julia> @code_native foo1(A, I)
        .text
Filename: REPL[3]
        pushq   %rbp
        movq    %rsp, %rbp
        movq    (%rdi), %rax
Source line: 1
        movq    8(%rsi), %rcx
        movq    16(%rsi), %rdx
        addq    $-1, %rdx
        imulq   32(%rdi), %rdx
        leaq    -1(%rcx,%rdx), %rcx
        imulq   24(%rdi), %rcx
        addq    (%rsi), %rcx
        vmovsd  -8(%rax,%rcx,8), %xmm0  # xmm0 = mem[0],zero
        popq    %rbp
        retq
        nopl    (%rax)

julia> @code_native foo3(A, 5, 5, 5)
        .text
Filename: REPL[4]
        pushq   %rbp
        movq    %rsp, %rbp
        movq    (%rdi), %rax
Source line: 1
        addq    $-1, %rcx
        imulq   32(%rdi), %rcx
        leaq    -1(%rdx,%rcx), %rcx
        imulq   24(%rdi), %rcx
        addq    %rsi, %rcx
        vmovsd  -8(%rax,%rcx,8), %xmm0  # xmm0 = mem[0],zero
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

Are you sure it's faster? Maybe build this branch to be certain?

That said, the Base.Cartesian solution does circumvent #9080 completely, not just for the first dimension. It would matter if the first dimension is tiny.

@timholy (Member, Author) commented Mar 8, 2017

There is no indices(::CartesianRange, ::Int) defined, is that issue-worthy? Wouldn't it be sensible to define indices(::CartesianRange) as between start and 'stop' rather than 1 and 'size'?

An oversight that's definitely issue-worthy (would be nicer than the convert version). Thanks for noticing! I can just add it to this, though, no need to file an issue.

@timholy (Member, Author) commented Mar 8, 2017

One problem, though:

julia> A = rand(3,5);

julia> R = CartesianRange(indices(A));

julia> indices(A)
(Base.OneTo(3), Base.OneTo(5))

julia> indices(R)
(1:3, 1:5)

So we lose the type information. Consequence:

julia> similar(A, indices(A))
3×5 Array{Float64,2}:
 6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310
 6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310  6.90334e-310
 6.90335e-310  6.90335e-310  6.90335e-310  6.90334e-310  0.0         

julia> similar(A, indices(R))
ERROR: MethodError: no method matching similar(::Array{Float64,2}, ::Type{Float64}, ::Tuple{UnitRange{Int64},UnitRange{Int64}})
Closest candidates are:
  similar(::Array{T,2}, ::Type) where T at array.jl:179
  similar(::Array, ::Type, ::Tuple{Vararg{Int64,N}}) where N at array.jl:181
  similar(::AbstractArray, ::Type{T}) where T at abstractarray.jl:507
  ...
Stacktrace:
 [1] similar(::Array{Float64,2}, ::Tuple{UnitRange{Int64},UnitRange{Int64}}) at ./abstractarray.jl:508

julia> using OffsetArrays

julia> similar(A, indices(R))
OffsetArrays.OffsetArray{Float64,2,Array{Float64,2}} with indices 1:3×1:5:
 6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310  6.90335e-310
 6.90334e-310  6.90334e-310  6.90334e-310  6.90335e-310  6.90335e-310
 6.90335e-310  6.90334e-310  6.90335e-310  6.90335e-310  6.90335e-310

I think we should rewrite CartesianRange so it's just a tuple of AbstractUnitRange objects. But we can't do that for 0.6 at this point, since that would be a breaking change.

@JKrehl (Contributor) commented Mar 8, 2017

@timholy
Yeah, the "mini-test" was a bit faulty.

Since indices(X) now carries information about how the object X is indexed, is it really consistent for CartesianRanges? They cannot be accessed by index, and are more iterator-like than array-like.
Option 1: indices(::CartesianRange) returns canonical (OneTo) indices of the size of the n-cuboid defined by the CartesianRange
Option 2: indices(::CartesianRange) returns unit ranges of indices in the "parent" of the CartesianRange, but then this cannot be used to create non-OffsetArrays arrays

On the original topic, maybe the speed of CartesianRange-iteration is the problem?

@timholy (Member, Author) commented Mar 8, 2017

Not sure I fully understand your point. I should clarify that I mean changing it to something like:

struct CartesianRange{N,R<:NTuple{N,AbstractUnitRange}}
    inds::R
end

It would still have its own start, step, and done functions that give the same behavior it has now; it's just that it's capable of preserving the type of the AbstractUnitRanges from which it's constructed.
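For what it's worth, later Julia versions adopted essentially this design: CartesianRange's successor, CartesianIndices, wraps a tuple of ranges and preserves their types. A quick check on a recent Julia (the .indices field name is from the current implementation, not anything in this PR):

```julia
# CartesianIndices stores the ranges it was built from, so the
# Base.OneTo type information survives the round trip through similar:
A = rand(3, 5)
R = CartesianIndices(A)      # wraps axes(A) = (Base.OneTo(3), Base.OneTo(5))
B = similar(A, R.indices)    # works, unlike the 0.5-era similar(A, indices(R))
@assert axes(B) == axes(A)
```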

On the original topic, maybe the speed of CartesianRange-iteration is the problem?

That's what that #9080 issue I've linked to is about. As discussed in greater detail (#16035 (comment)), the fundamental problem is that for i = 1:5 ... end has only one branch per iteration (checking the result of done), whereas for i = CartesianRange((1:5,)) ... end has two (next also has an internal branch, since it has to check if it should "carry"). @Keno has a plan to fix this through compiler optimizations as part of #18823.
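The extra branch can be seen in a hand-written sketch of 2-d Cartesian iteration (hypothetical simplified code, not Base's implementation; `next2d` is a made-up name):

```julia
# Advancing a 2-d Cartesian state: the common case bumps the first
# dimension, but a second "carry" branch resets it and bumps the next.
function next2d(I::NTuple{2,Int}, stop::NTuple{2,Int})
    i, j = I
    if i < stop[1]            # branch 1: advance the fastest dimension
        return (i + 1, j)
    else                      # branch 2 ("carry"): reset dim 1, increment dim 2
        return (1, j + 1)
    end
end
# A plain `for i = 1:n` loop needs only the done-style comparison,
# which is why linear iteration is cheaper per element.
```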

@JKrehl (Contributor) commented Mar 8, 2017

Maybe more precisely: CartesianRange is an array of indices — should indices return the ranges that index the CartesianRange itself, or those that cover its values (which in turn are used to index other arrays)?
I for one advocate the second option, as it would allow a very neat implementation of nested loops over the CartesianRange, although that would only duplicate the iterator interface. I would be rather wary of making CartesianRange carry starting-index information from some array.

@timholy (Member, Author) commented Mar 8, 2017

Maybe more precisely: CartesianRange is an array of indices — should indices return the ranges that index the CartesianRange itself, or those that cover its values (which in turn are used to index other arrays)?

Yes, the second. A CartesianRange is not itself indexable.

@JKrehl (Contributor) commented Mar 8, 2017

Then indices(CartesianRange(CartesianIndex((a,b,c)), CartesianIndex((d,e,f)))) is right to return (a:d, b:e, c:f), because those are the values that were given. That this does not allow creating a normal array is not CartesianRange's purview.
In the special case of a starting index being 1, wouldn't it be more sensible to equate 1:x with OneTo(x) when creating an array from a tuple of unit ranges?

@timholy (Member, Author) commented Mar 8, 2017

Agreed. It does work if you use CartesianRange(indices(A)), however. The fact that it's possible to not lose the information seems good enough to me.

@JKrehl (Contributor) commented Mar 8, 2017

Yes.
If indices did indeed cover the values of the CartesianRange, the @nloops solution above would simplify to a drop-in replacement (until direct iteration is equally fast).

@timholy (Member, Author) commented Mar 8, 2017

Agreed that in cases where Base.Cartesian is closer to the way we actually want to write the code, it's probably the better choice.

@JKrehl (Contributor) commented Mar 8, 2017

One more point: inlining the CartesianIndex arithmetic speeds it up by several orders of magnitude:

In[3]: @benchmark for i in 1:640000; +($(CartesianIndex((1,2,3))), $(CartesianIndex((1,2,3)))); end
Out[3]: BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     984.923 μs (0.00% GC)
  median time:      984.943 μs (0.00% GC)
  mean time:        1.021 ms (0.00% GC)
  maximum time:     1.949 ms (0.00% GC)
  --------------
  samples:          4880
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
In[8]: import Base.+
@inline (+){N}(index1::CartesianIndex{N}, index2::CartesianIndex{N}) = CartesianIndex{N}(map(+, index1.I, index2.I))
WARNING: Method definition +(Base.IteratorsMD.CartesianIndex{#N<:Any}, Base.IteratorsMD.CartesianIndex{#N<:Any}) in module IteratorsMD at multidimensional.jl:52 overwritten at In[8]:2.
Out[8]: + (generic function with 163 methods)
In[9]: @benchmark for i in 1:640000; +($(CartesianIndex((1,2,3))), $(CartesianIndex((1,2,3)))); end
Out[9]: BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.297 ns (0.00% GC)
  median time:      1.546 ns (0.00% GC)
  mean time:        1.505 ns (0.00% GC)
  maximum time:     16.189 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%

@timholy (Member, Author) commented Mar 8, 2017

Note that the time here is equivalent to a couple of instructions, which is not possible if it's actually executing that loop. With inlining on, the compiler is smart enough to realize that you're not doing anything with the result of that loop, so it helpfully elides it for you 😄.
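One general way to defeat that elision (a standard benchmarking pattern, not anything specific to this PR; `sum_first` is a made-up name) is to consume each iteration's result so the additions stay live:

```julia
# Accumulate a component of every sum so the compiler cannot drop
# the loop body as dead code:
function sum_first(I1::CartesianIndex, I2::CartesianIndex, n::Int)
    s = 0
    for _ in 1:n
        s += (I1 + I2)[1]     # use the result of each addition
    end
    return s
end
```

Now the per-iteration time reflects at least the addition and the component extraction, rather than an empty loop.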

Nevertheless, good catch that these operations should be inlined. This is a good day for multidimensional stuff. 😄

@JKrehl (Contributor) commented Mar 8, 2017

Fuddlesticks — but 1 ms is way too long for a component of something that completes in 400 µs. At least it makes sense now ^^.

What are we down to in the copy! thingy?

@timholy timholy force-pushed the teh/faster_copyR branch from c30b900 to 9ceb52b Compare March 8, 2017 19:25
@timholy (Member, Author) commented Mar 8, 2017

@JKrehl, see what you think of this. I credited you with the two most important aspects of this.

@JKrehl (Contributor) commented Mar 8, 2017

That's great.
The copy! benchmark takes 400 µs on this branch versus 2.700 ms on v0.5.

Should there be a note near the commented-out code saying it should be restored once #9080 is fixed?

@timholy (Member, Author) commented Mar 9, 2017

Note added. I'll wait a day or so for any other commentary before merging.

@timholy (Member, Author) commented Mar 9, 2017

BTW, I no longer think that indices(R) is wrong. Think of A[R] as equivalent to A[convert(Tuple{Vararg{UnitRange}}, R)...]: the key point is that the individual ranges have indices starting at 1. Consequently by the composition property the current result is correct.
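That composition property is easy to check (shown here with the modern CartesianIndices name; the 0.6-era CartesianRange behaves analogously):

```julia
A = rand(3, 5)
R = CartesianIndices((2:3, 1:4))
# Indexing with R matches splatting its component ranges, and the
# result has 1-based axes regardless of where R starts:
@assert A[R] == A[2:3, 1:4]
@assert axes(A[R]) == (Base.OneTo(2), Base.OneTo(4))
```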

I still think we need to change the internal representation, though.

@JKrehl (Contributor) commented Mar 9, 2017

@timholy Alright then.

Defining a CartesianRange as a cuboid of unit ranges does seem more natural to me. It would also allow differentiating between the dimensions, e.g. indexing some axes with smaller or larger Int types (if somebody wants to do that).

@tkelman (Contributor) commented Mar 9, 2017

Don't know how well this is covered by existing benchmarks, but we should really run @nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@timholy (Member, Author) commented Mar 10, 2017

Not covered, but there were also no regressions.

@timholy timholy merged commit e91065c into master Mar 10, 2017
@timholy timholy deleted the teh/faster_copyR branch March 10, 2017 09:47
@KristofferC added the label on Mar 10, 2017: potential benchmark (Could make a good benchmark in BaseBenchmarks)