Chunked copy in copyto_unalised for nD Cartesian destination #53234

jishnub · 2024-02-07T15:55:28Z

The idea is that if the indices of the destination are CartesianIndices((r1, r2)), we may treat it as a collection of slices with indices CartesianIndices((r1,)), and possibly dispatch to the more efficient linear-indexing branch for the individual slices. The method is called recursively on the slices.
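To make the slicing idea concrete, here is a minimal sketch of a recursive slice-wise copy. It is not the code in this PR: `slicewise_copy!` is a hypothetical name, it assumes `dest` and `src` have identical axes, and it omits the `isassigned`/`_unsetindex!` handling that `copyto_unaliased!` performs.

```julia
# Hypothetical helper (not the PR's implementation): copy `src` into `dest`
# by peeling off the trailing dimension and recursing on the resulting slices.
function slicewise_copy!(dest::AbstractArray{<:Any,N}, src::AbstractArray{<:Any,N}) where {N}
    axes(dest) == axes(src) || throw(DimensionMismatch("axes of dest and src must match"))
    if N <= 1
        # Base case: one loop over the indices of a single slice.
        for i in eachindex(dest, src)
            @inbounds dest[i] = src[i]
        end
    else
        # Recursive case: treat the array as a collection of (N-1)-dimensional slices.
        for j in axes(dest, N)
            slicewise_copy!(selectdim(dest, N, j), selectdim(src, N, j))
        end
    end
    return dest
end

# Usage with reversed views, similar in shape to the benchmarks in this PR.
a = zeros(4, 3); b = reshape(collect(1.0:12.0), 4, 3)
slicewise_copy!(view(a, reverse.(axes(a))...), view(b, reverse.(axes(b))...))
```

Whether the base case actually reaches a linear-indexing fast path depends on the `IndexStyle` of the slices, which this sketch does not inspect.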

Performance comparisons:

julia> a = rand(200, 200); b = rand(size(a)...);

julia> @btime $a[axes($a)...] .= @view $b[axes($b)...];
  73.799 μs (0 allocations: 0 bytes) # master
  10.077 μs (0 allocations: 0 bytes) # PR

julia> @btime $a[axes($a)...] .= $b;
  38.515 μs (0 allocations: 0 bytes) # master
  10.155 μs (0 allocations: 0 bytes) # PR

julia> @btime $a[reverse.(axes($a))...] .= @view $b[axes($b)...];
  122.423 μs (0 allocations: 0 bytes) # master
  25.809 μs (0 allocations: 0 bytes) # PR

julia> @btime $a[reverse.(axes($a))...] .= @view $b[reverse.(axes($b))...];
  545.094 μs (0 allocations: 0 bytes) # master
  33.586 μs (0 allocations: 0 bytes) # PR

One concern is that constructing the views may allocate in certain cases (see e.g. #53231):

julia> @btime $a[$(collect.(axes(a)))...] .= @view $b[$(collect(axes(b)))...];
  76.822 μs (7 allocations: 240 bytes) # master
  52.168 μs (407 allocations: 325.23 KiB) # PR

N5N3 · 2024-02-07T16:25:49Z

I'm not sure if this is the correct direction.
The shared-iterator branch (L1103-1117) is just @inbounds @simd for I in iterdest,
and the dual-iterator branch (L1122-1137) could be done via @inbounds @simd for I in view(iterdest, itersrc).
(We should also check itersrc isa AbstractUnitRange rather than srcstyle isa IndexLinear.)

We just need to solve the bootstrap issue.
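For reference, a small sketch of how the two checks mentioned above relate for ordinary views. The arrays here are made up for illustration; the point is only that `eachindex` returns an `AbstractUnitRange` for an `IndexLinear` view and `CartesianIndices` for an `IndexCartesian` one.

```julia
a = rand(3, 3)

# A contiguous block of columns supports fast linear indexing.
lin = view(a, :, 1:2)
# Reversing the first axis forces Cartesian indexing.
cart = view(a, 3:-1:1, :)

lin_is_linear    = IndexStyle(lin) isa IndexLinear
lin_is_unitrange = eachindex(lin) isa AbstractUnitRange
cart_is_cartesian = IndexStyle(cart) isa IndexCartesian
cart_is_cartinds  = eachindex(cart) isa CartesianIndices
```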

jishnub · 2024-02-07T16:31:58Z

Part of the performance difference here would be reduced if #53158 is resolved. However, profiling suggests that integer comparisons take up a lot of time when iterating over Cartesian ranges, so replacing those with linear ranges should yield some performance gain.
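As an illustration of what replacing a Cartesian range with linear ranges means, the two hypothetical functions below visit the elements of a matrix in the same column-major order, one through `CartesianIndices` and one through nested per-axis ranges. This is only a sketch of the equivalence; it makes no claim about relative speed.

```julia
# Hypothetical illustration: iterate through a single CartesianIndices range.
function sum_cartesian(A::AbstractMatrix)
    s = zero(eltype(A))
    for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end

# Hypothetical illustration: the same traversal as nested loops over
# per-axis ranges (outer loop over columns, inner loop over rows).
function sum_nested(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end
```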

jishnub · 2024-02-07T17:04:55Z

Trying out the suggestion for the shared iterator case:

julia> function copyto_unaliased!(deststyle::IndexStyle, dest::AbstractArray, srcstyle::IndexStyle, src::AbstractArray)
           isempty(src) && return dest
           destinds, srcinds = LinearIndices(dest), LinearIndices(src)
           idf, isf = first(destinds), first(srcinds)
           Δi = idf - isf
           (checkbounds(Bool, destinds, isf+Δi) & checkbounds(Bool, destinds, last(srcinds)+Δi)) ||
               throw(BoundsError(dest, srcinds))
           if deststyle isa IndexLinear
               if srcstyle isa IndexLinear
                   # Single-index implementation
                   @inbounds for i in srcinds
                       if isassigned(src, i)
                           dest[i + Δi] = src[i]
                       else
                           _unsetindex!(dest, i + Δi)
                       end
                   end
               else
                   # Dual-index implementation
                   i = idf - 1
                   @inbounds for a in eachindex(src)
                       i += 1
                       if isassigned(src, a)
                           dest[i] = src[a]
                       else
                           _unsetindex!(dest, i)
                       end
                   end
               end
           else
               iterdest, itersrc = eachindex(dest), eachindex(src)
               if iterdest == itersrc
                   # Shared-iterator implementation
                   @inbounds @simd for I in iterdest
                       if isassigned(src, I)
                           dest[I] = src[I]
                       else
                           _unsetindex!(dest, I)
                       end
                   end
               else
                   # Dual-iterator implementation
                   ret = iterate(iterdest)
                   @inbounds for a in itersrc
                       idx, state = ret::NTuple{2,Any}
                       if isassigned(src, a)
                           dest[idx] = src[a]
                       else
                           _unsetindex!(dest, idx)
                       end
                       ret = iterate(iterdest, state)
                   end
               end
           end
           return dest
       end
copyto_unaliased! (generic function with 1 method)

julia> a = rand(200, 200); b = rand(size(a)...);

julia> @btime $a[reverse.(axes($a))...] .= @view $b[reverse.(axes($b))...];
  544.766 μs (0 allocations: 0 bytes)

julia> @btime copyto_unaliased!(IndexCartesian(), $(view(a, reverse.(axes(a))...)), IndexCartesian(), $(view(b, reverse.(axes(b))...)));
  492.596 μs (0 allocations: 0 bytes)

julia> @btime $a[axes($a)...] .= @view $b[axes($b)...];
  73.170 μs (0 allocations: 0 bytes)

julia> @btime copyto_unaliased!(IndexCartesian(), $(view(a, axes(a)...)), IndexCartesian(), $(view(b, axes(b)...)));
  75.533 μs (0 allocations: 0 bytes)

Performance with and without @simd appears comparable on v"1.11.0-DEV.1486". My guess is that this is because the runtime is dominated not by the indexing operations, but by integer arithmetic and comparisons in iterating over CartesianIndices.

jishnub · 2024-02-08T15:35:44Z

Closing this until I understand the reason behind this better

jishnub added 2 commits February 7, 2024 16:43

Chunked copy in copyto_unalised for nD Cartesian dest

4087035

Add test

b1847a9

jishnub added the performance (Must go faster) and arrays labels Feb 7, 2024

jishnub requested a review from N5N3 February 7, 2024 16:02

Change to checking if itersrc isa AbstractUnitRange

44600da

jishnub closed this Feb 8, 2024

giordano deleted the jishnub/copytounaliasedchunked branch February 25, 2024 21:33
