Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chunked copy in copyto_unalised for nD Cartesian destination #53234

Closed
wants to merge 3 commits into from

Conversation

jishnub
Copy link
Contributor

@jishnub jishnub commented Feb 7, 2024

The idea is that if the indices of the destination are CartesianIndices((r1, r2)), we may treat it as a collection of slices with indices CartesianIndices((r1,)), and possibly dispatch to the more efficient linear-indexing branch for the individual slices. The method is called recursively on the slices.

Performance comparisons:

julia> a = rand(200, 200); b = rand(size(a)...);

julia> @btime $a[axes($a)...] .= @view $b[axes($b)...];
  73.799 μs (0 allocations: 0 bytes) # master
  10.077 μs (0 allocations: 0 bytes) # PR

julia> @btime $a[axes($a)...] .= $b;
  38.515 μs (0 allocations: 0 bytes) # master
  10.155 μs (0 allocations: 0 bytes) # PR

julia> @btime $a[reverse.(axes($a))...] .= @view $b[axes($b)...];
  122.423 μs (0 allocations: 0 bytes) # master
  25.809 μs (0 allocations: 0 bytes) # PR

julia> @btime $a[reverse.(axes($a))...] .= @view $b[reverse.(axes($b))...];
  545.094 μs (0 allocations: 0 bytes) # master
  33.586 μs (0 allocations: 0 bytes) # PR

One concern is that constructing the views may allocate in certain cases (see e.g. #53231):

julia> @btime $a[$(collect.(axes(a)))...] .= @view $b[$(collect(axes(b)))...];
  76.822 μs (7 allocations: 240 bytes) # master
  52.168 μs (407 allocations: 325.23 KiB) # PR

@jishnub jishnub added performance Must go faster arrays [a, r, r, a, y, s] labels Feb 7, 2024
@jishnub jishnub requested a review from N5N3 February 7, 2024 16:02
@N5N3
Copy link
Member

N5N3 commented Feb 7, 2024

I'm not sure if this is the correct direction.
The shared-iterator branch (L1103-1117) is just @inbounds @simd for I in iterdest
and the Dual-iterator branch (L1122-1137) could be done via @inbounds @simd for I in view(iterdest, itersrc)
(And we'd better check itersrc isa AbstractUnitRange rather than srcstyle isa IndexLinear)

We just need to solve the bootstrap issue.

@jishnub
Copy link
Contributor Author

jishnub commented Feb 7, 2024

A part of the performance difference here would be reduced if #53158 is resolved. However, profiling suggests that integer comparisons take up a lot of time in iterating over Cartesian ranges, so if we may replace that by linear ranges, there is some performance gain to be had there.

@jishnub
Copy link
Contributor Author

jishnub commented Feb 7, 2024

Trying out the suggestion for the shared iterator case:

julia> function copyto_unaliased!(deststyle::IndexStyle, dest::AbstractArray, srcstyle::IndexStyle, src::AbstractArray)
           isempty(src) && return dest
           destinds, srcinds = LinearIndices(dest), LinearIndices(src)
           idf, isf = first(destinds), first(srcinds)
           Δi = idf - isf
           (checkbounds(Bool, destinds, isf+Δi) & checkbounds(Bool, destinds, last(srcinds)+Δi)) ||
               throw(BoundsError(dest, srcinds))
           if deststyle isa IndexLinear
               if srcstyle isa IndexLinear
                   # Single-index implementation
                   @inbounds for i in srcinds
                       if isassigned(src, i)
                           dest[i + Δi] = src[i]
                       else
                           _unsetindex!(dest, i + Δi)
                       end
                   end
               else
                   # Dual-index implementation
                   i = idf - 1
                   @inbounds for a in eachindex(src)
                       i += 1
                       if isassigned(src, a)
                           dest[i] = src[a]
                       else
                           _unsetindex!(dest, i)
                       end
                   end
               end
           else
               iterdest, itersrc = eachindex(dest), eachindex(src)
               if iterdest == itersrc
                   # Shared-iterator implementation
                   @inbounds @simd for I in iterdest
                       if isassigned(src, I)
                           dest[I] = src[I]
                       else
                           _unsetindex!(dest, I)
                       end
                   end
               else
                   # Dual-iterator implementation
                   ret = iterate(iterdest)
                   @inbounds for a in itersrc
                       idx, state = ret::NTuple{2,Any}
                       if isassigned(src, a)
                           dest[idx] = src[a]
                       else
                           _unsetindex!(dest, idx)
                       end
                       ret = iterate(iterdest, state)
                   end
               end
           end
           return dest
       end
copyto_unaliased! (generic function with 1 method)

julia> a = rand(200, 200); b = rand(size(a)...);

julia> @btime $a[reverse.(axes($a))...] .= @view $b[reverse.(axes($b))...];
  544.766 μs (0 allocations: 0 bytes)

julia> @btime copyto_unaliased!(IndexCartesian(), $(view(a, reverse.(axes(a))...)), IndexCartesian(), $(view(b, reverse.(axes(b))...)));
  492.596 μs (0 allocations: 0 bytes)

julia> @btime $a[axes($a)...] .= @view $b[axes($b)...];
  73.170 μs (0 allocations: 0 bytes)

julia> @btime copyto_unaliased!(IndexCartesian(), $(view(a, axes(a)...)), IndexCartesian(), $(view(b, axes(b)...)));
  75.533 μs (0 allocations: 0 bytes)

The performances with and without @simd appear comparable on v"1.11.0-DEV.1486". My guess is that this is because the runtime is dominated not by the indexing opreations, but by integer arithmetic and comparisons in iterating over CartesianIndices.

@jishnub
Copy link
Contributor Author

jishnub commented Feb 8, 2024

Closing this until I understand the reason behind this better

@jishnub jishnub closed this Feb 8, 2024
@giordano giordano deleted the jishnub/copytounaliasedchunked branch February 25, 2024 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants