Move bounds checks on copyto!(dst, n, src)
#43517
sounds good to me
```julia
if haslength(src)
    checkbounds(dest, i)
    checkbounds(dest, i + length(src) - 1)
    for x in src
```
How about replacing this with

```julia
I = eachindex(dest)[i]
@inbounds for x in src
    dest[I] = x
    I = nextind(dest, I)
end
```

Might be faster for `IndexCartesian` cases.
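The suggestion above can be sketched as a standalone, runnable function (`copy_from!` is a hypothetical name for illustration, not the PR's code):

```julia
# Sketch of the suggested approach: fetch the i-th index from eachindex(dest)
# once, then advance with nextind, so IndexCartesian arrays never have to
# convert a linear index inside the loop.
function copy_from!(dest::AbstractArray, i::Integer, src)
    I = eachindex(dest)[i]
    @inbounds for x in src
        dest[I] = x
        I = nextind(dest, I)
    end
    return dest
end

a = zeros(3, 3)
copy_from!(a, 4, (1.0, 2.0, 3.0))  # fills linear positions 4, 5, 6 (the second column)
```

For a plain `Array`, `eachindex` is a linear range and `nextind` is just `i + 1`; for an `IndexCartesian` view it walks `CartesianIndex` values directly.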
And it seems reasonable to improve `copyto!(dest::AbstractArray, src)` (L893) in this PR.
Using `eachindex(dest)[i]` doesn't seem to help on the things I tried. But I agree that the whole-array method just above has room for comparable improvement.
I could see some difference in the following example:

```julia
using BenchmarkTools

f(dest, i, src) = begin
    I = eachindex(dest)[i]
    @inbounds for x in src
        dest[I] = x
        I = nextind(dest, I)
    end
end

g(dest, i, src) = begin
    @inbounds for x in src
        dest[i] = x
        i += 1
    end
end

a = view(randn(100, 100), 1:100, 1:100)
@btime f($a, 555, $(i + 1 for i in 1:1000))  # 2.144 μs (0 allocations: 0 bytes)
@btime g($a, 555, $(i + 1 for i in 1:1000))  # 2.956 μs (0 allocations: 0 bytes)
```

For longer `src` or higher dimensions, the gain might be bigger?
Just when I thought I'd convinced myself... these give me:

```julia
julia> @btime f($a, 555, $(i + 1 for i in 1:1000))
  min 2.718 μs, mean 2.759 μs (0 allocations)

julia> @btime g($a, 555, $(i + 1 for i in 1:1000))
  min 804.804 ns, mean 810.625 ns (0 allocations)
```

This does not affect the 2-arg method?

```julia
julia> @btime Base.copyto!($a, 555, $(i + 1 for i in 1:1000));
  min 973.938 ns, mean 985.105 ns (0 allocations)

julia> @btime _copyto!($a, 555, $(i + 1 for i in 1:1000)); # first commit of PR
  min 806.769 ns, mean 811.212 ns (0 allocations)

julia> @btime _copyto!($a, 555, $(i + 1 for i in 1:1000)); # PR with eaaefb1
  min 2.741 μs, mean 2.786 μs (0 allocations)

julia> @btime Base.copyto!($a, $(i + 1 for i in 1:length(a))); # 2-arg method
  min 12.875 μs, mean 13.022 μs (0 allocations)

julia> @btime _copyto!($a, $(i + 1 for i in 1:length(a))); # PR with 68e3d5e
  min 6.933 μs, mean 7.023 μs (0 allocations)
```
As for the 2-arg version on M1 native vs. master: I think the problem is that `firstindex` always returns 1 when `ndims != 1`. Using `first(eachindex(dest))` instead should make things consistent.

On the other hand, I just noticed that `nextind(A, ind::Base.SCartesianIndex2)` is not defined, so `dest` can't be a `reinterpret(reshape, args...)` array...

Not sure if it's OK to add the related definition in this PR? Or just use `eachindex(IndexStyle(dest) isa IndexLinear ? IndexLinear() : IndexCartesian(), dest)`.
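That workaround expression can be checked quickly (a sketch; `pick_inds` is a hypothetical helper name, not proposed API):

```julia
# Pick an index iterator that always supports nextind: a plain linear range
# for IndexLinear arrays, CartesianIndices otherwise. This sidesteps index
# types like Base.SCartesianIndex2 (from reinterpret(reshape, ...)) that
# have no nextind method.
pick_inds(dest) = eachindex(IndexStyle(dest) isa IndexLinear ? IndexLinear() : IndexCartesian(), dest)

b = view(rand(4, 4), 1:3, 1:4)   # non-contiguous view, so IndexCartesian
I = first(pick_inds(b))          # CartesianIndex(1, 1)
nextind(b, I)                    # defined for CartesianIndex, column-major order
```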
Ok, good point re `firstindex`. So that's just handling offsets right now.

On the Xeon, "g" is still an improvement over `Base.copyto!`, even if slower than "f" there. Is that true on your computer too?
Well, that's true; at least we eliminated the bounds check within the loop.

I have no M1 machine, so I can't test myself. Would you mind benchmarking the following?

```julia
a = view(randn(100, 100), 1:100, 1:100)
b = view(a, 1:99, 1:100)

f_each(x) = begin
    r = 0.0
    @inbounds for i in eachindex(x)
        r += x[i]
    end
    r
end

f_linear(x) = begin
    r = 0.0
    @inbounds for i in firstindex(x):lastindex(x)
        r += x[i]
    end
    r
end

@btime f_each($a)
@btime f_linear($a)
@btime f_each($b)
@btime f_linear($b)
```
Sure:

```julia
julia> @btime f_each($a)
  min 9.250 μs, mean 9.401 μs (0 allocations)
32.23877902877461

julia> @btime f_linear($a)
  min 9.250 μs, mean 9.401 μs (0 allocations)
32.23877902877461

julia> @btime f_each($b)
  min 9.166 μs, mean 9.315 μs (0 allocations)
22.794925499363792

julia> @btime f_linear($b)
  min 9.166 μs, mean 9.286 μs (0 allocations)
22.794925499363792
```

vs. the Xeon:

```julia
julia> @btime f_each($a)
  17.736 μs (0 allocations: 0 bytes)
153.39744409371883

julia> @btime f_linear($a)
  82.459 μs (0 allocations: 0 bytes)
153.39744409371883

julia> @btime f_each($b)
  17.586 μs (0 allocations: 0 bytes)
153.27406012213328

julia> @btime f_linear($b)
  81.627 μs (0 allocations: 0 bytes)
153.27406012213328
```
On my machine, the result is

```julia
@btime f_each($a)    # 9.100 μs (0 allocations: 0 bytes)
@btime f_linear($a)  # 22.700 μs (0 allocations: 0 bytes)
@btime f_each($b)    # 9.000 μs (0 allocations: 0 bytes)
@btime f_linear($b)  # 22.500 μs (0 allocations: 0 bytes)
```

Is this some "dark magic" of the M1? (Maybe we don't need `IndexCartesian` on the M1?)

Something that might be related: the M1 is about 10x faster than Intel at integer division, with a throughput of one 64-bit divide every two cycles. If this is true, I guess `reshape` would be faster if we omitted the current optimization via `MultiplicativeInverse`.
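For context, the multiplicative-inverse trick can be exercised directly; note this relies on the internal `Base.MultiplicativeInverses` module, so it's an assumption that this internal API stays as-is:

```julia
# reshape/SubArray index math avoids hardware integer division by using a
# precomputed multiply-and-shift "inverse" of the divisor; on chips with
# fast dividers (like the M1), plain div may win this tradeoff.
using Base.MultiplicativeInverses: SignedMultiplicativeInverse

d = SignedMultiplicativeInverse(100)      # precompute the inverse of 100
div(12345, d) == div(12345, 100)          # same quotient, no divide instruction
```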
copyto!(dst, n, src)
copyto!(dst, n, src) and copyto!(dest, src)
base/abstractarray.jl (Outdated)
```julia
throw(ArgumentError("destination has fewer elements than required"))
dest[y[1]] = x
y = iterate(destiter, y[2])
i = Int(firstindex(dest))
```
It doesn't seem right to me to switch from `eachindex` to `firstindex` just because `src` has a length. Also, in the past we have avoided annotating `@inbounds` in generic methods like this.
Ok, I will fiddle a bit more. At the moment, removing `@inbounds` removes the whole speed advantage of this method. But perhaps there's a smarter way.
And, I think the PR does not handle length zero correctly right now. It breaks these, which work on master but don't seem to have tests:

```julia
julia> firstindex(Int[])
1

julia> copyto!(Int[], ())
Int64[]

julia> copyto!(Int[], 1, ())
Int64[]
```
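These empty cases could be pinned down with small regression tests, e.g.:

```julia
using Test
# Empty source and/or destination must round-trip unchanged.
@test firstindex(Int[]) == 1
@test copyto!(Int[], ()) == Int[]
@test copyto!(Int[], 1, ()) == Int[]
```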
I've removed this 2-arg method, as in the cases I was testing I can't see a way to speed it up without using linear indexing & `@inbounds`.

I've also added tests for these empty cases, and fixed the 3-arg method to pass them.
copyto!(dst, n, src) and copyto!(dest, src)
copyto!(dst, n, src)
Pre-1.8 bump?

What's the status on this? Is it ready to merge?

@JeffBezanson posted some concerns on the review above. Although that part of the change has been reverted, I'm not sure whether the rest is OK with him.

Yes, that's accurate. I reverted to the smallest initial change, so this only speeds up `copyto!(dst, n, src)`.
This speeds up

```julia
@btime copyto!($(rand(3,10)), 7, (1.0, 2.0, 3.0));
```

from 2.708 ns to 1.375 ns.

By adding `@inline` and `@boundscheck`, we can get

```julia
@btime @inbounds copyto!($(rand(3,10)), 7, (1.0, 2.0, 3.0));
```

down to 0.875 ns. But this didn't seem to improve anything when used within a loop in #43334, so maybe that's not necessary.