faster `inv` for normal sized `ComplexF64` #47255
Conversation
```julia
julia> xc64 = rand(ComplexF64, 1024); yc64 = similar(xc64);

# Old
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.467 μs … 10.603 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.983 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.838 μs ± 368.716 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂ ██▁                                                      ▂
  ██▄▅▄▄▃▁▅▆▇▇▆▇▆▆▆▇▇▇▇███▇▇▇█▇███▇▆▆▆▅▆▅▇▆▆▇▇▅▄▅▃▅▄▅▃▅▄▁▃▃▁▃ █
  5.47 μs      Histogram: log(frequency) by time      6.86 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# New
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.008 μs … 6.809 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.118 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.209 μs ± 301.481 ns ┊ GC (mean ± σ): 0.00% ± 0.00%

  ▆▆▄█▂▄█▅▂▂▃▁ ▁▂                                           ▁ ▂
  ██████████████████▇███▇▆▆▆▆▆▆▅▆▅▆▆▅▅▆▆▆▅▅▆▅▆▆▅▄▅▅▅▄▅▂▄▅▅▃▂▄ █
  2.01 μs      Histogram: log(frequency) by time      3.73 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> xc64 = 1e200 .+ rand(ComplexF64, 1024); yc64 = similar(xc64);

# Old
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.465 μs … 10.222 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.482 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.749 μs ± 360.455 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                          ▇▂                               ▁
  █▆▃▄▄▅▃▄▄▅▅▅▆▇▅▆▆▆▆▇▆▇▄▅▆▆▆██▇▆▇▇▇▇█▆▇▇▇▆▆▇▇▇▆▆▆▅▆▆▆▆▆▄▅▅▄▄ █
  5.46 μs      Histogram: log(frequency) by time      6.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# New
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.683 μs … 18.619 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.214 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.033 μs ± 479.948 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    ▅█▁▁                                                 ▁ ▂
  █▅▄▃▄▆▇▇▇█▇▇████████▇▇███▇▅▆▅▅▆▅▄▁▄▃▄▅▃▄▄▃▄▅▄▄▁▁▃▁▁▁▄▁▁▁▁▅▆ █
  5.68 μs      Histogram: log(frequency) by time      8.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
So for most values that will appear in practice this is 2x faster (but it's 20% slower for values that risk overflow).
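The speedup comes from dispatching to a cheap formula when it is safe. As a hedged sketch of that structure (the thresholds, names, and scaling below are illustrative assumptions, not the PR's actual code or constants):

```julia
# Hedged sketch of a fast-path/slow-path dispatch for complex inversion.
# `sketch_inv` is an illustrative name, not a function in Base.
function sketch_inv(z::ComplexF64)
    c, d = reim(z)
    m = max(abs(c), abs(d))
    if 1e-150 < m < 1e150
        # Fast path: the textbook formula is safe because c^2 + d^2 can
        # neither overflow nor underflow to zero in this range.
        return conj(z) / abs2(z)
    else
        # Slow path: rescale so the textbook formula is safe, then undo
        # the scaling. (Sketch only; ignores Inf/NaN/zero inputs.)
        s = inv(m)
        zs = z * s
        return (conj(zs) / abs2(zs)) * s
    end
end
```

This only illustrates why the common case gets cheaper while overflow-prone inputs pay for an extra branch and scaling.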
Orthogonal to the changes you've made (but among the same code), I see a benefit to the following substitution:

```julia
function robust_cinv(c::Float64, d::Float64)
    r = d/c
    z = muladd(d, r, c)
    p = one(r)/z
    q = -r/z
    return p, q
end
```

The innovation here is to permit the … Without your change, this improves the benchmarked runtime from 7.6 ns to 5.4 ns on my machine. For comparison, your version on your new fast path benchmarks at 4.1 ns. |
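The helper above only handles the branch where the ratio `d/c` is small. As a rough sketch of how it slots into a full complex inversion (assuming Smith's algorithm; `smith_cinv` is an illustrative name, not Base's actual code), the other branch reuses the same helper with the arguments swapped:

```julia
# Helper from the comment above: inverts c + d*im assuming |d| <= |c|.
function robust_cinv(c::Float64, d::Float64)
    r = d / c
    z = muladd(d, r, c)   # z = c + d*r without an intermediate rounding
    p = one(r) / z
    q = -r / z
    return p, q
end

# Sketch of a full inversion built on it (Smith's algorithm).
function smith_cinv(w::ComplexF64)
    c, d = reim(w)
    # Keep the ratio |r| <= 1 by dividing the smaller component by the larger.
    if abs(d) <= abs(c)
        p, q = robust_cinv(c, d)
        return complex(p, q)
    else
        p, q = robust_cinv(d, c)
        return complex(-q, -p)
    end
end
```

Extreme magnitudes, zeros, and non-finite inputs still need the scaled slow path; this sketch only shows where the helper sits.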
Vectorized divide should almost always be slower. |
Of course division is slower than multiplication. But what's at issue here is doing two independent divisions versus doing a scalar division followed by a scalar multiplication that depends on its result. I benchmarked the change and it was definitely faster for me. Did you see something different? Using your benchmark on my machine, now:

```julia
julia> xc64 = rand(ComplexF64, 1024); yc64 = similar(xc64);

# v1.8.0
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.640 μs … 22.440 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.720 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.835 μs ±  1.002 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▂▁                                                       ▁
  ████▆▆▆▄▄▄▃▄▁▃▁▁▁▁▃▃▁▄▁▃▁▁▃▁▅▅▆▅▃▄▄▄▁▄▁▃▃▅▁▃▄▃▃▁▁▁▁▃▁▁▁▁▄▅ █
  6.64 μs      Histogram: log(frequency) by time      12.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# changes to robust_cinv
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.871 μs … 17.814 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.043 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.111 μs ± 734.624 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▇█▃                                                        ▂
  █████▇▆▆▃▅▃▃▃▁▁▃▁▁▁▁▄▃▁▃▁▃▅▄▄▅▄▅▁▃▁▁▁▃▄▄▄▄▅▃▃▁▃▁▃▁▃▄▃▁▃▁▁▅▄ █
  4.87 μs      Histogram: log(frequency) by time      9.71 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# this PR
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.667 μs … 11.567 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.689 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.755 μs ± 532.814 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  █▇▆▆▄▄▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▆▆▆▄▅▄▄▃▁▃▃▃▆▄▄▃▄▄▄▆▅▁▃▄▃▁▁▁▁▄▄ █
  2.67 μs      Histogram: log(frequency) by time      5.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

EDIT: messed up the benchmarks. Fixed now. Anyway, I should probably just make a separate PR. |
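For reference, the two formulations under discussion can be written side by side. This is a sketch mirroring the `robust_cinv` snippet quoted earlier in the thread, not Base's exact code: the first variant issues two divisions with no data dependence between them, while the second has a multiplication that must wait for the division's result.

```julia
# Variant A: two independent divisions sharing the same denominator.
function cinv_two_divs(c::Float64, d::Float64)
    r = d / c
    z = muladd(d, r, c)
    return one(r) / z, -r / z
end

# Variant B: one division whose result feeds a dependent multiplication.
# May differ from variant A by a rounding error in the last place.
function cinv_div_then_mul(c::Float64, d::Float64)
    r = d / c
    z = muladd(d, r, c)
    invz = one(r) / z
    return invz, -r * invz
end
```

Which is faster depends on the target's divider throughput and latency, which is exactly why the benchmarks above disagree with intuition.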
You're right. I'll add it to this PR. |
Can someone review? I believe this is ready to merge. |
One philosophical nit: That said, the current slow path is frustratingly slow -- especially since it's unnecessary for "typical" numbers. Further, I would imagine that the fast path version is generally more accurate. I think this PR is ultimately an improvement. Unless someone can raise a credible and practical reason to share my anxiety on the differing computational paths, this looks good to me. |
Yeah I really wish I could think of a better slow path which would let us delete the fast path, but I somewhat doubt it exists. |
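To make the anxiety about the two paths concrete, here is a small illustration (an assumption for exposition, not Base's implementation) of why the fast path cannot simply be the only path: the textbook formula overflows internally for inputs whose inverse is perfectly representable.

```julia
# Textbook complex inversion: conj(z) / |z|^2. Overflows when |z|^2 does.
naive_inv(z::ComplexF64) = conj(z) / abs2(z)

z = 1e200 + 1e200im
naive_inv(z)   # abs2(z) overflows to Inf, so this collapses to 0.0 - 0.0im
inv(z)         # the scaled slow path returns ≈ 5.0e-201 - 5.0e-201im
```

Any single-path replacement would have to match the fast path's speed while still handling such inputs, which is the hard part.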
* faster `inv` for normal sized `ComplexF64`
The other recursion coefficients were already all using only real number types, but the initial condition was forgotten. This has worked thus far, but a change in the upcoming Julia v1.9 (presumably JuliaLang/julia#47255) broke the exact correspondence between real inputs and real-axis complex inputs for the spherical normalization. Specifically, on Julia 1.8:

```julia
julia> r = inv(sqrt(4convert(Float64, π)))
0.28209479177387814

julia> r - real(inv(sqrt(4convert(ComplexF64, π))))
0.0
```

while with Julia 1.9:

```julia
julia> r = inv(sqrt(4convert(Float64, π)))
0.28209479177387814

julia> r - real(inv(sqrt(4convert(ComplexF64, π))))
-5.551115123125783e-17
```

The simple change that allows the tests to pass again is to just ask for the initial condition to be calculated in the appropriate real number type.
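The fix described above can be sketched as follows (`initial_condition` is an illustrative name, not the package's actual API; the assumption is that the initial condition is `1/sqrt(4π)`):

```julia
# Compute the recursion's initial condition in the real part of the element
# type, so ComplexF64 inputs reuse exactly the Float64 arithmetic path.
initial_condition(::Type{T}) where {T<:Number} = inv(sqrt(4convert(real(T), π)))

initial_condition(Float64)     # 0.28209479177387814
initial_condition(ComplexF64)  # the identical Float64 value; no complex sqrt/inv
```

Since `real(ComplexF64) === Float64`, both calls go through the same real-valued `sqrt` and `inv`, restoring the bit-exact correspondence regardless of how complex `inv` is implemented.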
Found by https://discourse.julialang.org/t/how-to-optimize-computation-within-vectorized-list-operation-and-large-array/89008