
faster inv for normal sized ComplexF64 #47255

Merged (7 commits) on Nov 9, 2022

Conversation

@oscardssmith (Member) commented Oct 20, 2022

Found by https://discourse.julialang.org/t/how-to-optimize-computation-within-vectorized-list-operation-and-large-array/89008


```
julia> xc64 = rand(ComplexF64, 1024); yc64=similar(xc64);
#Old
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.467 μs …  10.603 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.983 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.838 μs ± 368.716 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂                   ██▁                                    ▂
  ██▄▅▄▄▃▁▅▆▇▇▆▇▆▆▆▇▇▇▇███▇▇▇█▇███▇▆▆▆▅▆▅▇▆▆▇▇▅▄▅▃▅▄▅▃▅▄▁▃▃▁▃ █
  5.47 μs      Histogram: log(frequency) by time      6.86 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
#new
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.008 μs …   6.809 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.118 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.209 μs ± 301.481 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▆▄█▂▄█▅▂▂▃▁ ▁▂ ▁                                           ▂
  ██████████████████▇███▇▆▆▆▆▆▆▅▆▅▆▆▅▅▆▆▆▅▅▆▅▆▆▅▄▅▅▅▄▅▂▄▅▅▃▂▄ █
  2.01 μs      Histogram: log(frequency) by time      3.73 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> xc64 = 1e200.+rand(ComplexF64, 1024); yc64=similar(xc64);
#old
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.465 μs …  10.222 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.482 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.749 μs ± 360.455 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                          ▇▂                               ▁
  █▆▃▄▄▅▃▄▄▅▅▅▆▇▅▆▆▆▆▇▆▇▄▅▆▆▆██▇▆▇▇▇▇█▆▇▇▇▆▆▇▇▇▆▆▆▅▆▆▆▆▆▄▅▅▄▄ █
  5.46 μs      Histogram: log(frequency) by time      6.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
#new
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.683 μs …  18.619 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.214 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.033 μs ± 479.948 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █           ▅█▁▁        ▁                                   ▂
  █▅▄▃▄▆▇▇▇█▇▇████████▇▇███▇▅▆▅▅▆▅▄▁▄▃▄▅▃▄▄▃▄▅▄▄▁▁▃▁▁▁▄▁▁▁▁▅▆ █
  5.68 μs      Histogram: log(frequency) by time       8.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
So for most values that appear in practice this is roughly 2x faster (but about 20% slower for values that risk overflow).
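For readers skimming the PR, here is a minimal sketch of the fast-path strategy being benchmarked (illustrative only: `fast_inv` and `robust_inv` are hypothetical names, not the PR's actual code, and zero/NaN/Inf handling is omitted):

```julia
# Sketch: use the cheap conj(z)/abs2(z) formula when c^2 + d^2 is safely
# representable, and fall back to a power-of-two-scaled path otherwise.
function fast_inv(z::ComplexF64)
    c, d = reim(z)
    s = abs2(z)                       # c^2 + d^2
    if isfinite(s) && s >= floatmin(Float64)
        return complex(c / s, -d / s) # fast path: two independent divisions
    end
    return robust_inv(z)
end

function robust_inv(z::ComplexF64)
    # Scale by an exact power of two so abs2 neither overflows nor underflows.
    e = -exponent(max(abs(real(z)), abs(imag(z))))
    zs = z * exp2(e)
    s = abs2(zs)
    return complex(real(zs) / s, -imag(zs) / s) * exp2(e)
end
```

Since power-of-two scaling is exact, the fallback recovers `inv(z)` as `inv(z * 2^e) * 2^e`; only the rounding path differs between the two branches.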
@oscardssmith added labels performance (Must go faster), maths (Mathematical functions), complex (Complex numbers) on Oct 20, 2022
@mikmoore (Contributor)
Orthogonal to the changes you've made (but in the same code), I see a benefit in the following substitution:

```julia
function robust_cinv(c::Float64, d::Float64)
    r = d/c
    z = muladd(d, r, c)
    p = one(r)/z
    q = -r/z
    return p, q
end
```

The innovation here is to permit the p and q divisions to be done with a vectorized division rather than a sequential divide and multiply. This seems to perform slightly better and might avoid some minor inaccuracies compared to the existing invert-and-multiply for q.

Without your change, this improves the benchmarked runtime from 7.6ns to 5.4ns on my machine. For comparison, your version on your new fastpath benchmarks at 4.1ns.
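For context, a sketch of how `robust_cinv` composes into a full slow-path inverse (the `slow_inv` wrapper is illustrative only; Base's actual slow path also applies anti-overflow scaling, which is omitted here):

```julia
function robust_cinv(c::Float64, d::Float64)
    r = d/c
    z = muladd(d, r, c)   # z = (c^2 + d^2) / c
    p = one(r)/z          # real part:  c / (c^2 + d^2)
    q = -r/z              # imag part: -d / (c^2 + d^2)
    return p, q
end

# Illustrative wrapper: divide by the larger component so that |r| <= 1.
function slow_inv(w::ComplexF64)
    c, d = reim(w)
    if abs(c) >= abs(d)
        p, q = robust_cinv(c, d)
        return complex(p, q)
    else
        p, q = robust_cinv(d, c)
        return complex(-q, -p)
    end
end
```

Because `p` and `q` share the same divisor `z`, the two divisions are independent and can be issued together, which is the point of the substitution above.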

@oscardssmith (Member, Author)
Vectorized divide should almost always be slower.

@mikmoore (Contributor) commented Oct 20, 2022

Of course division is slower than multiplication. But what's at question here is doing two divisions simultaneously versus doing a scalar division followed by a scalar multiplication that depends on the result.

I benchmarked the change and it was definitely faster for me. Did you see something different? Using your benchmark on my machine, now:

```
julia> xc64 = rand(ComplexF64, 1024); yc64=similar(xc64);

# v1.8.0
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.640 μs … 22.440 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.720 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.835 μs ±  1.002 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▂▁                                                       ▁
  ████▆▆▆▄▄▄▃▄▁▃▁▁▁▁▃▃▁▄▁▃▁▁▃▁▅▅▆▅▃▄▄▄▁▄▁▃▃▅▁▃▄▃▃▁▁▁▁▃▁▁▁▁▄▅ █
  6.64 μs      Histogram: log(frequency) by time     12.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# changes to robust_cinv
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.871 μs …  17.814 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.043 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.111 μs ± 734.624 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▇█▃                                                        ▂
  █████▇▆▆▃▅▃▃▃▁▁▃▁▁▁▁▄▃▁▃▁▃▅▄▄▅▄▅▁▃▁▁▁▃▄▄▄▄▅▃▃▁▃▁▃▁▃▄▃▁▃▁▁▅▄ █
  4.87 μs      Histogram: log(frequency) by time      9.71 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# this PR
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.667 μs …  11.567 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.689 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.755 μs ± 532.814 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  █▇▆▆▄▄▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▆▆▆▄▅▄▄▃▁▃▃▃▆▄▄▃▄▄▄▆▅▁▃▄▃▁▁▁▁▄▄ █
  2.67 μs      Histogram: log(frequency) by time       5.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

EDIT: messed up the benchmarks. Fixed now. Anyway, I should probably just make a separate PR.

@oscardssmith (Member, Author)
You're right. I'll add it to this PR.

@oscardssmith (Member, Author)
Can someone review? I believe this is ready to merge.

@mikmoore (Contributor) commented Nov 7, 2022
One philosophical nit:

Without loss of generality, assume abs(d) <= abs(c) and 0 < abs(c), and assume fma is not used by the muladd calls (although this argument only depends on it being applied consistently). My tiny anxiety with this fast path is that, depending purely on the scale of the argument, inv(complex(c,d)) is computed either as complex(c,-d) / (d^2 + c^2) (new fast path) or as complex(1, -d/c) / (d/c*d + c) (slow path, ignoring the anti-overflow scaling -- since the scales are powers of 2, that scaling would not change the result). This introduces the possibility that inv(x)/s != inv(x*s) even when the disagreement would have been preventable (e.g. when s = 2.0^n).

But by the time one is into complex arithmetic, I don't think one necessarily expects this to hold even for powers of 2. It can introduce some irritating edge cases, although I can't really assert they're anything someone would care about.

That said, the current slow path is frustratingly slow -- especially since it's unnecessary for "typical" numbers. Further, I would imagine that the fast path version is generally more accurate. I think this PR is ultimately an improvement.

Unless someone can raise a credible and practical reason to share my anxiety on the differing computational paths, this looks good to me.
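The two computational paths in question can be written out directly (a toy comparison with hypothetical values; whether they actually disagree in the last bit depends on how each operation rounds):

```julia
c, d = 3.0, 4.0
# New fast path: conj(z) / abs2(z), two independent divisions.
fast = complex(c, -d) / (c^2 + d^2)
# Slow-path shape (anti-overflow scaling omitted, since powers of 2 are exact).
r = d / c
slow = complex(one(r), -r) / muladd(d, r, c)
# Both approximate inv(complex(c, d)); a scale change that switches which
# path is taken can therefore change the last bit of the result.
```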

@oscardssmith (Member, Author)
Yeah, I really wish I could think of a better slow path that would let us delete the fast path, but I somewhat doubt one exists.

@oscardssmith oscardssmith merged commit 83592cf into master Nov 9, 2022
@oscardssmith oscardssmith deleted the oscardssmith-faster-ComplexF64-inv branch November 9, 2022 16:39
KristofferC pushed a commit that referenced this pull request Nov 13, 2022
* faster `inv` for normal sized `ComplexF64`
jmert added a commit to jmert/AssociatedLegendrePolynomials.jl that referenced this pull request Nov 19, 2022
The other recursion coefficients were already all using only real
number types, but the initial condition was forgotten. This has worked
thus far, but a change in the upcoming Julia v1.9 (presumably
JuliaLang/julia#47255) broke the exact correspondence between real
inputs and real-axis complex inputs for the spherical normalization.

Specifically, on Julia 1.8:
```julia
julia> r = inv(sqrt(4convert(Float64, π)))
0.28209479177387814

julia> r - real(inv(sqrt(4convert(ComplexF64, π))))
0.0
```
while with Julia 1.9:
```julia
julia> r = inv(sqrt(4convert(Float64, π)))
0.28209479177387814

julia> r - real(inv(sqrt(4convert(ComplexF64, π))))
-5.551115123125783e-17
```

The simple change that allows the tests to pass again is to just ask
for the initial condition to be calculated in the appropriate real
number type.
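A sketch of the kind of fix described, with hypothetical function names (`initcond_old`/`initcond_new` are not the package's actual identifiers):

```julia
# Before: the initial condition inherits the (possibly complex) working type,
# so ComplexF64 inputs route inv(sqrt(...)) through complex arithmetic.
initcond_old(::Type{T}) where {T<:Number} = inv(sqrt(4 * convert(T, π)))

# After: compute in the underlying real type, then convert to T.
initcond_new(::Type{T}) where {T<:Number} =
    convert(T, inv(sqrt(4 * convert(real(T), π))))
```

With this shape of fix, the `ComplexF64` initial condition agrees exactly with the `Float64` value `0.28209479177387814`, restoring the correspondence the tests relied on.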
jmert added a second commit with the same message to jmert/AssociatedLegendrePolynomials.jl that referenced this pull request Nov 19, 2022.