
faster inv for normal sized ComplexF64 #47255

Merged (7 commits) on Nov 9, 2022

Conversation

@oscardssmith (Member) commented Oct 20, 2022

Found by https://discourse.julialang.org/t/how-to-optimize-computation-within-vectorized-list-operation-and-large-array/89008


```
julia> xc64 = rand(ComplexF64, 1024); yc64=similar(xc64);
#Old
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.467 μs …  10.603 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.983 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.838 μs ± 368.716 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂                   ██▁                                    ▂
  ██▄▅▄▄▃▁▅▆▇▇▆▇▆▆▆▇▇▇▇███▇▇▇█▇███▇▆▆▆▅▆▅▇▆▆▇▇▅▄▅▃▅▄▅▃▅▄▁▃▃▁▃ █
  5.47 μs      Histogram: log(frequency) by time      6.86 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
#new
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.008 μs …   6.809 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.118 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.209 μs ± 301.481 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▆▄█▂▄█▅▂▂▃▁ ▁▂ ▁                                           ▂
  ██████████████████▇███▇▆▆▆▆▆▆▅▆▅▆▆▅▅▆▆▆▅▅▆▅▆▆▅▄▅▅▅▄▅▂▄▅▅▃▂▄ █
  2.01 μs      Histogram: log(frequency) by time      3.73 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> xc64 = 1e200.+rand(ComplexF64, 1024); yc64=similar(xc64);
#old
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.465 μs …  10.222 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.482 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.749 μs ± 360.455 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                          ▇▂                               ▁
  █▆▃▄▄▅▃▄▄▅▅▅▆▇▅▆▆▆▆▇▆▇▄▅▆▆▆██▇▆▇▇▇▇█▆▇▇▇▆▆▇▇▇▆▆▆▅▆▆▆▆▆▄▅▅▄▄ █
  5.46 μs      Histogram: log(frequency) by time      6.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
#new
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.683 μs …  18.619 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.214 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.033 μs ± 479.948 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █           ▅█▁▁        ▁                                   ▂
  █▅▄▃▄▆▇▇▇█▇▇████████▇▇███▇▅▆▅▅▆▅▄▁▄▃▄▅▃▄▄▃▄▅▄▄▁▁▃▁▁▁▄▁▁▁▁▅▆ █
  5.68 μs      Histogram: log(frequency) by time       8.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
So for most values that appear in practice this is roughly 2x faster (but about 20% slower for values that risk overflow).
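For readers skimming the PR, here is a minimal sketch of the fast-path strategy being benchmarked (illustrative only: `fast_inv` and `robust_inv` are hypothetical names, not the PR's actual code, and zero/NaN/Inf handling is omitted):

```julia
# Sketch: use the cheap conj(z)/abs2(z) formula when c^2 + d^2 is safely
# representable, and fall back to a power-of-two-scaled path otherwise.
function fast_inv(z::ComplexF64)
    c, d = reim(z)
    s = abs2(z)                       # c^2 + d^2
    if isfinite(s) && s >= floatmin(Float64)
        return complex(c / s, -d / s) # fast path: two independent divisions
    end
    return robust_inv(z)
end

function robust_inv(z::ComplexF64)
    # Scale by an exact power of two so abs2 neither overflows nor underflows.
    e = -exponent(max(abs(real(z)), abs(imag(z))))
    zs = z * exp2(e)
    s = abs2(zs)
    return complex(real(zs) / s, -imag(zs) / s) * exp2(e)
end
```

Since power-of-two scaling is exact, the fallback recovers `inv(z)` as `inv(z * 2^e) * 2^e`; only the rounding path differs between the two branches.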
@oscardssmith added labels performance (Must go faster), maths (Mathematical functions), complex (Complex numbers) on Oct 20, 2022
@mikmoore (Contributor)
Orthogonal to the changes you've made (but in the same code), I see a benefit in the following substitution:

```julia
function robust_cinv(c::Float64, d::Float64)
    r = d/c
    z = muladd(d, r, c)
    p = one(r)/z
    q = -r/z
    return p, q
end
```

The innovation here is to permit the p and q divisions to be done with a vectorized division rather than a sequential divide and multiply. This seems to perform slightly better and might avoid some minor inaccuracies compared to the existing invert-and-multiply for q.

Without your change, this improves the benchmarked runtime from 7.6ns to 5.4ns on my machine. For comparison, your version on your new fastpath benchmarks at 4.1ns.
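For context, a sketch of how `robust_cinv` composes into a full slow-path inverse (the `slow_inv` wrapper is illustrative only; Base's actual slow path also applies anti-overflow scaling, which is omitted here):

```julia
function robust_cinv(c::Float64, d::Float64)
    r = d/c
    z = muladd(d, r, c)   # z = (c^2 + d^2) / c
    p = one(r)/z          # real part:  c / (c^2 + d^2)
    q = -r/z              # imag part: -d / (c^2 + d^2)
    return p, q
end

# Illustrative wrapper: divide by the larger component so that |r| <= 1.
function slow_inv(w::ComplexF64)
    c, d = reim(w)
    if abs(c) >= abs(d)
        p, q = robust_cinv(c, d)
        return complex(p, q)
    else
        p, q = robust_cinv(d, c)
        return complex(-q, -p)
    end
end
```

Because `p` and `q` share the same divisor `z`, the two divisions are independent and can be issued together, which is the point of the substitution above.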

@oscardssmith (Member, Author)
Vectorized divide should almost always be slower.

@mikmoore (Contributor) commented Oct 20, 2022

Of course division is slower than multiplication. But what's at question here is doing two divisions simultaneously versus doing a scalar division followed by a scalar multiplication that depends on the result.

I benchmarked the change and it was definitely faster for me. Did you see something different? Using your benchmark on my machine, now:

```
julia> xc64 = rand(ComplexF64, 1024); yc64=similar(xc64);

# v1.8.0
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.640 μs … 22.440 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.720 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.835 μs ±  1.002 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▂▁                                                       ▁
  ████▆▆▆▄▄▄▃▄▁▃▁▁▁▁▃▃▁▄▁▃▁▁▃▁▅▅▆▅▃▄▄▄▁▄▁▃▃▅▁▃▄▃▃▁▁▁▁▃▁▁▁▁▄▅ █
  6.64 μs      Histogram: log(frequency) by time     12.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# changes to robust_cinv
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.871 μs …  17.814 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.043 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.111 μs ± 734.624 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▇█▃                                                        ▂
  █████▇▆▆▃▅▃▃▃▁▁▃▁▁▁▁▄▃▁▃▁▃▅▄▄▅▄▅▁▃▁▁▁▃▄▄▄▄▅▃▃▁▃▁▃▁▃▄▃▁▃▁▁▅▄ █
  4.87 μs      Histogram: log(frequency) by time      9.71 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# this PR
julia> @benchmark @. $yc64 = inv($xc64)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.667 μs …  11.567 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.689 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.755 μs ± 532.814 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  █▇▆▆▄▄▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▆▆▆▄▅▄▄▃▁▃▃▃▆▄▄▃▄▄▄▆▅▁▃▄▃▁▁▁▁▄▄ █
  2.67 μs      Histogram: log(frequency) by time       5.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

EDIT: messed up the benchmarks. Fixed now. Anyway, I should probably just make a separate PR.

@oscardssmith (Member, Author)
You're right. I'll add it to this PR.

@oscardssmith (Member, Author)
Can someone review? I believe this is ready to merge.

@mikmoore (Contributor) commented Nov 7, 2022
One philosophical nit:

Without loss of generality, assume abs(d) <= abs(c) and 0 < abs(c), and assume fma is not used by the muladd calls (although this argument only depends on it being applied consistently). My tiny anxiety with this fast path is that, depending purely on the scale of the argument, inv(complex(c,d)) is computed either as complex(c,-d) / (d^2 + c^2) (new fast path) or as complex(1, -d/c) / (d/c*d + c) (slow path, ignoring the anti-overflow scaling -- since the scales are powers of 2, that scaling would not change the result). This introduces the possibility that inv(x)/s != inv(x*s) even when the disagreement would have been preventable (e.g. when s = 2.0^n).

But by the time one is into complex arithmetic, I don't think one necessarily expects this to hold even for powers of 2. It can introduce some irritating edge cases, although I can't really assert they're anything someone would care about.

That said, the current slow path is frustratingly slow -- especially since it's unnecessary for "typical" numbers. Further, I would imagine that the fast path version is generally more accurate. I think this PR is ultimately an improvement.

Unless someone can raise a credible and practical reason to share my anxiety on the differing computational paths, this looks good to me.
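The two computational paths in question can be written out directly (a toy comparison with hypothetical values; whether they actually disagree in the last bit depends on how each operation rounds):

```julia
c, d = 3.0, 4.0
# New fast path: conj(z) / abs2(z), two independent divisions.
fast = complex(c, -d) / (c^2 + d^2)
# Slow-path shape (anti-overflow scaling omitted, since powers of 2 are exact).
r = d / c
slow = complex(one(r), -r) / muladd(d, r, c)
# Both approximate inv(complex(c, d)); a scale change that switches which
# path is taken can therefore change the last bit of the result.
```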

@oscardssmith (Member, Author)
Yeah, I really wish I could think of a better slow path that would let us delete the fast path, but I somewhat doubt one exists.

@oscardssmith oscardssmith merged commit 83592cf into master Nov 9, 2022
@oscardssmith oscardssmith deleted the oscardssmith-faster-ComplexF64-inv branch November 9, 2022 16:39
KristofferC pushed a commit that referenced this pull request Nov 13, 2022
* faster `inv` for normal sized `ComplexF64`
jmert added a commit to jmert/AssociatedLegendrePolynomials.jl that referenced this pull request Nov 19, 2022
The other recursion coefficients were already all using only real
number types, but the initial condition was forgotten. This has worked
thus far, but a change in the upcoming Julia v1.9 (presumably
JuliaLang/julia#47255) broke the exact correspondence between real
inputs and real-axis complex inputs for the spherical normalization.

Specifically, on Julia 1.8:
```julia
julia> r = inv(sqrt(4convert(Float64, π)))
0.28209479177387814

julia> r - real(inv(sqrt(4convert(ComplexF64, π))))
0.0
```
while with Julia 1.9:
```julia
julia> r = inv(sqrt(4convert(Float64, π)))
0.28209479177387814

julia> r - real(inv(sqrt(4convert(ComplexF64, π))))
-5.551115123125783e-17
```

The simple change that allows the tests to pass again is to just ask
for the initial condition to be calculated in the appropriate real
number type.
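A sketch of the kind of fix described, with hypothetical function names (`initcond_old`/`initcond_new` are not the package's actual identifiers):

```julia
# Before: the initial condition inherits the (possibly complex) working type,
# so ComplexF64 inputs route inv(sqrt(...)) through complex arithmetic.
initcond_old(::Type{T}) where {T<:Number} = inv(sqrt(4 * convert(T, π)))

# After: compute in the underlying real type, then convert to T.
initcond_new(::Type{T}) where {T<:Number} =
    convert(T, inv(sqrt(4 * convert(real(T), π))))
```

With this shape of fix, the `ComplexF64` initial condition agrees exactly with the `Float64` value `0.28209479177387814`, restoring the correspondence the tests relied on.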
jmert added a second commit with the same message to jmert/AssociatedLegendrePolynomials.jl that referenced this pull request Nov 19, 2022.