-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
^(::Float, ::Integer) #42031
^(::Float, ::Integer) #42031
Conversation
uses compensated power by squaring for Float64. For Float32, this just using power by squaring with Float64. .5 ULP accuracy for both types.
That's a dramatic increase in memory use (several orders of magnitude) for only halving the computational time. Also, the accuracy should be proven by numerical analysis for all input and with testing only showing that there are no known cases of a particular platform violating the assumptions of that analysis. |
Good point. The allocations were because I was incorrectly refering to a method which didn't exist (old name for same method) which was creating a type instability. Fixing that makes the result a lot faster too. Benchmarks updated. I still need to do the analysis on this, but assuming I haven't screwed up the implementation, it should be guaranteed to be very accurate. The Float32 version performs a maximum of 60 Float64 multiplications and additions (without any that could have catastrophic cancellation, and the Float64 method performs a power by squaring using compensated summation that should ensure very small error. |
The new benchmarks show no allocations which I can't replicate locally (both 1.6.2 and at master |
What are you seeing? |
julia> @benchmark @. y64 = $x64^101
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 66.955 μs … 132.993 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 67.261 μs ┊ GC (median): 0.00%
Time (mean ± σ): 68.300 μs ± 4.078 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▃ ▁▇▃ ▁
██▅████▆▅▇▇▇▆▄▆▃▅▃▃▁▁▁▁▃▁▁▃▄▃▅▃▃▅▃▃▆▅▅▅▄▅▆▆▆▆▆▆▄▅▄▅▆▅▄▄▄▁▄▅▅ █
67 μs Histogram: log(frequency) by time 90.3 μs <
Memory estimate: 48 bytes, allocs estimate: 3.
julia> versioninfo()
Julia Version 1.8.0-DEV.432
Commit 6814f2b3dc* (2021-08-27 14:01 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) E-2288G CPU @ 3.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake) |
The benchmark above has
This is on an M1 mac. |
Co-authored-by: Kristoffer Carlsson <[email protected]>
@palday I don't have a final answer for error analysis yet, but here's what I have so far. In this case, the algorithm simplifies to.
Theoretically, we could get an upper bound on this error, by adding up all the terms (at their maximum), but the intuitive answer is that since every time, the loop introduce very roughly |
FMA is in basically all platforms, CPUs and Nvidia GPUs, since at least 2011 (and in IEEE 754-2008 revision). Can't we just swallow the slowdown in older CPUs, make a policy change to use FMA (not just for this) given the huge upside on newer (LTS will for a year+ work for older platforms without slowdown)? Does LLVM generate the slower code, instead of the instruction, with same accuracy for older platforms? Could we make |
@PallHaraldsson It's not uncommon for big academic compute infrastructure to have processors that are 10 years out of date, but to have a lot of them in a high memory set up. That way you can benefit from massive parallelism and they're not spending millions every couple of years to upgrade. |
Adding a triage label to decide whether it's acceptable for to have |
The new implementation needs |
This actually already returns |
as it should because of
This isn't tested with |
Are you sure? I'm not at a computer right now so can't check but I'm 99% confident |
From triage:
Some of us think a good ordering is to do option 3 first, and whenever 1 is implemented those systems will just get faster. |
@samuel3008 I figured out what was happening. I was hitting literal_pow which was giving different results. |
I was trying to do this as efficiently as possible. using |
All tests passing. Anyone have objections to me merging in the next few days? |
Merging tomorrow unless there are objections. |
Don't forget to squash :) |
[EDIT: Again a bug, looking into it] I did extensive benchmarking of your version vs mine, and I didn't see a benefit to your extra code for e.g. power of three (so I would drop all special cases for that and specific powers to keep machine code short, since it's inlined):
|
The reason to special case p=3 isn't for performance. It's so that |
The reason I added the code here rather than remove the special case for literal_pow is that I there are a lot of people who rely on the fact that
which is a 2x increase in instructions. |
If you wanted to avoid fma (for ancient hardware, avoiding slowness), then my code does that for power of 3 and 2 because of "n -= 2", while yours only for power of 3. You state "lot of people who rely on the fact that x^3 is really fast" which my code seems to be, but above you wrote "special case p=3 isn't for performance". |
I'm not doing this to avoid slow fma. I'm doing it because |
FWIW, it is probably a good idea to run PkgEval on PRs like this that changes very low level functions. |
uses compensated power by squaring for Float64. For Float32, this just using power by squaring with Float64. .5 ULP accuracy for both types.
Old timings:
New timings:
Furthermore, this gets .5 ULP accuracy on all inputs I've tested so far, and is even faster than what I've shown for smaller powers.