Consider Optimizing Decimal Operations #93
Comments
@Alexhuszagh Of course a PR is invited!!! That would be a fantastic contribution.
OK, I've just finished the minimal implementation (including correctness checks), so I'll start porting this to C++ now, and will also submit a PR to the Rust fork once that's done. The benchmarks once the implementation was finished are as follows (log scale on the y-axis). There are a few features that you may wish to cherry-pick, so I'll split the implementation into multiple commits with an explanation of each.
This looks like fantastic work!!! You can be certain that I will review with care.
Feature Request
Although the code for parsing decimals is quite nice already, a fair amount of work I've done on lexical (my Rust float parser) shows that when the slow path is invoked, we can get significantly faster performance in a few cases, and a minimal version of this can be quite easy, readable, and correct.
This is due to the use of big-integer arithmetic and a few algorithms optimized for big-integer math when parsing decimal strings, which minimizes the number of operations relative to the decimal class in fast_float. There is more code involved; however, this code relies on fewer magic numbers and instead uses simple big-integer algorithms. The only additional static storage required is ~32 64-bit integers, which is considerably smaller than the decimal implementation's.
This performance gap is quite noticeable with floats like `"8.988465674311580536566680e307"`, where the performance difference goes from ~500 ns/iter (lexical) to ~14.465 us/iter (fast_float). For denormal floats or those with negative exponents like `"8.442911973260991817129021e-309"`, the performance is slightly better, from ~58.073 ns/iter (lexical) to ~69.264 ns/iter (fast_float). This also scales very well with input size: it scales well to 768 digits (767, the max for a double-precision float, plus 1 extra) and beyond.

The actual implementation is quite simple: it uses a fairly minimal subset of big-integer arithmetic, which can be implemented in <700 lines of code, and the parsing algorithms themselves are quite easy to understand and implement. If this is of interest, I'd be happy to submit a PR.
Positive Exponent Implementation
The code required to implement this is written in Rust; obviously, this would be translated to C++.

Here, the exponent is relative to the significant digits, and `ExtendedFloat80` is just `AdjustedMantissa`, or an 80-bit extended-precision float with a biased exponent. Our `bigmant` is just the significant digits parsed as a big integer. `hi64` is a function that gets the high 64 bits of the big integer (for the significant digits) and checks if any bits below are truncated (for rounding), and `EXPONENT_BIAS` is `1075` for a 64-bit float.

This implementation is quite simple: first, we get the max power of the radix (in this case, 10) that can be stored in a limb, or `10^N <= 2^BITS - 1`. For 32-bit limbs this is 9; for 64-bit limbs it is 19. Next, we get the max native integer for this power (10^9 for 32-bit limbs, 10^19 for 64-bit limbs). Then, we parse the maximum number of digits we can into a native limb and add that limb to the big integer, repeating until all the digits are consumed.
In total, we need only a small set of big-integer operations: scalar addition and multiplication, bit shifts, a power function, and comparison.
The entirety of the big-integer algorithms is <700 lines of code, although my version is extensively documented. The only one that's remotely interesting is the power algorithm, which uses a single pre-computed large power of 5 and a small number of pre-computed small powers. It is effectively self-explanatory: use large powers of 5 and the large, grade-school multiplication algorithm to minimize the number of multiplications, then use a small pre-computed table for the remainder. The real power implementation splits this into a power-of-5 multiplication and a left shift (for the power of 2).
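A minimal sketch of that split, assuming little-endian u64 limbs (instead of a large pre-computed multi-limb power of 5, this simplified version just reuses 5^27, the largest power of 5 fitting a single limb; names are illustrative):

```rust
// Sketch of scaling a big integer by 10^exp, split as 5^exp * 2^exp;
// the power-of-2 part becomes a cheap left shift. All names here are
// illustrative, not the actual lexical/fast_float API.

/// big *= n, for a single native limb n.
fn small_mul(limbs: &mut Vec<u64>, n: u64) {
    let mut carry: u64 = 0;
    for limb in limbs.iter_mut() {
        let wide = (*limb as u128) * (n as u128) + (carry as u128);
        *limb = wide as u64;
        carry = (wide >> 64) as u64;
    }
    if carry != 0 {
        limbs.push(carry);
    }
}

/// big <<= n bits.
fn shl(limbs: &mut Vec<u64>, n: usize) {
    let (limb_shift, bit_shift) = (n / 64, n % 64);
    if bit_shift != 0 {
        let mut carry = 0u64;
        for limb in limbs.iter_mut() {
            let next = *limb >> (64 - bit_shift);
            *limb = (*limb << bit_shift) | carry;
            carry = next;
        }
        if carry != 0 {
            limbs.push(carry);
        }
    }
    // Whole-limb shifts just prepend zero limbs.
    for _ in 0..limb_shift {
        limbs.insert(0, 0);
    }
}

/// big *= 10^exp, computed as big *= 5^exp followed by big <<= exp.
fn pow10(limbs: &mut Vec<u64>, exp: u32) {
    const POW5_27: u64 = 7_450_580_596_923_828_125; // 5^27, largest 5^N in a u64
    let mut e = exp;
    while e >= 27 {
        small_mul(limbs, POW5_27);
        e -= 27;
    }
    if e > 0 {
        small_mul(limbs, 5u64.pow(e));
    }
    shl(limbs, exp as usize);
}
```

The real implementation would instead multiply by one large multi-limb power of 5 (grade-school multiplication) plus a small table lookup, cutting the number of multiplications further.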
Negative Exponent Implementation
The negative exponent implementation is likewise simple, but a little different: we need our big mantissa and exponent from before, but we also need the extended float prior to rounding. In order to make this work with the existing Lemire algorithm, I simply add `i16::MIN` (`-2^15`) rather than set the exponent to `-1`, so the real extended float can be passed over. This requires only trivial modifications to existing code, and has no impact on the performance of the faster algorithms.

First, we create a big integer representing `b+h` (the halfway point), so we can determine if we round to `b+u` or to `b`. This is very simple. Next, we calculate both the numerator and denominator of a ratio representing the float. Finally, we compare these two approximations and determine whether to round up or down.
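A simplified sketch of that comparison, using `u128` in place of a real big integer to keep it short (all names are illustrative, and real inputs would overflow this; the structure, not the types, is the point):

```rust
// Sketch: decide rounding for value = mant * 10^exp10 (exp10 < 0)
// against a candidate float b = bmant * 2^bexp, by comparing the exact
// value with the halfway point b+h. u128 stands in for a big integer.

use std::cmp::Ordering;

fn round_nearest(mant: u128, exp10: i32, bmant: u64, bexp: i32) -> Ordering {
    // Halfway point b+h = (2*bmant + 1) * 2^(bexp - 1).
    let half = 2 * (bmant as u128) + 1;
    let hexp = bexp - 1;
    // Compare mant * 5^exp10 * 2^exp10 against half * 2^hexp, i.e.
    // mant against half * 5^(-exp10) * 2^(hexp - exp10), scaling
    // whichever side has the smaller power of 2.
    let mut num = mant;
    let mut den = half * 5u128.pow((-exp10) as u32);
    let shift = hexp - exp10;
    if shift >= 0 {
        den <<= shift;
    } else {
        num <<= -shift;
    }
    // Greater => round up to b+u; Less => round down to b;
    // Equal => round to even.
    num.cmp(&den)
}
```

For example, with `b = 0.5` (`bmant = 1`, `bexp = -1`), the halfway point is 0.75, so `0.7 = 7e-1` rounds down and `0.8 = 8e-1` rounds up.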
Specifics
First, we use 64-bit limbs on platforms where native 128-bit multiplication is supported, or where 64-bit multiplication can extract both the high and the low bits efficiently. In practice, this means practically every 64-bit architecture except for SPARCv8 and SPARCv9 uses 64-bit limbs. This is detected by compiling the code with `gcc main.c -c -S -O3 -masm=intel`. If the compiled code has `call __multi3`, then the 128-bit multiplication is emulated by GCC; otherwise, there is native platform support.

Architectures with 128-bit support include:

- x86_64 (via `MUL`).
- MIPS64 (via `DMULTU`, from which `HI` and `LO` can be read).
- s390x (via `MLGR`).

And architectures where 64-bit limbs are more efficient than 32-bit limbs are:

- ARM64 (via `UMULH` and `MUL` to capture high and low bits).
- PowerPC64 (via `MULHDU` and `MULLD` to capture high and low bits).
- RISC-V 64-bit (via `MUL` and `MULH` to capture high and low bits).

Caveats
This might be slower on 32-bit architectures or those without native 64-bit multiplication support. However, most modern architectures have native 64-bit or 128-bit multiplication.
Proof of Concept Code
A working implementation of the big-integer primitives can be found here. The slow path algorithms can be found here, for `positive_digit_comp` and `negative_digit_comp`. Note that this exact implementation hasn't been fully tested, but it is a minimal fork of existing lexical, so comparable code has been battle-tested through nearly 3 years of use in production by millions of users.

Benchmarks
The benchmarks were run using the `fast-float-rust` implementation integrated into the Rust standard library, on values meant to represent floats that use `positive_digit_comp` or `negative_digit_comp`. In the microbenchmark results, the big-integer algorithm outperforms the decimal implementation in every case, and performs exceptionally well in a few notable edge cases. This was tested on x86_64, so benchmarks on other architectures may be required (specifically, 32-bit architectures and ARM64, which may make slightly less efficient use of 64-bit limbs).
License
I own all the copyright to the aforementioned code, and am happy to provide it, in a PR, under any license terms, including public domain. So, no licensing issues exist.