Improve Performance in Float Parsing for Decimal Slow-Path Algorithm #88656
Labels
A-floating-point
Area: Floating point numbers and arithmetic
C-bug
Category: This is a bug.
T-libs
Relevant to the library team, which will review and decide on the PR/issue.
Issue
After updating the Rust float parsing algorithms to improve performance, I've noticed in my own code that the slow-path algorithms can still be optimized by using a limited subset of big-integer arithmetic. The implementation is easy to understand, safe, requires little code, and is at least 6x faster than the existing decimal implementation. The new code is extremely efficient, and out-performs all other float parsing implementations for up to 6400 digits (and very likely higher), often by 10x or more. It even beats glibc's implementation in all cases; glibc, although slow for common cases, is very efficient in its slow-path algorithm, which last I checked uses GMP under the hood, making it a good performance baseline.
Detailed Description
In order to accurately parse floats, in near-halfway cases (described in depth in the PR referenced above), we need to use a large integer representation of our significant digits, since a truncated representation has rounding error that is ambiguous with a halfway point, obscuring rounding.
The existing implementation uses a fixed-width decimal type, an approach with notable drawbacks in both storage requirements and performance.
An improvement is to use the fact that the Eisel-Lemire algorithm guarantees we are within 1 ULP of the correct result, and therefore use big-integer arithmetic only to determine how to round. This is specialized for two cases: when the exponent is positive relative to the significant digits, and when it is negative.
The fix is relatively simple, and is described in depth in a PR to the C++ implementation on which Rust's dec2flt is based.
This is quite easy, since we only need a stack-allocated vector with enough bits to hold 10^(769 + 342), or ~4000 bits, where 769 is the maximum number of significant digits required to accurately round a float (with an extra 2), and 342 is the largest magnitude of an exponent (for the smallest denormal float, 5e-342). Our vector-like class only needs a few methods. Since the vector will only contain integer types, we shouldn't need to worry about dropping them manually if using MaybeUninit. A reference implementation can be seen here.

The only real operations we need are addition and multiplication of a scalar value with a big integer, and addition and multiplication between two big integers; everything else is derived trivially. We can also use simple algorithms: for the number of limbs we use, asymptotically faster algorithms like Karatsuba multiplication are actually slower. This means we can use grade-school multiplication, which is easily shown to be correct.
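As a concrete illustration, a minimal fixed-capacity big integer with these operations might look like the following sketch. The names (`Bigint`, `add_small`, `mul_small`) are hypothetical, not those of any existing implementation, and for simplicity it zero-initializes its storage rather than using `MaybeUninit`:

```rust
// Hypothetical sketch, not the std implementation: a fixed-capacity,
// little-endian big integer supporting the operations listed above.
// 64 limbs x 64 bits = 4096 bits, enough for 10^(769 + 342).
const MAX_LIMBS: usize = 64;

#[derive(Clone)]
pub struct Bigint {
    data: [u64; MAX_LIMBS],
    len: usize, // number of significant limbs (little-endian order)
}

impl Bigint {
    pub fn from_u64(x: u64) -> Self {
        let mut data = [0u64; MAX_LIMBS];
        data[0] = x;
        Bigint { data, len: if x == 0 { 0 } else { 1 } }
    }

    // Interpret the low two limbs as a u128 (enough for the checks below).
    pub fn to_u128(&self) -> u128 {
        ((self.data[1] as u128) << 64) | (self.data[0] as u128)
    }

    // Scalar addition: add a small value, propagating the carry upward.
    pub fn add_small(&mut self, y: u64) {
        let mut carry = y;
        let mut i = 0;
        while carry != 0 {
            if i == self.len {
                self.data[i] = 0;
                self.len += 1;
            }
            let (sum, overflowed) = self.data[i].overflowing_add(carry);
            self.data[i] = sum;
            carry = overflowed as u64;
            i += 1;
        }
    }

    // Scalar multiplication, using a 128-bit intermediate for the carry.
    pub fn mul_small(&mut self, y: u64) {
        let mut carry: u64 = 0;
        for i in 0..self.len {
            let p = self.data[i] as u128 * y as u128 + carry as u128;
            self.data[i] = p as u64;
            carry = (p >> 64) as u64;
        }
        if carry != 0 {
            self.data[self.len] = carry;
            self.len += 1;
        }
    }

    // Grade-school multiplication: O(m * n), but faster than Karatsuba
    // at the small limb counts this algorithm ever uses.
    pub fn mul(&self, other: &Bigint) -> Bigint {
        let mut out = Bigint::from_u64(0);
        for i in 0..self.len {
            let mut carry: u64 = 0;
            for j in 0..other.len {
                let p = self.data[i] as u128 * other.data[j] as u128
                    + out.data[i + j] as u128
                    + carry as u128;
                out.data[i + j] = p as u64;
                carry = (p >> 64) as u64;
            }
            out.data[i + other.len] = carry;
        }
        // Normalize: drop most-significant zero limbs.
        out.len = self.len + other.len;
        while out.len > 0 && out.data[out.len - 1] == 0 {
            out.len -= 1;
        }
        out
    }
}

fn main() {
    let mut a = Bigint::from_u64(u64::MAX);
    a.add_small(1); // 2^64: carries into a second limb
    assert_eq!(a.to_u128(), 1u128 << 64);

    let c = Bigint::from_u64(u64::MAX).mul(&Bigint::from_u64(u64::MAX));
    assert_eq!(c.to_u128(), u64::MAX as u128 * u64::MAX as u128);
}
```

Note the grade-school inner loop accumulates into `out.data[i + j]` with a single carry chain, which is easy to verify by hand against the pencil-and-paper algorithm.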
We then need to make sure the limbs are normalized, which means that any most-significant limbs with zero values are removed. In little-endian limb order, this is very easy since we just pop from the end of the vector. We then need algorithms to exponentiate by a power of 2, 5, and 10.
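The normalization step just described is trivial in code — a sketch, using a plain `Vec<u64>` of little-endian limbs as a stand-in for the stack vector:

```rust
// Remove most-significant limbs that are zero. With little-endian limb
// order the most-significant limb is the last element, so we just pop.
fn normalize(limbs: &mut Vec<u64>) {
    while limbs.last() == Some(&0) {
        limbs.pop();
    }
}

fn main() {
    let mut x = vec![1u64, 2, 0, 0];
    normalize(&mut x);
    assert_eq!(x, vec![1, 2]);
}
```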
Exponentiating by a power of two is quite easy, since we can merely shift the bits: break the shift into a shift of bits within a limb, and a shift by whole limbs.
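A minimal sketch of that split, assuming little-endian 64-bit limbs in a plain `Vec<u64>` (the names `shl_bits` and `shl_limbs` follow the text; the real implementation operates on the stack-allocated vector described above):

```rust
// Shift within limbs: n must be in 1..64; carry bits flow upward.
fn shl_bits(limbs: &mut Vec<u64>, n: usize) {
    assert!(n > 0 && n < 64);
    let mut carry = 0u64;
    for limb in limbs.iter_mut() {
        let next = *limb >> (64 - n);
        *limb = (*limb << n) | carry;
        carry = next;
    }
    if carry != 0 {
        limbs.push(carry);
    }
}

// Shift by whole limbs: prepend n zero limbs (little-endian order).
// O(n^2) insertion is fine for a sketch; real code would shift in place.
fn shl_limbs(limbs: &mut Vec<u64>, n: usize) {
    if !limbs.is_empty() {
        for _ in 0..n {
            limbs.insert(0, 0);
        }
    }
}

// Full shift: split into a bit shift and a limb shift.
fn shl(limbs: &mut Vec<u64>, n: usize) {
    let (limb_n, bit_n) = (n / 64, n % 64);
    if bit_n != 0 {
        shl_bits(limbs, bit_n);
    }
    if limb_n != 0 {
        shl_limbs(limbs, limb_n);
    }
}

fn main() {
    let mut x = vec![1u64];
    shl(&mut x, 65); // 1 << 65 == 2^65
    assert_eq!(x, vec![0, 2]);
}
```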
The implementation of both shl_bits and shl_limbs is shown in the reference implementation below. To exponentiate by a power of 5, we can do so iteratively by multiplying by powers of 5. To make this efficient, we use one pre-computed multi-limb power of 5, 5^135, chosen because its exponent is 5 times that of the largest power of 5 that fits in a native limb, 5^27, and because it showed good performance in benchmarks. This requires minimal static storage (smaller than the decimal class's requirements), is intuitive (and therefore easy to validate), and is performant.
Finally, we can use the fact that 10^N = 5^N * 2^N, so we can implement efficient multiplication by a power of 10 as multiplication by a power of 5 followed by a bitshift. A reference implementation can be seen here.
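A sketch of this scaling, using the same `Vec<u64>` limb representation. The pre-computed multi-limb 5^135 table is elided here, so `pow5` only steps by 5^27 (the largest power of 5 that fits in a 64-bit limb) — correct, but taking more multiplications than the real algorithm:

```rust
// Sketch of power-of-5 and power-of-10 scaling on little-endian limbs.
const SMALL_POW5: u64 = 7_450_580_596_923_828_125; // 5^27

// Multiply the big integer by a scalar, pushing any final carry.
fn mul_small(limbs: &mut Vec<u64>, y: u64) {
    let mut carry: u64 = 0;
    for limb in limbs.iter_mut() {
        let p = *limb as u128 * y as u128 + carry as u128;
        *limb = p as u64;
        carry = (p >> 64) as u64;
    }
    if carry != 0 {
        limbs.push(carry);
    }
}

// Shift left by n bits: a within-limb shift plus whole-limb prepends.
fn shl(limbs: &mut Vec<u64>, n: usize) {
    let (limb_n, bit_n) = (n / 64, n % 64);
    if bit_n != 0 {
        let mut carry = 0u64;
        for limb in limbs.iter_mut() {
            let next = *limb >> (64 - bit_n);
            *limb = (*limb << bit_n) | carry;
            carry = next;
        }
        if carry != 0 {
            limbs.push(carry);
        }
    }
    for _ in 0..limb_n {
        limbs.insert(0, 0);
    }
}

// Multiply by 5^exp iteratively, in steps of 5^27.
fn pow5(limbs: &mut Vec<u64>, mut exp: u32) {
    while exp >= 27 {
        mul_small(limbs, SMALL_POW5);
        exp -= 27;
    }
    if exp != 0 {
        mul_small(limbs, 5u64.pow(exp));
    }
}

// 10^N = 5^N * 2^N: multiply by 5^N, then shift left by N bits.
fn pow10(limbs: &mut Vec<u64>, exp: u32) {
    pow5(limbs, exp);
    shl(limbs, exp as usize);
}

fn main() {
    let mut x = vec![1u64];
    pow10(&mut x, 30); // 10^30 spans two 64-bit limbs
    let value = ((x[1] as u128) << 64) | x[0] as u128;
    assert_eq!(value, 10u128.pow(30));
}
```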
Note: Our algorithm to parse the significant digits as a big-integer is uninteresting, but can be seen here. It will not be described in depth.
On 64-bit architectures, almost all tested systems support efficient 128-bit multiplication, with the exception of SPARC (v8 and v9). The code used to ensure this can be found here. Tested systems known to explicitly support 128-bit multiplication include x86_64, MIPS64, and s390x; platforms supporting 64-bit high and low multiplication include ARM64, PPC64, and RISC-V64. On all 32-bit systems and on SPARC we use 32-bit limbs; otherwise, we use 64-bit limbs.
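A hedged sketch of the limb-width selection; the `cfg` conditions below are illustrative and deliberately abbreviated, not the exact platform list:

```rust
// Use 64-bit limbs where widening 64x64 -> 128-bit multiplication is
// efficient, and 32-bit limbs on 32-bit targets and SPARC.
// Illustrative only: the real predicate enumerates more architectures.
#[cfg(all(target_pointer_width = "64", not(target_arch = "sparc64")))]
type Limb = u64;
#[cfg(any(not(target_pointer_width = "64"), target_arch = "sparc64"))]
type Limb = u32;

const LIMB_BITS: u32 = Limb::BITS;

fn main() {
    // Exactly one of the cfg branches is active for any target.
    assert!(LIMB_BITS == 32 || LIMB_BITS == 64);
}
```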
Positive Digit Comparison
This is quite easy: if we have a float like 12345e300, we can break it down as 12345 * 10^300, which means we can create a big-integer representation of the value and merely extract the most-significant 64 bits. In the case of an exact tie with a halfway point, we can check if any of the truncated bits are non-zero, and use that to direct round-nearest, ties-even rounding. A reference implementation can be found here.
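The extraction can be illustrated with `u128` standing in for the big integer (`round_to_64` is a hypothetical helper name; the real code scans the limb vector instead):

```rust
// Take the most-significant 64 bits of an exact value, and report whether
// any truncated bits were non-zero (the "sticky" information used to
// disambiguate exact halfway cases for round-nearest, ties-even).
fn round_to_64(exact: u128) -> (u64, bool) {
    let bits = 128 - exact.leading_zeros() as usize;
    if bits <= 64 {
        return (exact as u64, false); // already exact, nothing truncated
    }
    let shift = bits - 64;
    let top = (exact >> shift) as u64;
    let sticky = exact & ((1u128 << shift) - 1) != 0;
    (top, sticky)
}

fn main() {
    // Small values are exact: no bits truncated.
    assert_eq!(round_to_64(12345), (12345, false));
    // 2^70 + 1: the low set bit is truncated, so the result is inexact.
    assert_eq!(round_to_64((1u128 << 70) | 1), (1u64 << 63, true));
}
```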
Negative Digit Comparison
This is slightly trickier, but also quite simple: say we have a float like 12345e-300. We can break it down as 12345 / 10^300. However, dividing two big integers is not ideal, since big-integer division is slow. A much faster approach is to create a theoretical representation of the digits of the halfway point, and then compare it to our actual digits. For this, I'll use the following terminology: b is the float rounded down, or below the halfway point; b+u is the next positive float larger than b; and b+h is exactly halfway between them (it cannot be represented natively).

To create the theoretical representation, we can first round down the extended-precision float calculated from the Eisel-Lemire algorithm to b, and convert it to an extended representation. This is shown here. Next, we want to scale both the theoretical digits and the real significant digits to the same exponent, using only multiplication. We therefore have theor_digits * 2^X and real_digits * 10^Y, where Y < 0. We can then compare theor_digits * 2^X cmp real_digits * 10^Y, which we can re-arrange (multiplying both sides by 10^-Y, where -Y > 0) as theor_digits * 2^X * 10^-Y cmp real_digits, and break down as theor_digits * 2^(X-Y) * 5^-Y cmp real_digits
. In practice, this is quite simple to implement, and quite efficient. Now, we just need to compare the theoretical and real digits, and direct our rounding for fp (our extended-precision float) accordingly. For this to work, the Eisel-Lemire algorithm must return its result rounded down to b in the ambiguous case, rather than a generic error. This is quite simple, since we already have an extended-precision representation that is within the interval [b, b+u]. We must add a marker to ensure we know when this occurs (we use a biased exponent less than 0, additionally biased by i16::MIN for this purpose), and ensure we have an accurate representation of our significant digits. To do this, we add two functions, which are just small extensions of the existing algorithm.
Then, if we are not within a safe exponent range (the range [-27, 55]), we return a scaled error from our already-scaled significant digits. Likewise, if our significant digits were truncated and the algorithm for mantissa + 1 gives us a different result than the original pass, we compute the non-scaled error. Although we could technically make this more efficient, it's extremely cheap computationally relative to big-integer arithmetic, it's easy to justify logically, and it avoids any performance regression for common cases.
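Returning to the negative digit comparison, the scaling step can be sketched with `u128` standing in for the big integers (`compare_halfway` is a hypothetical name; in the real code the 5^-Y multiplication and the 2^(X-Y) shift are applied to the limb vector, since the scaled values overflow any native type):

```rust
use std::cmp::Ordering;

// Compare theor_digits * 2^X with real_digits * 10^Y for Y < 0, using only
// multiplication: scale the theoretical side by 5^-Y and 2^(X-Y).
fn compare_halfway(theor_digits: u128, x: i32, real_digits: u128, y: i32) -> Ordering {
    assert!(y < 0 && x >= y);
    let scaled = (theor_digits * 5u128.pow((-y) as u32)) << ((x - y) as u32);
    scaled.cmp(&real_digits)
}

fn main() {
    // 1 * 2^0 = 1  vs  5 * 10^-1 = 0.5: the theoretical value is larger.
    assert_eq!(compare_halfway(1, 0, 5, -1), Ordering::Greater);
    // 1 * 2^-1 = 0.5  vs  5 * 10^-1 = 0.5: exactly equal.
    assert_eq!(compare_halfway(1, -1, 5, -1), Ordering::Equal);
}
```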
A reference implementation can be seen here.
Pull Request
If there is interest, I will gladly submit a PR. I own all of the aforementioned code, including the algorithms involved, so there are no licensing issues whatsoever.