spurious overflow in Rational-to-Float16 conversion #52394

stevengj · 2023-12-04T22:43:39Z

It would be nice if this gave a correctly rounded result rather than overflowing:

julia> Float16(10^9 // (10^9 + 1)) # spurious overflow
NaN16

julia> Float16(Float64(10^9 // (10^9 + 1))) # correctly rounded
Float16(1.0)

It's not specific to Float16 arithmetic, in principle, but that's where it shows up most easily since Float32 and Float64 cannot be overflowed by Int.
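
The NaN is consistent with the numerator and denominator each being rounded to Float16 before the division. That mechanism is an assumption here, not something stated in this thread. The sketch below reproduces the arithmetic by hand and does not call the Rational conversion itself.

```julia
num, den = 10^9, 10^9 + 1

# Both integers exceed floatmax(Float16) == 65504, so each rounds to Inf16.
a = Float16(num)
b = Float16(den)

# Inf16 / Inf16 is NaN16, matching the spurious result reported above.
narrow = a / b

# Dividing in Float64 keeps the quotient finite; rounding it to Float16 gives 1.0.
wide = Float16(Float64(num) / Float64(den))
```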

(I ran into this when implementing a recursive pairwise mean; see the comment in #52365.)


stevengj · 2023-12-04T22:47:34Z

A simple workaround would be to define something like:

Float16(x::Rational{<:Union{Int16,Int32,Int64,Int128,UInt16,UInt32,UInt64,UInt128}}) = Float16(Float32(x))
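
To try the same idea without adding a method to a Base constructor, the widening step can be wrapped in a standalone function. The name `float16_via_float32` is hypothetical and only for illustration. Because the value is rounded once to Float32 and again to Float16, this should be read as a sketch of the workaround, not as a correctly rounded conversion in every case.

```julia
# Hypothetical helper (not part of Base): widen to Float32 first, then narrow.
# Float32 can hold the magnitude of any Int64, so the intermediate stays finite here.
float16_via_float32(x::Rational) = Float16(Float32(x))

x = 10^9 // (10^9 + 1)
y = float16_via_float32(x)  # Float16(1.0)
```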

nsajko · 2023-12-08T18:58:11Z

#49749 is supposed to fix this as far as I remember, but I still haven't gotten around to finishing it.

andrewjradcliffe · 2023-12-13T02:13:33Z

> It's not specific to Float16 arithmetic, in principle, but that's where it shows up most easily since Float32 and Float64 cannot be overflowed by Int.

Indeed, one has to go out of one's way to cause it with Float32, but the same problem exists for any Rational{UInt128} where denominator > numerator and numerator >= 2^(emax + 1) - 2^(emax - precision). That bound is the largest finite floating-point number, 2^(emax + 1) - 2^(emax + 1 - precision), plus half a unit in the last place, which in the last binade is 2^(emax - precision).

Float32 has a maximum exponent of 127 and a precision of 24, so the condition is numerator >= 2^128 - 2^103.

# unsigned wraparound makes this succinct. Or use `BigInt` and convert to `UInt128`.
Float32(-(UInt128(1) << 103) // (-(UInt128(1) << 103) + 1))

# Naturally, this only applies to numerators that satisfy the above condition
# after reduction to smallest terms.
Float32(-(UInt128(1) << 103) // (-(UInt128(1) << 103) + 2))

One should leave out Int16 from the blanket definition for Float16 conversion. Float16's maximum exponent of 15 and precision of 11 imply numerator >= 2^16 - 2^4, which is greater than the magnitude of any Int16.
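
The bounds quoted above can be checked with exact integer arithmetic. The following is a minimal sketch; `overflow_threshold` is a hypothetical helper name, and it assumes `exponent(floatmax(T))` and `precision(T)` give the emax and precision used in the formula.

```julia
# Hypothetical helper: smallest numerator that rounds to Inf in type T,
# i.e. 2^(emax + 1) - 2^(emax - precision), computed exactly with BigInt.
function overflow_threshold(::Type{T}) where {T<:Union{Float16,Float32,Float64}}
    emax = exponent(floatmax(T))
    return big(2)^(emax + 1) - big(2)^(emax - precision(T))
end

t16 = overflow_threshold(Float16)  # 65520
t32 = overflow_threshold(Float32)  # equals 2^128 - 2^103

# Int16 cannot reach the Float16 threshold, but UInt16 can.
int16_affected = typemax(Int16) >= t16    # false
uint16_affected = typemax(UInt16) >= t16  # true

# The two UInt128 examples above: only the first keeps a numerator at the
# threshold after reduction to lowest terms.
x1 = -(UInt128(1) << 103) // (-(UInt128(1) << 103) + 1)
x2 = -(UInt128(1) << 103) // (-(UInt128(1) << 103) + 2)
x1_affected = numerator(x1) >= t32  # true
x2_affected = numerator(x2) >= t32  # false
```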


Constructing a floating-point number from a `Rational` should now be correctly rounded.

Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough.
2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings.

Updates JuliaLang#45213
Updates JuliaLang#50940
Updates JuliaLang#52507
Fixes JuliaLang#52394
Closes JuliaLang#52395
Fixes JuliaLang#52859
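
The two-step approach described above can be illustrated roughly as follows. This is a simplified sketch and not the code from the pull request: the function name `rational_to_float` is hypothetical, it uses `BigInt` instead of a carefully chosen fixed-width integer type, it supports only round-to-nearest with ties to even, it assumes a nonzero denominator, and results in the subnormal range may be rounded a second time by `ldexp`.

```julia
# Hypothetical helper, for illustration only.
function rational_to_float(::Type{T}, x::Rational) where {T<:Union{Float16,Float32,Float64}}
    iszero(x) && return zero(T)
    p = precision(T)
    n = abs(big(numerator(x)))
    d = big(denominator(x))

    # Step 1: integer arithmetic only. Scale by 2^-e so that the quotient keeps
    # p + 1 or p + 2 bits: the p-bit significand plus one or two guard bits.
    e = ndigits(n, base=2) - ndigits(d, base=2) - (p + 1)
    q, r = e < 0 ? divrem(n << -e, d) : divrem(n, d << e)
    extra = ndigits(q, base=2) - p        # number of guard bits (1 or 2)
    sig = q >> extra                      # truncated p-bit significand
    low = q & ((big(1) << extra) - 1)     # the discarded guard bits
    half = big(1) << (extra - 1)
    sticky = !iszero(r)                   # nonzero remainder below the guard bits
    if low > half || (low == half && (sticky || isodd(sig)))
        sig += 1                          # round to nearest, ties to even
    end

    # Step 2: build the float from the (sign, significand, exponent) triplet.
    # `sig` is at most 2^p, so it converts to T exactly; `ldexp` handles overflow.
    y = ldexp(T(Int64(sig)), e + extra)
    return x < 0 ? -y : y
end

r16 = rational_to_float(Float16, 10^9 // (10^9 + 1))  # Float16(1.0)
```

Because the quotient is formed before any rounding to the target type, neither the numerator nor the denominator has to fit in Float16 on its own, which is the property the original report asks for.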

vtjnash · 2024-02-07T02:16:25Z

Will this help with #36423? Would you be willing to take a look at #51944 as well and see if that is okay to merge? Seems like we have had a few Rational-related PRs piling up without a confident review.

Fixes #52394. Also fixes `Float32` for `UInt128`, since currently `Float32((typemax(UInt128)-0x01) // typemax(UInt128))` gives `NaN32`.


stevengj added the maths and float16 labels Dec 4, 2023

stevengj changed the title from "spurious overflow in Rational-to-float conversion" to "spurious overflow in Rational-to-Float16 conversion" Dec 4, 2023

stevengj added the bug label Dec 5, 2023

stevengj mentioned this issue Dec 5, 2023

fix spurious overflow for Float16(::Rational) #52395

Merged

giordano added the rationals label Dec 5, 2023

nsajko mentioned this issue Jan 13, 2024

Base: correctly rounded floats constructed from rationals #49749

Open

vtjnash closed this as completed in #52395 Feb 7, 2024

vtjnash pushed a commit that referenced this issue Feb 7, 2024

fix spurious overflow for Float16(::Rational) (#52395)

bead1d3

