Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spurious overflow in Rational-to-Float16 conversion #52394

Closed
stevengj opened this issue Dec 4, 2023 · 4 comments · Fixed by #52395 · May be fixed by #49749
Closed

spurious overflow in Rational-to-Float16 conversion #52394

stevengj opened this issue Dec 4, 2023 · 4 comments · Fixed by #52395 · May be fixed by #49749
Labels
bug Indicates an unexpected problem or unintended behavior float16 maths Mathematical functions rationals The Rational type and values thereof

Comments

@stevengj
Copy link
Member

stevengj commented Dec 4, 2023

It would be nice if this gave a correctly rounded result rather than overflowing:

julia> Float16(10^9 // (10^9 + 1)) # spurious overflow
NaN16

julia> Float16(Float64(10^9 // (10^9 + 1))) # correctly rounded
Float16(1.0)

It's not specific to Float16 arithmetic, in principle, but that's where it shows up most easily since Float32 and Float64 cannot be overflowed by Int.

(I ran into this when implementing a recursive pairwise mean in #52365 (comment))

@stevengj stevengj added maths Mathematical functions float16 labels Dec 4, 2023
@stevengj stevengj changed the title spurious overflow in Rational-to-float conversion spurious overflow in Rational-to-Float16 conversion Dec 4, 2023
@stevengj
Copy link
Member Author

stevengj commented Dec 4, 2023

A simple workaround would be to define something like:

Float16(x::Rational{<:Union{Int16,Int32,Int64,Int128,UInt16,UInt32,UInt64,UInt128}}) = Float16(Float32(x))

@stevengj stevengj added the bug Indicates an unexpected problem or unintended behavior label Dec 5, 2023
@giordano giordano added the rationals The Rational type and values thereof label Dec 5, 2023
@nsajko
Copy link
Contributor

nsajko commented Dec 8, 2023

#49749 is supposed to fix this as far as I remember, but I still haven't gotten around to finishing it.

@andrewjradcliffe
Copy link
Contributor

andrewjradcliffe commented Dec 13, 2023

It's not specific to Float16 arithmetic, in principle, but that's where it shows up most easily since Float32 and Float64 cannot be overflowed by Int.

Indeed, one has to go out of their way to cause it with Float32, but the same problem exists for any Rational{UInt128} where denominator > numerator and numerator >= 2^(emax + 1) - 2^(emax - precision), i.e. numerator >= the largest finite floating point number (2^(emax + 1) - 2^(emax + 1 - precision)) plus half a unit in the last place (which in the last binade is 2^(emax - precision)).

Float32 has a maximum exponent of 127 and precision of 24, thus, the condition is numerator >= 2^128 - 2^103.

# unsigned wraparound makes this succinct. Or use `BigInt` and convert to `UInt128`.
Float32(-(UInt128(1) << 103) // (-(UInt128(1) << 103) + 1))

# Naturally, this only applies to numerators that satisfy the above condition
# after reduction to smallest terms.
Float32(-(UInt128(1) << 103) // (-(UInt128(1) << 103) + 2))

One should leave out Int16 from the blanket definition for Float16 conversion. Float16's maximum exponent of 15 and precision of 11 imply numerator >= 2^16 - 2^4, which is greater than the magnitude of any Int16.

nsajko added a commit to nsajko/julia that referenced this issue Jan 13, 2024
Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
nsajko added a commit to nsajko/julia that referenced this issue Jan 14, 2024
Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
nsajko added a commit to nsajko/julia that referenced this issue Jan 14, 2024
Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
nsajko added a commit to nsajko/julia that referenced this issue Jan 14, 2024
Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
nsajko added a commit to nsajko/julia that referenced this issue Jan 14, 2024
Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
nsajko added a commit to nsajko/julia that referenced this issue Jan 15, 2024
Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
nsajko added a commit to nsajko/julia that referenced this issue Jan 15, 2024
Constructing a floating-point number from a `Rational` should now be
correctly rounded.

Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
nsajko added a commit to nsajko/julia that referenced this issue Jan 15, 2024
Constructing a floating-point number from a `Rational` should now be
correctly rounded.

Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
@vtjnash
Copy link
Member

vtjnash commented Feb 7, 2024

Will this help with #36423? Would you be willing to take a look at #51944 as well and see if that is okay to merge? Seems like we have had a few Rational-related PRs piling up without a confident review

vtjnash pushed a commit that referenced this issue Feb 7, 2024
Fixes #52394.

Also fixes `Float32` for `UInt128`, since currently
`Float32((typemax(UInt128)-0x01) // typemax(UInt128))` gives `Nan32`.
nsajko added a commit to nsajko/julia that referenced this issue May 12, 2024
Constructing a floating-point number from a `Rational` should now be
correctly rounded.

Implementation approach:

1. Convert the (numerator, denominator) pair to a (sign bit, integral
   significand, exponent) triplet using integer arithmetic. The integer
   type in question must be wide enough.

2. Convert the above triplet into an instance of the chosen FP type.
   There is special support for IEEE 754 floating-point and for
   `BigFloat`, otherwise a fallback using `ldexp` is used.

As a bonus, constructing a `BigFloat` from a `Rational` should now be
thread-safe when the rounding mode and precision are provided to the
constructor, because there is no access to the global precision or
rounding mode settings.

Updates JuliaLang#45213

Updates JuliaLang#50940

Updates JuliaLang#52507

Fixes JuliaLang#52394

Closes JuliaLang#52395

Fixes JuliaLang#52859
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior float16 maths Mathematical functions rationals The Rational type and values thereof
Projects
None yet
5 participants