Avoid powi underflow in ratio_to_f64 #94

Merged
bors merged 1 commit into rust-num:master from to_f64-underflow
Jun 23, 2022

Conversation

cuviper
Member

@cuviper cuviper commented Mar 10, 2021

Fixes #91.

@cuviper
Member Author

cuviper commented Mar 10, 2021

I think this may still have double-rounding, but I'm not sure how to avoid that. Perhaps we need to construct the float manually with `from_bits`, but that will need to account for a lot of corner cases. At least this PR is better than the status quo...

@MattX
Contributor

MattX commented Apr 8, 2022

Yeah, other implementations of this algorithm use `ldexp` to avoid this issue. It's in libm, but I don't think Rust's stdlib provides it. Maybe worth implementing it here?

bors bot added a commit that referenced this pull request Jun 23, 2022
104: Implement ldexp and use it in ratio_to_f64 r=cuviper a=MattX

This eliminates errors where we attempt a multiplication by 2^-x, but 2^-x itself underflows to zero even though the final product is representable (#91).
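To make the failure mode concrete, here is a minimal sketch. The function names `scale_naive` and `scale_split` are hypothetical and are not taken from the crate or from this PR. It assumes the conversion ends with a step of the form `mantissa * 2^-shift`. When `shift` exceeds 1074, `2^-shift` alone rounds to zero even though the product may still be a representable (subnormal) `f64`. Splitting the scaling into two factors sidesteps the underflow of the factor itself, though in general it can still round twice.

```rust
/// Hypothetical naive scaling: computes 2^-shift on its own, then multiplies.
/// For shift > 1074 the factor 2^-shift rounds to 0.0, so the result is 0.0
/// even when `mantissa * 2^-shift` would be representable.
pub fn scale_naive(mantissa: f64, shift: i32) -> f64 {
    mantissa * 2f64.powi(-shift)
}

/// Hypothetical workaround: apply the scaling in two halves so that neither
/// power of two underflows by itself. If the intermediate product becomes
/// subnormal, this may round twice; an `ldexp`-style function avoids that.
pub fn scale_split(mantissa: f64, shift: i32) -> f64 {
    let half = shift / 2;
    mantissa * 2f64.powi(-half) * 2f64.powi(-(shift - half))
}
```

For example, with a mantissa of 2^60 and a shift of 1100, the exact result is 2^-1040, which is a subnormal `f64`. The naive version returns zero, while the split version preserves the value.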

This is based on the code in #94.

The `ldexp` implementation strategy is mostly to extract the exponent, modify it, then reinsert it. To avoid dealing with subnormal numbers at the bit level, we shift any subnormal numbers into normal range before that operation.
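The strategy described above can be sketched roughly as follows. This is an illustration written for this discussion, not the code that was merged. The name `ldexp_sketch` and the choice of 2^54 as the pre-scaling factor are assumptions. It extracts the biased exponent, adds the requested power of two, and writes it back. Subnormal inputs are first scaled into the normal range, and subnormal results are produced by a single final multiplication so that rounding happens only once.

```rust
/// Illustrative sketch of `ldexp(x, exp)`, i.e. `x * 2^exp`, for `f64`.
pub fn ldexp_sketch(x: f64, exp: i32) -> f64 {
    const EXP_MASK: u64 = 0x7ff; // 11-bit biased exponent field
    const MANT_BITS: u32 = 52; // explicit mantissa bits

    // Zero, infinities and NaN are unchanged by scaling.
    if x == 0.0 || !x.is_finite() {
        return x;
    }

    let two_54 = f64::from_bits(0x4350_0000_0000_0000); // 2^54
    let two_neg_54 = f64::from_bits(0x3c90_0000_0000_0000); // 2^-54

    // Shift subnormal inputs into the normal range (exact), and compensate
    // in the exponent. i64 arithmetic avoids overflow for extreme `exp`.
    let (x, exp) = if x.abs() < f64::MIN_POSITIVE {
        (x * two_54, i64::from(exp) - 54)
    } else {
        (x, i64::from(exp))
    };

    let bits = x.to_bits();
    let old_exp = ((bits >> MANT_BITS) & EXP_MASK) as i64;
    let new_exp = old_exp + exp;
    // Sign and mantissa bits, with the exponent field cleared.
    let rest = bits & !(EXP_MASK << MANT_BITS);

    if new_exp >= EXP_MASK as i64 {
        // Exponent too large: overflow to a signed infinity.
        return f64::INFINITY.copysign(x);
    }
    if new_exp > 0 {
        // Normal result: reinsert the modified exponent.
        return f64::from_bits(rest | ((new_exp as u64) << MANT_BITS));
    }
    if new_exp <= -54 {
        // Far below the smallest subnormal: underflow to a signed zero.
        return 0.0_f64.copysign(x);
    }
    // Subnormal result: build a normal value 2^54 times too large, then let
    // one multiplication by 2^-54 perform the single rounding step.
    let scaled = f64::from_bits(rest | (((new_exp + 54) as u64) << MANT_BITS));
    scaled * two_neg_54
}
```

With a helper like this, a scaling step such as `mantissa * 2^-shift` can be written as `ldexp_sketch(mantissa, -shift)`, so the power of two is never materialized as a separate float that could underflow.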

Fixes #91.

Co-authored-by: Josh Stone <[email protected]>
Co-authored-by: Matthieu Felix <[email protected]>
@bors bors bot merged commit b69551c into rust-num:master Jun 23, 2022
@cuviper cuviper deleted the to_f64-underflow branch July 21, 2023 22:28
Successfully merging this pull request may close these issues.

Conversion to f64 is lossy for tiny floats
2 participants