Slightly improve accuracy of stod in to_floats #10622

Merged

@davidwendt davidwendt commented Apr 8, 2022

Reference #10599

Provides a slight improvement in accuracy for the internal stod device function used by the cudf::strings::to_floats() API.

Removes one floating-point operation and applies the exponent conditionally: the value is multiplied by the power of ten when the exponent is positive and divided by it when the exponent is negative. This slightly improves the accuracy of the result, since multiplying by decimal fractions (negative powers of ten, which have no exact binary representation) can compound rounding errors.
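The effect of multiplying versus dividing can be illustrated with a small Python sketch. This is not the libcudf `stod` code; the function names `scale_by_multiplying` and `scale_conditionally` are hypothetical, and the input is chosen only to show the principle, not to reproduce the table below.

```python
def scale_by_multiplying(digits: int, exp10: int) -> float:
    """Always multiply; a negative exponent uses an inexact reciprocal."""
    power = float(10 ** abs(exp10))  # exactly representable up to 10**22
    factor = 1.0 / power if exp10 < 0 else power  # 1/10 is already rounded
    return digits * factor  # second rounding compounds the first


def scale_conditionally(digits: int, exp10: int) -> float:
    """Divide for negative exponents, multiply for positive ones."""
    power = float(10 ** abs(exp10))
    # Both operands are exact here, so the result is rounded only once.
    return digits / power if exp10 < 0 else digits * power


print(scale_by_multiplying(3, -1))  # 0.30000000000000004
print(scale_conditionally(3, -1))   # 0.3
```

Dividing keeps both operands exact for small exponents, so only the final result is rounded.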

>>> s = cudf.Series(['1.0','2.0','0.1','0.2','0.3'])
>>> x = cudf.to_numeric(s)
>>> x[0]
1.0        previously 0.9999999999999999
>>> x[1]
2.0        previously 1.9999999999999998
>>> x[2]
0.1        previously 0.09999999999999999
>>> x[3]
0.2        previously 0.19999999999999998
>>> x[4]
0.3        same

The 1.0 floating-point value in bits was 3FEFFFFFFFFFFFFF and now computes to 3FF0000000000000, which is exactly 1.0.
The 0.1 floating-point value in bits was 3FB9999999999999 and now computes to 3FB999999999999A, which prints as 0.10000000000000001 at 17 significant digits. The error is therefore about the same size as for 0.09999999999999999, only in the opposite direction, and both values are within the expected epsilon.

Since the overall error is within the std::numeric_limits<T>::epsilon() error threshold, no tests had to be modified.
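The bit patterns quoted above can be checked from Python with the standard `struct` module. This is a minimal sketch; the helper name `double_bits` is hypothetical, and `sys.float_info.epsilon` is used as the Python counterpart of `std::numeric_limits<double>::epsilon()`.

```python
import struct
import sys


def double_bits(x: float) -> str:
    """Return the IEEE 754 binary64 bit pattern of x as 16 hex digits."""
    return struct.pack(">d", x).hex().upper()


# Old and new results for the "1.0" and "0.1" inputs from the description.
for text in ("0.9999999999999999", "1.0", "0.09999999999999999", "0.1"):
    print(f"{text:>20} -> {double_bits(float(text))}")

# The old and new 0.1 results differ by one unit in the last place,
# which is well below the double-precision epsilon.
print(abs(0.09999999999999999 - 0.1) <= sys.float_info.epsilon)
```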

@davidwendt added the `3 - Ready for Review`, `libcudf`, `improvement`, and `non-breaking` labels on Apr 8, 2022
@davidwendt requested a review from a team as a code owner on Apr 8, 2022, 14:55
@davidwendt self-assigned this on Apr 8, 2022
codecov bot commented Apr 8, 2022

Codecov Report

Merging #10622 (18a565b) into branch-22.06 (956c7b5) will increase coverage by 0.03%.
The diff coverage is 88.97%.

@@               Coverage Diff                @@
##           branch-22.06   #10622      +/-   ##
================================================
+ Coverage         86.30%   86.34%   +0.03%     
================================================
  Files               140      140              
  Lines             22255    22280      +25     
================================================
+ Hits              19207    19237      +30     
+ Misses             3048     3043       -5     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/frame.py | 94.75% <ø> (+1.02%) ⬆️ |
| python/dask_cudf/dask_cudf/tests/test_accessor.py | 98.41% <ø> (ø) |
| python/cudf/cudf/core/indexed_frame.py | 91.77% <87.93%> (-0.87%) ⬇️ |
| python/cudf/cudf/core/column/lists.py | 90.62% <100.00%> (+0.57%) ⬆️ |
| python/cudf/cudf/core/dataframe.py | 93.59% <100.00%> (ø) |
| python/cudf/cudf/core/series.py | 95.28% <100.00%> (-0.01%) ⬇️ |
| python/cudf/cudf/core/column/column.py | 89.45% <0.00%> (+0.10%) ⬆️ |
| python/cudf/cudf/core/column/string.py | 89.10% <0.00%> (+0.12%) ⬆️ |
| python/cudf/cudf/core/groupby/groupby.py | 91.72% <0.00%> (+0.22%) ⬆️ |
| python/cudf/cudf/core/tools/datetimes.py | 84.49% <0.00%> (+0.30%) ⬆️ |
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb03c8b...18a565b. Read the comment docs.

@vyasr vyasr left a comment

Nice find, LGTM. I assume unifying the different methods in #10599 is pending a longer discussion, perhaps about porting to libcu++ eventually?

davidwendt commented Apr 8, 2022

> Nice find, LGTM. I assume unifying the different methods in #10599 is pending a longer discussion, perhaps about porting to libcu++ eventually?

Yes. The current discussion is about having cuIO reuse this stod function. The original complaint was that the two code paths produced different results, so Spark changed their code from using to_floats() back to using cuIO. I thought this change might allow them to move back to to_floats() until #10599 is resolved. @andygrove

@karthikeyann karthikeyann left a comment

LGTM 👍

@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 012af64 into rapidsai:branch-22.06 Apr 11, 2022
@davidwendt davidwendt deleted the stod-accuracy-improvement branch April 11, 2022 19:45
rapids-bot bot pushed a commit that referenced this pull request Apr 23, 2022
Fixes a rounding error on extremely small floating-point numbers in the range `1E-287` to `1E-307`. These values were incorrectly rounded to zero as a result of the change in #10622. The extra floating-point operation removed in #10622 is necessary to keep values in this range from being converted to zero.

The fix adds a check so the extra floating-point operation is only used when the overall exponent falls below `std::numeric_limits<double>::min_exponent10` (which is `-307`). The `ToFloat64` gtest was also updated to include values in this range to ensure this error does not occur again.

Additionally, the conversion now supports subnormal numbers, that is, very small values in the range `E-307` to `E-324`.
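A rough Python sketch of why a second scaling step can matter for very small values follows. It rests on an assumption that the commit message does not spell out: that the single-step version fails because the power of ten itself overflows to infinity once the parsed digits push the working exponent below -307. The names `pow10`, `scale_single_step`, and `scale_two_step` are hypothetical and do not come from libcudf.

```python
import sys

MIN_EXP10 = sys.float_info.min_10_exp  # -307 for IEEE 754 doubles


def pow10(n: int) -> float:
    """10**n as a float for n >= 0, saturating to infinity on overflow."""
    try:
        return float(10 ** n)
    except OverflowError:
        return float("inf")


def scale_single_step(digits: int, exp10: int) -> float:
    """One multiply or divide; 10**|exp10| may overflow to infinity."""
    scale = pow10(abs(exp10))
    return digits / scale if exp10 < 0 else digits * scale


def scale_two_step(digits: int, exp10: int) -> float:
    """Split the division in two when exp10 is below MIN_EXP10."""
    if exp10 < MIN_EXP10:
        extra = MIN_EXP10 - exp10           # part of the exponent beyond -307
        partial = digits / pow10(extra)     # first, the extra operation
        return partial / pow10(-MIN_EXP10)  # then divide by 10**307
    return scale_single_step(digits, exp10)


# 19 parsed digits with a working exponent of -318 represent about 1.23e-300.
digits, exp10 = 1234567890123456789, -318
print(scale_single_step(digits, exp10))  # collapses to 0.0
print(scale_two_step(digits, exp10))     # a small nonzero value
```

Under the same assumption, the two-step path also yields nonzero results for subnormal inputs such as `scale_two_step(5, -324)`.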

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #10672