Avoid overflow for fixed_point round #9809

Merged: 8 commits merged into rapidsai:branch-22.02 on Jan 10, 2022

Conversation

@sperlingxx (Contributor) commented on Dec 1, 2021:

This PR removes a potential overflow in fixed_point round caused by an excessively large scale movement.

If the scale movement is larger than the max precision of the current type, the pow operation (std::pow(10, scale_movement)) produces a value that overflows the current type. Fortunately, we can simply output a zero column, because no digits can survive such a large scale movement.
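
A minimal sketch of the guard described above (the helper name and structure are illustrative, not the exact libcudf code; Type stands for the fixed_point rep type such as int32_t or int64_t):

#include <cstdlib>
#include <limits>

// Returns true when the requested scale movement shifts out every digit that
// Type can hold, so std::pow(10, scale_movement) would overflow Type and the
// rounded column is all zeros anyway.
template <typename Type>
bool round_result_is_always_zero(int decimal_places, int input_scale)
{
  // Max decimal digits Type is guaranteed to hold, e.g. 9 for int32_t.
  constexpr int max_precision = std::numeric_limits<Type>::digits10;
  int const scale_movement = std::abs(decimal_places + input_scale);
  return scale_movement > max_precision;
}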

@sperlingxx sperlingxx requested a review from a team as a code owner December 1, 2021 10:03
@github-actions bot added the libcudf label on Dec 1, 2021
@sperlingxx sperlingxx added the 4 - Needs Review, non-breaking, and improvement labels and removed the libcudf label on Dec 1, 2021
@sperlingxx sperlingxx requested a review from revans2 December 1, 2021 10:05
@codecov bot commented on Dec 1, 2021:

Codecov Report

Merging #9809 (ceb1c27) into branch-22.02 (967a333) will decrease coverage by 0.05%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.02    #9809      +/-   ##
================================================
- Coverage         10.49%   10.43%   -0.06%     
================================================
  Files               119      119              
  Lines             20305    20458     +153     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18324     +149     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jrhemstad jrhemstad requested a review from codereport December 1, 2021 17:47
@jrhemstad (Contributor) commented:

@sperlingxx please update the PR description with how the current rounding logic is deficient and how it is being improved.

@github-actions bot added the libcudf label on Dec 2, 2021
@sperlingxx (Contributor, Author) commented on Dec 2, 2021:

Hi @revans2, I ran the benchmark via Java tests. The current implementation didn't outperform the original one. Here are the detailed results:

original impl:

1000 times average DECIMAL32 HALF_UP : 251 MicroSec.
1000 times average DECIMAL32 HALF_EVEN : 253 MicroSec.
1000 times average DECIMAL64 HALF_UP : 492 MicroSec.
1000 times average DECIMAL64 HALF_EVEN : 494 MicroSec.
1000 times average DECIMAL128 HALF_UP : 975 MicroSec.
1000 times average DECIMAL128 HALF_EVEN : 975 MicroSec.

current impl:

1000 times average DECIMAL32 HALF_UP : 252 MicroSec.
1000 times average DECIMAL32 HALF_EVEN : 252 MicroSec.
1000 times average DECIMAL64 HALF_UP : 493 MicroSec.
1000 times average DECIMAL64 HALF_EVEN : 494 MicroSec.
1000 times average DECIMAL128 HALF_UP : 976 MicroSec.
1000 times average DECIMAL128 HALF_EVEN : 977 MicroSec.

The test script is shown below:

  @Test
  void decimalRoundPerfTest() {
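    // Benchmark: fill a 2^24-row decimal column with a constant value, then
    // average the wall-clock cost of round() over 1000 iterations per type/mode.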
    int[] dec32Raw = new int[1 << 24];
    Arrays.fill(dec32Raw, 1555555555);
    long startTime, endTime;
    long avgCostHalfUp, avgCostHalfEven;
    avgCostHalfUp = 0L;
    avgCostHalfEven = 0L;
    try (ColumnVector input = ColumnVector.decimalFromInts(-2, dec32Raw)) {
      for (int i = 0; i < 1000; i++) {
        startTime = System.nanoTime();
        try (ColumnVector halfUp = input.round(-5, RoundMode.HALF_UP)) {
          endTime = System.nanoTime();
          avgCostHalfUp += (endTime - startTime) / 1000L;
        }
        startTime = System.nanoTime();
        try (ColumnVector halfEven = input.round(-5, RoundMode.HALF_EVEN)) {
          endTime = System.nanoTime();
          avgCostHalfEven += (endTime - startTime) / 1000L;
        }
      }
    }
    System.err.println("1000 times average DECIMAL32 HALF_UP : " + (avgCostHalfUp / 1000L) + " MicroSec.");
    System.err.println("1000 times average DECIMAL32 HALF_EVEN : " + (avgCostHalfEven / 1000L) + " MicroSec.");

    long[] dec64Raw = new long[1 << 24];
    Arrays.fill(dec64Raw, 1555555555555555555L);
    avgCostHalfUp = 0L;
    avgCostHalfEven = 0L;
    try (ColumnVector input = ColumnVector.decimalFromLongs(-2, dec64Raw)) {
      for (int i = 0; i < 1000; i++) {
        startTime = System.nanoTime();
        try (ColumnVector halfUp = input.round(-14, RoundMode.HALF_UP)) {
          endTime = System.nanoTime();
          avgCostHalfUp += (endTime - startTime) / 1000L;
        }
        startTime = System.nanoTime();
        try (ColumnVector halfEven = input.round(-14, RoundMode.HALF_EVEN)) {
          endTime = System.nanoTime();
          avgCostHalfEven += (endTime - startTime) / 1000L;
        }
      }
    }
    System.err.println("1000 times average DECIMAL64 HALF_UP : " + (avgCostHalfUp / 1000L) + " MicroSec.");
    System.err.println("1000 times average DECIMAL64 HALF_EVEN : " + (avgCostHalfEven / 1000L) + " MicroSec.");

    BigInteger[] dec128Raw = new BigInteger[1 << 24];
    Arrays.fill(dec128Raw, new BigInteger("-155555555555555555555555555555555555"));
    avgCostHalfUp = 0L;
    avgCostHalfEven = 0L;
    try (ColumnVector input = ColumnVector.decimalFromBigInt(-2, dec128Raw)) {
      for (int i = 0; i < 1000; i++) {
        startTime = System.nanoTime();
        try (ColumnVector halfUp = input.round(-30, RoundMode.HALF_UP)) {
          endTime = System.nanoTime();
          avgCostHalfUp += (endTime - startTime) / 1000L;
        }
        startTime = System.nanoTime();
        try (ColumnVector halfEven = input.round(-30, RoundMode.HALF_EVEN)) {
          endTime = System.nanoTime();
          avgCostHalfEven += (endTime - startTime) / 1000L;
        }
      }
    }
    System.err.println("1000 times average DECIMAL128 HALF_UP : " + (avgCostHalfUp / 1000L) + " MicroSec.");
    System.err.println("1000 times average DECIMAL128 HALF_EVEN : " + (avgCostHalfEven / 1000L) + " MicroSec.");
  }

According to these results, I think I should revert most of the changes, keeping only the bypass for scale movements that are out of range.

@sperlingxx sperlingxx changed the title from "Improve round for fixed_point types" to "Avoid overflow for fixed_point round" on Dec 2, 2021
@sperlingxx (Contributor, Author) commented on Dec 2, 2021:

> @sperlingxx please update the PR description with how the current rounding logic is deficient and how it is being improved.

Hi @jrhemstad, I have reverted the "improvements", since they don't really boost performance according to the benchmark tests.

@jrhemstad (Contributor) commented:

Why are we making overflow return 0 instead of just letting it overflow? We don't do that for any other type. Or with fixed point we could throw an exception since we can detect overflow from the scales alone.

@revans2 (Contributor) commented on Dec 2, 2021:

> Why are we making overflow return 0 instead of just letting it overflow? We don't do that for any other type. Or with fixed point we could throw an exception since we can detect overflow from the scales alone.

Because Spark will overflow for ints but not decimals :). We ran into some issues with rounding decimals and put in a fix. We thought it would be good to push the changes back, but if it is going to cause consistency issues, we can keep our current solution.

Technically, 0 is the correct answer. If I want to round a DECIMAL32 with a scale of -50 to a scale of 0, it does not matter what value is stored in the decimal; the result will always be 0. It just happens that the previous algorithm would overflow at about the same point. The comment should probably be updated to reflect this instead of just calling out overflow, but that is moot if we are not going to make this change.

For integer types we could run into a similar situation: if I have an INT8 and ask to round to 5 digits, the result would always be 0, but the Spark code overflows in exactly the same way as the CUDF code does. So I would prefer not to change it.
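
(For illustration only, not code from this PR: a tiny standalone C++ demo of the INT8 case, where the divisor 10^5 cannot even be represented in the 8-bit type, so it wraps before any rounding happens.)

#include <cstdint>
#include <cstdio>

int main()
{
  // Rounding an INT8 value to 5 digits needs a divisor of 10^5 = 100000,
  // but int8_t tops out at 127.
  int32_t divisor = 1;
  for (int i = 0; i < 5; ++i) divisor *= 10;  // 100000
  std::printf("10^5 = %d, INT8_MAX = %d\n", divisor, INT8_MAX);
  // Narrowing wraps modulo 256 on typical implementations: 100000 -> -96.
  std::printf("as int8_t: %d\n", static_cast<int>(static_cast<int8_t>(divisor)));
}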

For floating point, rounding is handled by a C function, and it looks like if you try to round to a decimal place larger than the output can support, CUDF returns a NaN while Spark returns 0.0. So it is kind of a corner case here too.

@revans2 (Contributor) commented on Dec 2, 2021:

Actually for integers it is not always 100% the same as with Spark. But it is close enough that we are okay with it.

@codereport (Contributor) commented:

> Technically, 0 is the correct answer. If I want to round a DECIMAL32 with a scale of -50 to a scale of 0, it does not matter what value is stored in the decimal; the result will always be 0.

Could you elaborate on this? If you have -50 (or 50 in libcudf), you end up with something like 0.000...0123, which will round to 0 when converted to scale 0. However, if you look at the tests added by @sperlingxx, we are going from scale 1, with values like 1.4 and 2.5, to scale 11, with values of 0. That seems odd to me.

@revans2 (Contributor) commented on Dec 7, 2021:

> Technically, 0 is the correct answer. If I want to round a DECIMAL32 with a scale of -50 to a scale of 0, it does not matter what value is stored in the decimal; the result will always be 0.
>
> Could you elaborate on this?

My example was a bad one. The following should hopefully be much more concrete. Note that this is a corner case in Spark.

The issue we are actually trying to fix/push back is https://github.com/NVIDIA/spark-rapids/blob/1e95655d7fee525bfb12dd445118ba18f5b18296/sql-plugin/src/main/scala/com/nvidia/spark/rapids/DecimalUtil.scala#L71-L99

Type const n = std::pow(10, std::abs(decimal_places + input.type().scale()));

corresponds directly to the check we are making in the Scala code. For example, suppose my input is a Decimal32, which can hold up to 9 decimal digits without overflowing, and I want to go from a Spark Decimal(9, 9), which corresponds to a CUDF Decimal32 with scale -9, and round to a Spark Decimal(1, -1), which corresponds to a CUDF Decimal32 with scale 1. Then n overflows because it is std::pow(10, abs(-1 + -9)) = 10^10, which is far larger than the roughly 2.1 * 10^9 an int32_t can hold.
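
(A standalone illustration, not code from the PR, of why that n cannot be stored in Decimal32's rep type:)

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main()
{
  // Moving a Decimal32 from CUDF scale -9 to scale 1 requires n = 10^10.
  double n = std::pow(10.0, std::abs(-1 + -9));             // 1e10
  std::printf("n = %.0f, INT32_MAX = %d\n", n, INT32_MAX);  // 1e10 >> 2147483647
  // Converting 1e10 to int32_t (as `Type const n = ...` does) is undefined
  // behaviour, so the divisor is garbage and the rounded result is wrong.
}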

Without the fix:

val df = Seq("0.111111111", "0.999999999", "0.987654321").toDF("a")
spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
df.repartition(1).selectExpr("round(CAST(a AS Decimal(9, 9)), -1) as v").show()
+---+
|  v|
+---+
| 10|
| 10|
| 10|
+---+

The CPU shows these as 0, which is what I am arguing should be correct.

If I change the precision to 10, so it ends up as a Decimal64 instead of a Decimal32, n no longer overflows and we get the same answer as the CPU: 0.

What @sperlingxx found is that my code is inefficient: we don't need to truncate the number before rounding, because the result will always be 0.

> If you have -50 (or 50 in libcudf), you end up with something like 0.000...0123, which will round to 0 when converted to scale 0. However, if you look at the tests added by @sperlingxx, we are going from scale 1, with values like 1.4 and 2.5, to scale 11, with values of 0. That seems odd to me.

I think the scale may be the opposite of what you have described (Spark scale vs. CUDF scale; if I messed it up somewhere, I apologize). The test that was added goes from values like 140 and 250, where the last digit is always 0 because the scale is 1 (i.e. 14 * 10^1 and 25 * 10^1), to something where decimal_places is -11, which means that the last 11 digits will always be 0.

@codereport (Contributor) commented:

> I think the scale may be the opposite of what you have described (Spark scale vs. CUDF scale; if I messed it up somewhere, I apologize). The test that was added goes from values like 140 and 250, where the last digit is always 0 because the scale is 1 (i.e. 14 * 10^1 and 25 * 10^1), to something where decimal_places is -11, which means that the last 11 digits will always be 0.

You are correct, I got confused - not you. So it is 140 and 250 turning into 0s when the scale goes from 1 to -11, which will overflow. I'm still not convinced that on the C++ side we would expect 0s; my gut tells me we would inherit whatever the underlying int's behaviour is (i.e., overflow).

@revans2 Why does the CPU code result in zeros? I assume this happens on the Spark side of things?

@revans2 (Contributor) commented on Dec 9, 2021:

Spark uses Java's BigDecimal to do the rounding for all of the number types. BigDecimal does not care about decimal64 vs decimal32. It computes an answer with effectively unlimited precision and then casts the result back to the desired type.
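
(A rough C++ analog of that compute-wide-then-narrow behaviour, illustrative only and assuming the GCC/Clang __int128 extension; this is not what Spark or libcudf actually does internally:)

#include <cstdio>

int main()
{
  // 0.111111111 as a Spark Decimal(9, 9): unscaled rep = 111111111.
  __int128 value = 111111111;
  // Rounding to Decimal(1, -1) needs a divisor of 10^10, which easily fits
  // in the wide intermediate type.
  __int128 n = 1;
  for (int i = 0; i < 10; ++i) n *= 10;
  // HALF_UP in the wide type, then narrow back: 111111111 / 10^10 -> 0,
  // matching the CPU result shown above.
  __int128 rounded = (value + n / 2) / n;
  std::printf("rounded = %d\n", static_cast<int>(rounded));  // prints 0
}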

@revans2 (Contributor) commented on Dec 9, 2021:

If CUDF just wants to overflow because it is more C++-like, we can just add some comments to the APIs explaining when they will overflow. I just find it rather odd that the C++ thing to do is to return an answer that is patently wrong instead of one that is arguably correct, especially when there is no added cost to doing so. That just feels counter to good API design in general.

@codereport (Contributor) commented:

> If CUDF just wants to overflow because it is more C++-like, we can just add some comments to the APIs explaining when they will overflow. I just find it rather odd that the C++ thing to do is to return an answer that is patently wrong instead of one that is arguably correct, especially when there is no added cost to doing so. That just feels counter to good API design in general.

@jrhemstad @harrism What do you think?

@jrhemstad (Contributor) commented:

> If CUDF just wants to overflow because it is more C++-like, we can just add some comments to the APIs explaining when they will overflow. I just find it rather odd that the C++ thing to do is to return an answer that is patently wrong instead of one that is arguably correct, especially when there is no added cost to doing so. That just feels counter to good API design in general.
>
> @jrhemstad @harrism What do you think?

I don't have strong feelings either way. fixed_point is unique in that we can detect overflow from the host, and there isn't any performance cost to doing something special. Overflow of signed integers is UB anyway, so returning 0 is perfectly fine. There is some maintenance cost, but it's minor, so I'm fine with this.

@codereport (Contributor) left a review:

Change is fine with me 👍

In cpp/src/round/round.cu:

input.end<Type>(),
out_view.begin<Type>(),
FixedPointRoundFunctor{n});
constexpr int max_precision = cuda::std::numeric_limits<Type>::digits10;
A reviewer (Contributor) commented:

Does this need to be stored as a variable when it is only used in one place?

@sperlingxx (Contributor, Author) replied:

No, it doesn't. I removed it.

@sperlingxx (Contributor, Author) commented:

rerun tests

@sperlingxx (Contributor, Author) commented:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8ba8774 into rapidsai:branch-22.02 Jan 10, 2022
@sperlingxx sperlingxx deleted the opt_dec_round branch January 10, 2022 10:17
sperlingxx added a commit to NVIDIA/spark-rapids that referenced this pull request Feb 23, 2022
Signed-off-by: sperlingxx <[email protected]>

Closes #3793

Pushes cuDF-related decimal utilities down to cuDF. This PR relies on cuDF changes rapidsai/cudf#9809, rapidsai/cudf#9907, and rapidsai/cudf#10316.