Avoid overflow for fixed_point round #9809
Conversation
Signed-off-by: sperlingxx <[email protected]>
Codecov Report
@@               Coverage Diff                @@
##           branch-22.02    #9809      +/-   ##
================================================
- Coverage         10.49%   10.43%   -0.06%
================================================
  Files               119      119
  Lines             20305    20458     +153
================================================
+ Hits               2130     2134       +4
- Misses            18175    18324     +149
Continue to review full report at Codecov.
@sperlingxx please update the PR description with how the current rounding logic is deficient and how it is being improved.
Hi @revans2, I ran the benchmark via Java tests. The current implementation didn't outperform the original one. Here are the detailed results:
The test script is shown below:

@Test
void decimalRoundPerfTest() {
  // DECIMAL32: 1 << 24 rows at scale -2
  int[] dec32Raw = new int[1 << 24];
  Arrays.fill(dec32Raw, 1555555555);
  long startTime, endTime;
  long avgCostHalfUp, avgCostHalfEven;
  avgCostHalfUp = 0L;
  avgCostHalfEven = 0L;
  try (ColumnVector input = ColumnVector.decimalFromInts(-2, dec32Raw)) {
    for (int i = 0; i < 1000; i++) {
      startTime = System.nanoTime();
      try (ColumnVector halfUp = input.round(-5, RoundMode.HALF_UP)) {
        endTime = System.nanoTime();
        avgCostHalfUp += (endTime - startTime) / 1000L; // nanoseconds -> microseconds
      }
      startTime = System.nanoTime();
      try (ColumnVector halfEven = input.round(-5, RoundMode.HALF_EVEN)) {
        endTime = System.nanoTime();
        avgCostHalfEven += (endTime - startTime) / 1000L;
      }
    }
  }
  System.err.println("1000 times average DECIMAL32 HALF_UP : " + (avgCostHalfUp / 1000L) + " MicroSec.");
  System.err.println("1000 times average DECIMAL32 HALF_EVEN : " + (avgCostHalfEven / 1000L) + " MicroSec.");

  // DECIMAL64: 1 << 24 rows at scale -2
  long[] dec64Raw = new long[1 << 24];
  Arrays.fill(dec64Raw, 1555555555555555555L);
  avgCostHalfUp = 0L;
  avgCostHalfEven = 0L;
  try (ColumnVector input = ColumnVector.decimalFromLongs(-2, dec64Raw)) {
    for (int i = 0; i < 1000; i++) {
      startTime = System.nanoTime();
      try (ColumnVector halfUp = input.round(-14, RoundMode.HALF_UP)) {
        endTime = System.nanoTime();
        avgCostHalfUp += (endTime - startTime) / 1000L;
      }
      startTime = System.nanoTime();
      try (ColumnVector halfEven = input.round(-14, RoundMode.HALF_EVEN)) {
        endTime = System.nanoTime();
        avgCostHalfEven += (endTime - startTime) / 1000L;
      }
    }
  }
  System.err.println("1000 times average DECIMAL64 HALF_UP : " + (avgCostHalfUp / 1000L) + " MicroSec.");
  System.err.println("1000 times average DECIMAL64 HALF_EVEN : " + (avgCostHalfEven / 1000L) + " MicroSec.");

  // DECIMAL128: 1 << 24 rows at scale -2
  BigInteger[] dec128Raw = new BigInteger[1 << 24];
  Arrays.fill(dec128Raw, new BigInteger("-155555555555555555555555555555555555"));
  avgCostHalfUp = 0L;
  avgCostHalfEven = 0L;
  try (ColumnVector input = ColumnVector.decimalFromBigInt(-2, dec128Raw)) {
    for (int i = 0; i < 1000; i++) {
      startTime = System.nanoTime();
      try (ColumnVector halfUp = input.round(-30, RoundMode.HALF_UP)) {
        endTime = System.nanoTime();
        avgCostHalfUp += (endTime - startTime) / 1000L;
      }
      startTime = System.nanoTime();
      try (ColumnVector halfEven = input.round(-30, RoundMode.HALF_EVEN)) {
        endTime = System.nanoTime();
        avgCostHalfEven += (endTime - startTime) / 1000L;
      }
    }
  }
  System.err.println("1000 times average DECIMAL128 HALF_UP : " + (avgCostHalfUp / 1000L) + " MicroSec.");
  System.err.println("1000 times average DECIMAL128 HALF_EVEN : " + (avgCostHalfEven / 1000L) + " MicroSec.");
}

According to the results, I think I should revert most of the change, keeping only the bypass for scale movements that are out of range.
Hi @jrhemstad, I have reverted the "improvements", since they don't really boost the performance, according to the benchmark tests.
Why are we making overflow return 0 instead of just letting it overflow? We don't do that for any other type. Or with fixed point we could throw an exception, since we can detect overflow from the scales alone.
Because Spark will overflow for ints but not decimal :). We ran into some issues with rounding decimals and put in a fix. We thought it would be good to push the changes back, but if it is going to cause issues with consistency we can keep our current solution.

Technically 0 is the correct answer. If I want to round a DECIMAL32 with a scale of -50 to a scale of 0, it does not matter what value is stored in the decimal: the result will always be 0. It just happens that the algorithm before would overflow at about this same time. The comment probably should be updated to reflect this instead of just calling out overflow, but that is moot if we are not going to do this change.

For integer types we could run into a similar situation, e.g. if I have an INT8 and I ask to round to 5 digits, the result would always be 0, but the Spark code overflows in exactly the same way as the CUDF code does. So I would prefer not to change it.

For floating point, rounding is handled by a C function, and it looks like if you try to round to a decimal place that is larger than the output can support, CUDF returns a NaN while Spark returns a 0.0. So it is kind of a corner case here too.
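To make the DECIMAL32 example above concrete, here is a small Java sketch (an illustration, not code from this PR), using java.math.BigDecimal, where a positive scale counts fractional digits (the opposite sign convention from libcudf):

import java.math.BigDecimal;
import java.math.RoundingMode;

public class Decimal32RoundToZero {
  public static void main(String[] args) {
    // The largest unscaled value a DECIMAL32 can hold is Integer.MAX_VALUE.
    // With an exponent of 10^-50, the value is at most ~2.1e-41, so rounding
    // to scale 0 can never produce anything other than 0.
    BigDecimal largestDec32 = BigDecimal.valueOf(Integer.MAX_VALUE, 50);
    System.out.println(largestDec32.setScale(0, RoundingMode.HALF_UP)); // prints 0
  }
}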
Actually, for integers it is not always 100% the same as with Spark, but it is close enough that we are okay with it.
Could you elaborate on this? If you have -50 (or 50 in libcudf), you end up with something like 0.000...0123, which will round to 0 when converted to scale 0. However, if you look at the tests added by @sperlingxx, we are going from scale 1 with values like 1.4 and 2.5 to scale 11 with values 0. That seems odd to me.
My example was a bad one. The following should hopefully be much more concrete. Note that this is a corner case in Spark. The issue we are actually trying to fix/push back is https://github.com/NVIDIA/spark-rapids/blob/1e95655d7fee525bfb12dd445118ba18f5b18296/sql-plugin/src/main/scala/com/nvidia/spark/rapids/DecimalUtil.scala#L71-L99

Line 255 in 8c82d6a corresponds directly to the check we are making in the Scala code.

For example, suppose my input is a Decimal32, which can hold up to 9 decimal places without overflowing, and I want to go from a Spark

Without the fix.

The CPU shows these as

If I change the precision to be 10, so it ends up as a Decimal64 instead of a Decimal32

What @sperlingxx found is that my code is inefficient and we don't need to truncate the number before rounding, because the result will always be 0.
I think the scale may be opposite of what you have described (Spark scale vs CUDF scale; if I messed it up somewhere I apologize). The test that is added is going from values like
You are correct, I got confused - not you. So it is 140 and 250 turning into 0s when the scale goes from 1 to -11, which will overflow, so I'm still not convinced on the C++ side that we would expect 0s. My gut tells me that we would inherit whatever the underlying

@revans2 Why does the CPU code result in zeros? I assume this happens on the Spark side of things?
Spark is using Java's BigDecimal to do the rounding for all of the number types. BigDecimal does not care about decimal64 vs decimal32: it computes an answer with effectively unlimited precision and then casts the result back to the desired type.
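As a rough sketch of that path (not Spark's actual code), BigDecimal reproduces the zeros reported on the CPU for the test values mentioned above:

import java.math.BigDecimal;
import java.math.RoundingMode;

public class BigDecimalRounding {
  public static void main(String[] args) {
    for (String s : new String[] {"1.4", "2.5"}) {
      // A negative BigDecimal scale rounds to the nearest 10^11; the result is
      // 0E+11, which is numerically zero, regardless of the eventual storage type.
      BigDecimal rounded = new BigDecimal(s).setScale(-11, RoundingMode.HALF_UP);
      System.out.println(s + " -> " + rounded + ", equals zero: "
          + (rounded.compareTo(BigDecimal.ZERO) == 0));
    }
  }
}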
If CUDF just wants to overflow because it is more C++-like, we can just add some comments to the APIs explaining when they will overflow. I just find it rather odd that the C++ thing to do is to return an answer that is patently wrong instead of one that is arguably correct, especially when there is no added cost to doing so. That just feels counter to any good API design in general.
@jrhemstad @harrism What do you think?
I don't have strong feelings either way. |
Change is fine with me 👍
cpp/src/round/round.cu
Outdated
input.end<Type>(),
out_view.begin<Type>(),
FixedPointRoundFunctor{n});
constexpr int max_precision = cuda::std::numeric_limits<Type>::digits10; |
Does this need to be stored as a variable when it is only used in one place?
No, it doesn't. I removed it.
rerun tests
@gpucibot merge
Signed-off-by: sperlingxx <[email protected]> Closes #3793. Pushes cuDF-related decimal utilities down to cuDF. This PR relies on the cuDF changes rapidsai/cudf#9809, rapidsai/cudf#9907, and rapidsai/cudf#10316.
This PR gets rid of the potential overflow in fixed_point round caused by an overly large scale movement.

If the scale movement is larger than the maximum precision of the current type, the pow operation (std::pow(10, scale_movement)) produces a value that overflows the current type. Fortunately, we can simply output a zero column, because no digits can survive such a large scale movement.
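For illustration, a minimal Java sketch of both the failure mode and the bypass described above (assumed constants; this is not the libcudf implementation):

public class ScaleMovementOverflow {
  public static void main(String[] args) {
    // A DECIMAL32 is backed by a 32-bit integer, so a scale movement of 11
    // would require a factor of 10^11, which does not fit in the type.
    long factor = (long) Math.pow(10, 11); // 100000000000
    int wrapped = (int) factor;            // wraps to 1215752192: garbage if used for rounding
    System.out.println(factor + " as int32 becomes " + wrapped);

    // The bypass: when the scale movement exceeds the maximum precision of the
    // representation type (9 digits for DECIMAL32, 18 for DECIMAL64, 38 for
    // DECIMAL128), no digit of the input can survive, so the result column can
    // simply be filled with zeros instead of running the rounding kernel.
    int scaleMovement = 11;
    int maxPrecision = 9; // cuda::std::numeric_limits<int32_t>::digits10
    System.out.println("bypass with zeros: " + (Math.abs(scaleMovement) > maxPrecision));
  }
}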