[REVIEW] Support fixed-point decimal for HostColumnVector [skip ci] #6609
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Codecov Report
@@ Coverage Diff @@
## branch-0.17 #6609 +/- ##
===============================================
+ Coverage 82.33% 82.67% +0.33%
===============================================
Files 94 91 -3
Lines 15369 15103 -266
===============================================
- Hits 12654 12486 -168
+ Misses 2715 2617 -98
Continue to review full report at Codecov.
Force-pushed from 9b9ff78 to f877b90
* Create a new vector from the given values. This API supports inline nulls, but it is inefficient.
* Notice all input BigDecimals should share same scale.
*/
public static HostColumnVector fromDecimals(BigDecimal... values) {
I am a little concerned about this API and would like feedback on it. By default BigDecimal sets the scale dynamically to best match the floating point value passed in, and similarly for the precision, unless you pass in a MathContext, but that is only for precision.
https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#BigDecimal-double-
This is the same for string representations too. This means that as a user I will likely have to set the scale on every BigDecimal I create.
HostColumnVector.fromDecimals(new BigDecimal(1.1).setScale(1), new BigDecimal(1.2).setScale(1))
Along with this there is no way for me to set the precision on a BigDecimal. There is no way for me to store all 0 values as a DECIMAL64. I would have to cast it after creating it. All of this feels cumbersome when what I want to do is to create a column for testing as quickly as possible.
Would it be better to have an API that takes a scale and precision (32/64 bit) followed by a list of floating point values instead of, or in addition to, this one?
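For illustration only, a small plain-JDK demo of the scale behavior described above (nothing here is cudf-specific):

import java.math.BigDecimal;
import java.math.RoundingMode;

public class BigDecimalScaleDemo {
    public static void main(String[] args) {
        // The double constructor keeps the exact binary expansion of 1.1, so the
        // resulting scale is large and rarely what the caller had in mind.
        BigDecimal fromDouble = new BigDecimal(1.1);
        System.out.println(fromDouble);         // 1.100000000000000088817841970012523233890533447265625
        System.out.println(fromDouble.scale()); // a large scale tracking the binary value

        // The String constructor (and BigDecimal.valueOf, which goes through
        // Double.toString) yields the expected scale of 1.
        System.out.println(new BigDecimal("1.1").scale());   // 1
        System.out.println(BigDecimal.valueOf(1.1).scale()); // 1

        // So today, building test data means normalizing scales by hand:
        BigDecimal normalized = fromDouble.setScale(1, RoundingMode.HALF_UP); // 1.1
        System.out.println(normalized);
    }
}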
With the new method decimalFromLongs(int scale, long... unscaledValue), we can store all 0L values into a columnVector backed by DECIMAL64. Another method, decimalFromInts(int scale, int... unscaledValue), enables creating a columnVector with DECIMAL32.
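A rough usage sketch of these two methods, assuming the package used by the cudf Java bindings and the scale convention noted later in this thread (value = unscaledValue * 10^scale):

import ai.rapids.cudf.HostColumnVector;

class DecimalFromUnscaledSketch {
    static void sketch() {
        // All-zero DECIMAL64 column: unscaled 0L values with scale -2 represent 0.00.
        try (HostColumnVector zeros = HostColumnVector.decimalFromLongs(-2, 0L, 0L, 0L)) {
            // vector is backed by DECIMAL64
        }
        // DECIMAL32 column: unscaled ints 123 and 456 with scale -2, i.e. 1.23 and 4.56.
        try (HostColumnVector cents = HostColumnVector.decimalFromInts(-2, 123, 456)) {
            // vector is backed by DECIMAL32
        }
    }
}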
In terms of floating point values, I am not sure whether to round them on the JVM side or on the GPU side. The former looks inefficient, while the latter seems to depend on decimal rounding, which hasn't been supported yet in libcudf?
Along with this there is no way for me to set the precision on a BigDecimal. There is no way for me to store all 0 values as a DECIMAL64.
I thought about this before, but there technically is a way to build it. You'd have to do something like this:
ColumnVector.build(DType.create(DTypeEnum.DECIMAL64, myScale), myDecimals.length, (b) -> b.appendArray(myDecimals))
Not super simple, but it is possible.
However, the main point about BigDecimal scale being problematic, since it can adjust during computation, is well taken. Should the BigDecimal form automatically find the largest scale, use that for the scale and adjust the scales of all other BigDecimal values to match, using the maximum precision after scale adjustments to determine the underlying cudf type? And/or should the user specify the cudf decimal type along with the list of BigDecimals? Or should we just ditch BigDecimal and have the user translate them to int/long values themselves? 😄
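To make the first option concrete, here is a sketch of what that normalization might look like. It is illustrative only: DType.create, DType.DTypeEnum, and DECIMAL32_MAX_PRECISION mirror names used elsewhere in this PR, while the method itself is hypothetical.

import java.math.BigDecimal;
import ai.rapids.cudf.DType;

class ScaleNormalizationSketch {
    // Widen every value in place to the largest BigDecimal scale, then pick the
    // cudf type from the resulting maximum precision. Nulls pass through untouched.
    static DType pickType(BigDecimal[] values) {
        int maxScale = 0;
        for (BigDecimal v : values) {
            if (v != null) maxScale = Math.max(maxScale, v.scale());
        }
        int maxPrecision = 1;
        for (int i = 0; i < values.length; i++) {
            if (values[i] == null) continue;
            values[i] = values[i].setScale(maxScale);  // widening only, never rounds
            maxPrecision = Math.max(maxPrecision, values[i].precision());
        }
        DType.DTypeEnum kind = maxPrecision <= DType.DECIMAL32_MAX_PRECISION
            ? DType.DTypeEnum.DECIMAL32 : DType.DTypeEnum.DECIMAL64;
        // cudf scale is the negation of BigDecimal scale.
        return DType.create(kind, -maxScale);
    }
}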
So there are two use cases here that we have to worry about. One of them is the advanced API where what we care about is efficiency. These are the ones where we are moving around ints or longs and making it all fit. In that case we are appending a row at a time, so buffering up all the decimal values in an array and then putting them in with
ColumnVector.build(DType.create(DTypeEnum.DECIMAL64, myScale), myDecimals.length, (b) -> b.appendArray(myDecimals))
is not going to work well.
The second set of APIs needs to be about testing and making it clean and simple to build exactly what we want for testing. The main point of my question is not CAN we do it, but is there a cleaner way to do it. How do you want the tests to look? The main question for me is around floating point: do we want to deal with floating point values when working with decimal, or do we just want to use Strings for it? Fun fact: when BigDecimal converts from a float it turns the float into a String first.
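For the fun fact, the BigDecimal.valueOf(double) Javadoc specifies exactly that round trip through the String form; a tiny demo:

import java.math.BigDecimal;

public class ValueOfDemo {
    public static void main(String[] args) {
        // BigDecimal.valueOf(double) is specified to behave like
        // new BigDecimal(Double.toString(d)), which is why it produces the
        // "expected" 1.1 rather than the raw binary expansion.
        System.out.println(BigDecimal.valueOf(1.1));              // 1.1
        System.out.println(new BigDecimal(Double.toString(1.1))); // 1.1
        System.out.println(new BigDecimal(1.1));                  // 1.1000000000000000888...
    }
}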
I tried to add the method
public static ColumnVector decimalFromDoubles(DType.DTypeEnum type, int scale, double... values);
to deal with floating point values. The basic idea is from @revans2's suggestion of "an API that takes a scale and precision (32/64 bit) followed by a list of floating point values".
Instead of converting floating point values to strings, decimalFromDoubles rescales the float numbers and extracts the integral part directly, since we don't have to handle float numbers that overflow int64.
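Roughly, the rescale-and-extract idea amounts to something like the following sketch. It is not the PR's exact code, and the approach is reworked later in this review:

class DirectRescaleSketch {
    // value is approximately unscaled * 10^scale, so the unscaled value
    // is value * 10^(-scale).
    static long unscaledFromDouble(double value, int scale) {
        double scaled = value * Math.pow(10, -scale);
        if (scaled > Long.MAX_VALUE || scaled < Long.MIN_VALUE) {
            throw new IllegalArgumentException("value does not fit in DECIMAL64");
        }
        return (long) scaled;  // truncate any fractional residue
    }
}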
/**
* Create a new vector from the given values. This API supports inline nulls,
* but is much slower than building from primitive array of unscaledValue.
* Notice all input BigDecimals should share same scale.
I think this comment is wrong. This is for fromStrings, and I don't think we are doing any string to decimal conversion right now.
* Compared with scale of [[java.math.BigDecimal]], the scale here represents the opposite meaning.
*/
public static HostColumnVector decimalFromInts(int scale, int... values) {
if (-scale > DType.DECIMAL32_MAX_PRECISION) {
So I am a bit confused: how exactly does scale impact precision?
Let's say that we have the following:
unscaled value = 1
scale = -20
result should be the same as unscaled_value * 10 ^ scale
or 0.00000000000000000001
The precision is just about how many digits the unscaled value can hold. The only limit on scale we need to worry about is whether it fits in an int32_t, which this interface by definition enforces.
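A quick worked example of that independence (a cudf scale of -20 corresponds to a BigDecimal scale of 20):

import java.math.BigDecimal;
import java.math.BigInteger;

public class ScaleVsPrecisionDemo {
    public static void main(String[] args) {
        // unscaled value 1 with BigDecimal scale 20 (cudf scale -20):
        BigDecimal tiny = new BigDecimal(BigInteger.ONE, 20);
        System.out.println(tiny);              // 1E-20, i.e. 0.00000000000000000001
        System.out.println(tiny.precision());  // 1  -> one digit, easily fits DECIMAL32
        System.out.println(tiny.scale());      // 20 -> only needs to fit in an int
    }
}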
Yes, the scale check here is unnecessary. Basically, precision is independent of the scale. I've removed these checks. Sorry for the confusion.
Force-pushed from f78a1e4 to 21c30f7
}
((IntBuffer)buffer).put((int) scaledDouble);
} else {
if (scaledDouble > Long.MAX_VALUE || scaledDouble < Long.MIN_VALUE) {
This does not work. A double only has 53 bits of significance; the rest is used for the sign and exponent. Let's not try to scale the double ourselves. Let's have BigDecimal do it for us and then extract the value. No need to reinvent the wheel here. Plus this API is really only supposed to be for testing, so if the conversion is a little slow it is fine.
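A minimal sketch of the suggested BigDecimal-based conversion (the rounding mode here is an assumption, not something settled in this thread):

import java.math.BigDecimal;
import java.math.RoundingMode;

class BigDecimalRescaleSketch {
    // Let BigDecimal do the rescaling exactly, then extract the unscaled long.
    static long unscaledFromDouble(double value, int scale) {
        return BigDecimal.valueOf(value)
            .setScale(-scale, RoundingMode.HALF_UP)  // cudf scale is -BigDecimal scale
            .unscaledValue()
            .longValueExact();  // throws ArithmeticException if it overflows a long
    }
}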
I've re-implemented this method with BigDecimal's APIs.
Hi @jlowe, I've eliminated the implicit support for appending single ints to DECIMAL64 columnVectors.
Minor nit about an assert but otherwise looks good. This needs to be upmerged to resolve the merge conflict.
Co-authored-by: Jason Lowe <[email protected]>
Force-pushed from dfc680a to f6b8e02
This PR supports decimal fetching/appending for (Host)ColumnVector. Specifically, with this PR we are able to get/append java.math.BigDecimal from/to ColumnVector. This is a minimal piece of work to provide the essential interface for plugin-side development. Since java.math.BigDecimal is a boxed type, this PR only implemented the Builder.appendBoxed method for BigDecimal (leaving Builder.appendArray unsupported).
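As a rough usage sketch of what this enables, assuming a host-side build entry point and a getBigDecimal accessor that mirror the patterns shown earlier in the review (both are assumptions, not confirmed signatures):

import ai.rapids.cudf.DType;
import ai.rapids.cudf.HostColumnVector;
import java.math.BigDecimal;

class AppendBoxedSketch {
    static void sketch() {
        // Build a DECIMAL64 host column (value = unscaled * 10^-2) by appending
        // boxed BigDecimals, nulls included, then fetch a row back as BigDecimal.
        BigDecimal[] values = { new BigDecimal("1.23"), null, new BigDecimal("4.56") };
        try (HostColumnVector hcv = HostColumnVector.build(
                DType.create(DType.DTypeEnum.DECIMAL64, -2), values.length,
                (b) -> b.appendBoxed(values))) {
            BigDecimal first = hcv.getBigDecimal(0);  // assumed accessor name
            System.out.println(first);
        }
    }
}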