[REVIEW] Support fixed-point decimal for HostColumnVector [skip ci] #6609

Merged
merged 17 commits into from
Nov 5, 2020

Conversation

sperlingxx
Contributor

@sperlingxx sperlingxx commented Oct 28, 2020

This PR supports decimal fetching/appending for (Host)ColumnVector. Specifically, with this PR we are able to get/append java.math.BigDecimal from/to ColumnVector. This is the minimal work to provide the essential interfaces for plugin-side development.

  • Because java.math.BigDecimal is a boxed type, this PR only implements the Builder.appendBoxed method for BigDecimal (Builder.appendArray is left unsupported).
  • Each BigDecimal is verified before being appended to HostColumnVector.Builder, which ensures:
    • All scales in the same HostColumnVector are consistent.
    • The BigDecimal transfers to cudf::fixed_point without any precision loss; if an overflow occurs, a corresponding exception is thrown.
  • Decimal types can be used as the basic type of nested types.
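To illustrate the two verification rules above, here is a hedged plain-Java sketch; checkAndUnscale is a hypothetical stand-in, not the PR's actual Builder code:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalCheck {
    // Sketch of the checks the description says Builder performs
    // before appending a BigDecimal (names here are assumptions).
    static BigInteger checkAndUnscale(BigDecimal value, int columnScale, boolean is64Bit) {
        // check 1: every value in the column must share the column's scale
        if (value.scale() != columnScale) {
            throw new IllegalArgumentException(
                "scale mismatch: " + value.scale() + " vs " + columnScale);
        }
        // check 2: the unscaled value must fit the backing integer width
        BigInteger unscaled = value.unscaledValue();
        int maxBits = is64Bit ? 64 : 32;
        if (unscaled.bitLength() + 1 > maxBits) { // +1 for the sign bit
            throw new ArithmeticException("overflow for DECIMAL" + maxBits);
        }
        return unscaled;
    }

    public static void main(String[] args) {
        System.out.println(checkAndUnscale(new BigDecimal("12.34"), 2, false)); // 1234
    }
}
```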

@sperlingxx sperlingxx requested a review from a team as a code owner October 28, 2020 03:00
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@sperlingxx sperlingxx changed the title [Review] Support fixed-point decimal in HostColumnVector [Review] Support fixed-point decimal in HostColumnVector [skip ci] Oct 28, 2020
@sperlingxx sperlingxx changed the title [Review] Support fixed-point decimal in HostColumnVector [skip ci] [Review] Support fixed-point decimal for HostColumnVector [skip ci] Oct 28, 2020
@sperlingxx sperlingxx changed the title [Review] Support fixed-point decimal for HostColumnVector [skip ci] [REVIEW] Support fixed-point decimal for HostColumnVector [skip ci] Oct 28, 2020
@codecov

codecov bot commented Oct 28, 2020

Codecov Report

Merging #6609 into branch-0.17 will increase coverage by 0.33%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.17    #6609      +/-   ##
===============================================
+ Coverage        82.33%   82.67%   +0.33%     
===============================================
  Files               94       91       -3     
  Lines            15369    15103     -266     
===============================================
- Hits             12654    12486     -168     
+ Misses            2715     2617      -98     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/string.py 86.44% <0.00%> (-0.45%) ⬇️
python/cudf/cudf/core/dataframe.py 90.59% <0.00%> (-0.44%) ⬇️
python/cudf/cudf/core/reshape.py 89.14% <0.00%> (-0.31%) ⬇️
python/cudf/cudf/core/column/datetime.py 88.48% <0.00%> (-0.30%) ⬇️
python/cudf/cudf/core/column/categorical.py 93.11% <0.00%> (-0.23%) ⬇️
python/cudf/cudf/core/column/numerical.py 94.91% <0.00%> (-0.08%) ⬇️
python/cudf/cudf/core/groupby/groupby.py 93.18% <0.00%> (-0.06%) ⬇️
python/cudf/cudf/core/frame.py 89.83% <0.00%> (-0.03%) ⬇️
python/cudf/cudf/__init__.py 100.00% <0.00%> (ø)
python/cudf/cudf/core/buffer.py 79.04% <0.00%> (ø)
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cf7106...f6b8e02. Read the comment docs.

@jlowe jlowe added the Java Affects Java cuDF API. label Oct 28, 2020
java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
* Create a new vector from the given values. This API supports inline nulls, but it is inefficient.
* Notice all input BigDecimals should share same scale.
*/
public static HostColumnVector fromDecimals(BigDecimal... values) {
Contributor

I am a little concerned about this API and would like feedback on it. By default BigDecimal sets the scale dynamically to best match the floating point value passed in, and similarly with the precision, unless you pass in a MathContext, but that is only for precision.

https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#BigDecimal-double-

This is the same for string representations too. This means that as a user I will likely have to set the scale on every BigDecimal I create.

HostColumnVector.fromDecimals(new BigDecimal(1.1).setScale(1), new BigDecimal(1.2).setScale(1))

Along with this there is no way for me to set the precision on a BigDecimal, and there is no way for me to store all 0 values as a DECIMAL64; I would have to cast it after creating it. All of this feels cumbersome when what I want to do is create a column for testing as quickly as possible.

Would it be better to have an API that takes a scale and precision (32/64-bit) followed by a list of floating point values, instead of or in addition to this one?
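The scale behavior described above is easy to demonstrate with plain java.math.BigDecimal (illustrative snippet only, not code from the PR):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ScaleDemo {
    public static void main(String[] args) {
        // The String constructor gives exactly the digits after the point:
        System.out.println(new BigDecimal("1.1").scale());   // 1
        // The double constructor matches the exact binary value of 1.1,
        // producing a much larger scale than 1:
        System.out.println(new BigDecimal(1.1).scale() > 1); // true
        // hence the explicit setScale before building a column:
        System.out.println(new BigDecimal(1.1).setScale(1, RoundingMode.HALF_UP)); // 1.1
    }
}
```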

Contributor Author

@sperlingxx sperlingxx Oct 30, 2020

With the new method decimalFromLongs(int scale, long... unscaledValue), we can store all 0L values into a ColumnVector backed by DECIMAL64. Another method, decimalFromInts(int scale, int... unscaledValue), enables creating a ColumnVector with DECIMAL32.
In terms of floating point values, I am not sure whether to round them on the JVM side or on the GPU side. The former looks inefficient, and the latter seems to depend on decimal rounding, which hasn't been supported yet in libcudf?
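For readers following along, the relationship between the unscaled long values and the logical decimals can be sketched with plain BigDecimal. The helper fromUnscaled below is hypothetical; the sign convention assumed is cudf's fixed_point, where value = unscaled * 10^scale, the negation of BigDecimal's scale:

```java
import java.math.BigDecimal;

public class UnscaledDemo {
    // cudf's fixed_point scale is the exponent: value = unscaled * 10^scale,
    // i.e. the negation of java.math.BigDecimal's scale.
    static BigDecimal fromUnscaled(long unscaled, int cudfScale) {
        return BigDecimal.valueOf(unscaled, -cudfScale);
    }

    public static void main(String[] args) {
        // if decimalFromLongs(-2, 0L, 150L) follows this convention,
        // the column would hold these logical values:
        System.out.println(fromUnscaled(0L, -2));   // 0.00
        System.out.println(fromUnscaled(150L, -2)); // 1.50
    }
}
```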

Member

Along with this there is no way for me to set the precision on a BigDecimal. There is no way for me to store all 0 values as a DECIMAL64.

I thought about this before, but there technically is a way to build it. You'd have to do something like this:

ColumnVector.build(DType.create(DTypeEnum.DECIMAL64, myScale), myDecimals.length, (b) -> b.appendArray(myDecimals))

Not super simple, but it is possible.

However the main point about BigDecimal scale being problematic, since it can adjust during computation, is well taken. Should the BigDecimal form automatically find the largest scale, use that for the column scale, adjust the scales of all other BigDecimal values to match, and use the maximum precision after scale adjustments to determine the underlying cudf type? And/or should the user specify the cudf decimal type along with the list of BigDecimals? Or should we just ditch BigDecimal and have the user translate them to int/long values themselves? 😄
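As an illustration of the "largest scale wins" option, here is a hypothetical sketch in plain Java; unify and the strategy itself are suggestions under discussion, not code from this PR:

```java
import java.math.BigDecimal;

public class ScaleUnify {
    // Rescale every value to the maximum scale seen, so the maximum
    // precision after rescaling can pick DECIMAL32 vs DECIMAL64.
    static BigDecimal[] unify(BigDecimal... values) {
        int maxScale = 0;
        for (BigDecimal v : values) maxScale = Math.max(maxScale, v.scale());
        BigDecimal[] out = new BigDecimal[values.length];
        // increasing the scale never needs rounding, so setScale cannot throw
        for (int i = 0; i < values.length; i++) out[i] = values[i].setScale(maxScale);
        return out;
    }

    public static void main(String[] args) {
        BigDecimal[] unified = unify(new BigDecimal("1.5"), new BigDecimal("2.25"));
        int maxPrecision = 0;
        for (BigDecimal v : unified) maxPrecision = Math.max(maxPrecision, v.precision());
        System.out.println(unified[0] + " " + unified[1] + " precision=" + maxPrecision);
    }
}
```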

Contributor

So there are two use cases here that we have to worry about. One of them is the advanced API where what we care about is efficiency. These are the ones where we are moving around ints or longs and making it all fit. In that case we are appending a row at a time, so buffering up all the decimal values in an array and then putting them in with

ColumnVector.build(DType.create(DTypeEnum.DECIMAL64, myScale), myDecimals.length, (b) -> b.appendArray(myDecimals))

is not going to work well.

The second set of APIs needs to be about testing: making it clean and simple to build exactly what we want for testing. The main point of my question is not CAN we do it, but is there a cleaner way to do it? How do you want the tests to look? The main question for me is around floating point: do we want to deal with floating point values when working with decimal, or do we just want to use Strings for it? Fun fact: when BigDecimal converts from a float it turns the float into a String first.
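That "fun fact" is observable directly: BigDecimal.valueOf(double) routes through Double.toString, while the raw double constructor does not (illustrative snippet, not code from the PR):

```java
import java.math.BigDecimal;

public class FloatToDecimal {
    public static void main(String[] args) {
        // BigDecimal.valueOf(double) goes through Double.toString, so it keeps
        // the short decimal form rather than the exact binary expansion:
        System.out.println(BigDecimal.valueOf(1.1));          // 1.1
        System.out.println(BigDecimal.valueOf(1.1).scale());  // 1
        // the raw double constructor keeps the binary expansion instead,
        // so the two are not equal (different scale and unscaled value):
        System.out.println(new BigDecimal(1.1).equals(BigDecimal.valueOf(1.1))); // false
    }
}
```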

Contributor Author

I tried to add the method

public static ColumnVector decimalFromDoubles(DType.DTypeEnum type, int scale, double... values);

to deal with floating point values. The basic idea is from @revans2's suggestion of "an API that takes a scale and precision (32/64-bit) followed by a list of floating point values".
Instead of converting floating point values to strings, decimalFromDoubles rescales the float numbers and extracts the integral part directly, since we don't have to handle float numbers that overflow int64.


/**
* Create a new vector from the given values. This API supports inline nulls,
* but is much slower than building from primitive array of unscaledValue.
* Notice all input BigDecimals should share same scale.
Contributor

I think this comment is wrong. This is for fromStrings and I don't think we are doing any string to decimal conversion right now.

java/src/main/java/ai/rapids/cudf/DType.java (outdated)
* Compared with the scale of [[java.math.BigDecimal]], the scale here has the opposite sign.
*/
public static HostColumnVector decimalFromInts(int scale, int... values) {
if (-scale > DType.DECIMAL32_MAX_PRECISION) {
Contributor

So I am a bit confused: how exactly does scale impact precision?

Let's say that we have the following:

unscaled value = 1
scale = -20

The result should be the same as unscaled_value * 10 ^ scale, or 0.00000000000000000001.

The precision is just about how many digits the unscaled value can hold. The only limit on scale we need to worry about is whether it fits in an int32_t, which this interface by definition enforces.
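The distinction above can be checked directly with java.math.BigDecimal (small illustrative snippet, not from the PR; note BigDecimal's scale sign is the opposite of cudf's):

```java
import java.math.BigDecimal;

public class ScaleVsPrecision {
    public static void main(String[] args) {
        // unscaled value 1 with a cudf scale of -20 corresponds to a
        // BigDecimal with scale 20 (opposite sign conventions):
        BigDecimal v = BigDecimal.valueOf(1, 20);
        System.out.println(v.toPlainString()); // 0.00000000000000000001
        // precision counts the digits of the unscaled value only,
        // independent of how far the scale shifts the decimal point:
        System.out.println(v.precision());     // 1
    }
}
```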

Contributor Author

@sperlingxx sperlingxx Nov 2, 2020

Yes, the scale check here is unnecessary. Basically, precision is independent from the scale. I've removed these checks. Sorry for the confusion.

java/src/main/java/ai/rapids/cudf/Scalar.java (outdated)
java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
}
((IntBuffer)buffer).put((int) scaledDouble);
} else {
if (scaledDouble > Long.MAX_VALUE || scaledDouble < Long.MIN_VALUE) {
Contributor

This does not work. A double only has 53 bits of significand; the rest is used for the sign and exponent. Let's not try to scale the double ourselves; let's have BigDecimal do it for us and then extract the value. No need to reinvent the wheel here. Plus, this API is really only supposed to be for testing, so if the conversion is a little slow it is fine.
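The 53-bit limit referred to above can be verified in plain Java (illustrative snippet):

```java
public class DoubleBits {
    public static void main(String[] args) {
        // a double has 53 bits of significand, so above 2^53 whole numbers
        // start to collide -- scaling a decimal into the long range via
        // double arithmetic silently loses digits:
        double limit = 9007199254740992.0;           // 2^53
        System.out.println(limit == limit + 1.0);    // true: 2^53 + 1 is not representable

        long big = 123456789012345678L;              // needs more than 53 bits
        System.out.println((long) (double) big == big); // false: digits were lost
    }
}
```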

Contributor Author

I've re-implemented this method with BigDecimal's APIs.
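A minimal sketch of that approach, letting BigDecimal perform the rescaling; toUnscaledLong is hypothetical, not the PR's actual implementation, and it assumes cudf's scale is the negation of BigDecimal's:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DoubleRescale {
    // Let BigDecimal do the rescaling instead of multiplying the
    // double by a power of ten ourselves.
    static long toUnscaledLong(double value, int cudfScale) {
        return BigDecimal.valueOf(value)
                .setScale(-cudfScale, RoundingMode.HALF_UP)
                .unscaledValue()
                .longValueExact(); // throws on DECIMAL64 overflow instead of truncating
    }

    public static void main(String[] args) {
        System.out.println(toUnscaledLong(1.23, -2)); // 123
    }
}
```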

java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
@sperlingxx
Contributor Author

sperlingxx commented Nov 4, 2020

Hi @jlowe, I've eliminated the implicit support for appending single ints to DECIMAL64 ColumnVectors.
And I've replaced the assertion of scale equality with safe rescaling. The corresponding change is in the method fromDecimals, in which we build the DecimalType from the rescaled precision and the minimum scale.

Member

@jlowe jlowe left a comment

Minor nit about an assert but otherwise looks good. This needs to be upmerged to resolve the merge conflict.

java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
Labels
Java Affects Java cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants