[REVIEW] Support fixed-point decimal for HostColumnVector [skip ci] #6609

Merged
merged 17 commits into from
Nov 5, 2020

Conversation

sperlingxx
Contributor

@sperlingxx sperlingxx commented Oct 28, 2020

This PR supports decimal fetching/appending for (Host)ColumnVector. Specifically, with this PR we are able to get/append java.math.BigDecimal from/to ColumnVector. This is the minimal work to provide the essential interfaces for plugin-side development.

  • Because java.math.BigDecimal is a boxed type, this PR only implements the Builder.appendBoxed method for BigDecimal (Builder.appendArray is left unsupported).
  • Each BigDecimal is verified before being appended to HostColumnVector.Builder, which ensures:
    • All scales in the same HostColumnVector are consistent.
    • The BigDecimal transfers to cudf::fixed_point without any precision loss; if an overflow occurs, a corresponding exception is thrown.
  • Decimal types can be used as the basic type of nested types.
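To illustrate the two verification rules above, here is a hedged plain-Java sketch; checkAndUnscale is a hypothetical stand-in, not the PR's actual Builder code:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalCheck {
    // Sketch of the checks the description says Builder performs
    // before appending a BigDecimal (names here are assumptions).
    static BigInteger checkAndUnscale(BigDecimal value, int columnScale, boolean is64Bit) {
        // check 1: every value in the column must share the column's scale
        if (value.scale() != columnScale) {
            throw new IllegalArgumentException(
                "scale mismatch: " + value.scale() + " vs " + columnScale);
        }
        // check 2: the unscaled value must fit the backing integer width
        BigInteger unscaled = value.unscaledValue();
        int maxBits = is64Bit ? 64 : 32;
        if (unscaled.bitLength() + 1 > maxBits) { // +1 for the sign bit
            throw new ArithmeticException("overflow for DECIMAL" + maxBits);
        }
        return unscaled;
    }

    public static void main(String[] args) {
        System.out.println(checkAndUnscale(new BigDecimal("12.34"), 2, false)); // 1234
    }
}
```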

@sperlingxx sperlingxx requested a review from a team as a code owner October 28, 2020 03:00
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@sperlingxx sperlingxx changed the title [Review] Support fixed-point decimal in HostColumnVector [Review] Support fixed-point decimal in HostColumnVector [skip ci] Oct 28, 2020
@sperlingxx sperlingxx changed the title [Review] Support fixed-point decimal in HostColumnVector [skip ci] [Review] Support fixed-point decimal for HostColumnVector [skip ci] Oct 28, 2020
@sperlingxx sperlingxx changed the title [Review] Support fixed-point decimal for HostColumnVector [skip ci] [REVIEW] Support fixed-point decimal for HostColumnVector [skip ci] Oct 28, 2020
@codecov

codecov bot commented Oct 28, 2020

Codecov Report

Merging #6609 into branch-0.17 will increase coverage by 0.33%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.17    #6609      +/-   ##
===============================================
+ Coverage        82.33%   82.67%   +0.33%     
===============================================
  Files               94       91       -3     
  Lines            15369    15103     -266     
===============================================
- Hits             12654    12486     -168     
+ Misses            2715     2617      -98     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/string.py 86.44% <0.00%> (-0.45%) ⬇️
python/cudf/cudf/core/dataframe.py 90.59% <0.00%> (-0.44%) ⬇️
python/cudf/cudf/core/reshape.py 89.14% <0.00%> (-0.31%) ⬇️
python/cudf/cudf/core/column/datetime.py 88.48% <0.00%> (-0.30%) ⬇️
python/cudf/cudf/core/column/categorical.py 93.11% <0.00%> (-0.23%) ⬇️
python/cudf/cudf/core/column/numerical.py 94.91% <0.00%> (-0.08%) ⬇️
python/cudf/cudf/core/groupby/groupby.py 93.18% <0.00%> (-0.06%) ⬇️
python/cudf/cudf/core/frame.py 89.83% <0.00%> (-0.03%) ⬇️
python/cudf/cudf/__init__.py 100.00% <0.00%> (ø)
python/cudf/cudf/core/buffer.py 79.04% <0.00%> (ø)
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cf7106...f6b8e02. Read the comment docs.

@jlowe jlowe added the Java Affects Java cuDF API. label Oct 28, 2020
java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
* Create a new vector from the given values. This API supports inline nulls, but it is inefficient.
* Notice all input BigDecimals should share same scale.
*/
public static HostColumnVector fromDecimals(BigDecimal... values) {
Contributor

I am a little concerned about this API and would like feedback on it. By default BigDecimal sets the scale dynamically to best match the floating point value passed in, and similarly with the precision, unless you pass in a MathContext, but that is only for precision.

https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#BigDecimal-double-

This is the same for string representations too. This means that as a user I will likely have to set the scale on every BigDecimal I create.

HostColumnVector.fromDecimals(new BigDecimal(1.1).setScale(1), new BigDecimal(1.2).setScale(1))

Along with this there is no way for me to set the precision on a BigDecimal, and there is no way for me to store all 0 values as a DECIMAL64; I would have to cast it after creating it. All of this feels cumbersome when what I want to do is create a column for testing as quickly as possible.

Would it be better to have an API that takes a scale and precision (32/64-bit) followed by a list of floating point values, instead of or in addition to this one?
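The scale behavior described above is easy to demonstrate with plain java.math.BigDecimal (illustrative snippet only, not code from the PR):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ScaleDemo {
    public static void main(String[] args) {
        // The String constructor gives exactly the digits after the point:
        System.out.println(new BigDecimal("1.1").scale());   // 1
        // The double constructor matches the exact binary value of 1.1,
        // producing a much larger scale than 1:
        System.out.println(new BigDecimal(1.1).scale() > 1); // true
        // hence the explicit setScale before building a column:
        System.out.println(new BigDecimal(1.1).setScale(1, RoundingMode.HALF_UP)); // 1.1
    }
}
```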

Contributor Author

@sperlingxx sperlingxx Oct 30, 2020

With the new method decimalFromLongs(int scale, long... unscaledValue), we can store all 0L values into a ColumnVector backed by DECIMAL64. Another method, decimalFromInts(int scale, int... unscaledValue), enables creating a ColumnVector with DECIMAL32.
In terms of floating point values, I am not sure whether to round them on the JVM side or on the GPU side. The former looks inefficient, and the latter seems to depend on decimal rounding, which hasn't been supported yet in libcudf?
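For readers following along, the relationship between the unscaled long values and the logical decimals can be sketched with plain BigDecimal. The helper fromUnscaled below is hypothetical; the sign convention assumed is cudf's fixed_point, where value = unscaled * 10^scale, the negation of BigDecimal's scale:

```java
import java.math.BigDecimal;

public class UnscaledDemo {
    // cudf's fixed_point scale is the exponent: value = unscaled * 10^scale,
    // i.e. the negation of java.math.BigDecimal's scale.
    static BigDecimal fromUnscaled(long unscaled, int cudfScale) {
        return BigDecimal.valueOf(unscaled, -cudfScale);
    }

    public static void main(String[] args) {
        // if decimalFromLongs(-2, 0L, 150L) follows this convention,
        // the column would hold these logical values:
        System.out.println(fromUnscaled(0L, -2));   // 0.00
        System.out.println(fromUnscaled(150L, -2)); // 1.50
    }
}
```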

Member

Along with this there is no way for me to set the precision on a BigDecimal. There is no way for me to store all 0 values as a DECIMAL64.

I thought about this before, but there technically is a way to build it. You'd have to do something like this:

ColumnVector.build(DType.create(DTypeEnum.DECIMAL64, myScale), myDecimals.length, (b) -> b.appendArray(myDecimals))

Not super simple, but it is possible.

However the main point about BigDecimal scale being problematic, since it can adjust during computation, is well taken. Should the BigDecimal form automatically find the largest scale, use that for the column scale, adjust the scales of all other BigDecimal values to match, and use the maximum precision after scale adjustments to determine the underlying cudf type? And/or should the user specify the cudf decimal type along with the list of BigDecimals? Or should we just ditch BigDecimal and have the user translate them to int/long values themselves? 😄
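As an illustration of the "largest scale wins" option, here is a hypothetical sketch in plain Java; unify and the strategy itself are suggestions under discussion, not code from this PR:

```java
import java.math.BigDecimal;

public class ScaleUnify {
    // Rescale every value to the maximum scale seen, so the maximum
    // precision after rescaling can pick DECIMAL32 vs DECIMAL64.
    static BigDecimal[] unify(BigDecimal... values) {
        int maxScale = 0;
        for (BigDecimal v : values) maxScale = Math.max(maxScale, v.scale());
        BigDecimal[] out = new BigDecimal[values.length];
        // increasing the scale never needs rounding, so setScale cannot throw
        for (int i = 0; i < values.length; i++) out[i] = values[i].setScale(maxScale);
        return out;
    }

    public static void main(String[] args) {
        BigDecimal[] unified = unify(new BigDecimal("1.5"), new BigDecimal("2.25"));
        int maxPrecision = 0;
        for (BigDecimal v : unified) maxPrecision = Math.max(maxPrecision, v.precision());
        System.out.println(unified[0] + " " + unified[1] + " precision=" + maxPrecision);
    }
}
```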

Contributor

So there are two use cases here that we have to worry about. One of them is the advanced API where what we care about is efficiency. These are the ones where we are moving around ints or longs and making it all fit. In that case we are appending a row at a time, so buffering up all the decimal values in an array and then putting them in with

ColumnVector.build(DType.create(DTypeEnum.DECIMAL64, myScale), myDecimals.length, (b) -> b.appendArray(myDecimals))

is not going to work well.

The second set of APIs needs to be about testing: making it clean and simple to build exactly what we want for testing. The main point of my question is not CAN we do it, but is there a cleaner way to do it? How do you want the tests to look? The main question for me is around floating point: do we want to deal with floating point values when working with decimal, or do we just want to use Strings for it? Fun fact: when BigDecimal converts from a float it turns the float into a String first.
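That "fun fact" is observable directly: BigDecimal.valueOf(double) routes through Double.toString, while the raw double constructor does not (illustrative snippet, not code from the PR):

```java
import java.math.BigDecimal;

public class FloatToDecimal {
    public static void main(String[] args) {
        // BigDecimal.valueOf(double) goes through Double.toString, so it keeps
        // the short decimal form rather than the exact binary expansion:
        System.out.println(BigDecimal.valueOf(1.1));          // 1.1
        System.out.println(BigDecimal.valueOf(1.1).scale());  // 1
        // the raw double constructor keeps the binary expansion instead,
        // so the two are not equal (different scale and unscaled value):
        System.out.println(new BigDecimal(1.1).equals(BigDecimal.valueOf(1.1))); // false
    }
}
```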

Contributor Author

I tried to add the method

public static ColumnVector decimalFromDoubles(DType.DTypeEnum type, int scale, double... values);

to deal with floating point values. The basic idea is from @revans2's suggestion of "an API that takes a scale and precision (32/64-bit) followed by a list of floating point values".
Instead of converting floating point values to strings, decimalFromDoubles rescales the float numbers and extracts the integral part directly, since we don't have to handle float numbers that overflow int64.


/**
* Create a new vector from the given values. This API supports inline nulls,
* but is much slower than building from primitive array of unscaledValue.
* Notice all input BigDecimals should share same scale.
Contributor

I think this comment is wrong. This is for fromStrings and I don't think we are doing any string to decimal conversion right now.

java/src/main/java/ai/rapids/cudf/DType.java (outdated)
* Compared with the scale of [[java.math.BigDecimal]], the scale here has the opposite sign.
*/
public static HostColumnVector decimalFromInts(int scale, int... values) {
if (-scale > DType.DECIMAL32_MAX_PRECISION) {
Contributor

So I am a bit confused: how exactly does scale impact precision?

Let's say that we have the following:

unscaled value = 1
scale = -20

The result should be the same as unscaled_value * 10 ^ scale, or 0.00000000000000000001.

The precision is just about how many digits the unscaled value can hold. The only limit on scale we need to worry about is whether it fits in an int32_t, which this interface by definition enforces.
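The distinction above can be checked directly with java.math.BigDecimal (small illustrative snippet, not from the PR; note BigDecimal's scale sign is the opposite of cudf's):

```java
import java.math.BigDecimal;

public class ScaleVsPrecision {
    public static void main(String[] args) {
        // unscaled value 1 with a cudf scale of -20 corresponds to a
        // BigDecimal with scale 20 (opposite sign conventions):
        BigDecimal v = BigDecimal.valueOf(1, 20);
        System.out.println(v.toPlainString()); // 0.00000000000000000001
        // precision counts the digits of the unscaled value only,
        // independent of how far the scale shifts the decimal point:
        System.out.println(v.precision());     // 1
    }
}
```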

Contributor Author

@sperlingxx sperlingxx Nov 2, 2020

Yes, the scale check here is unnecessary. Basically, precision is independent from the scale. I've removed these checks. Sorry for the confusion.

java/src/main/java/ai/rapids/cudf/Scalar.java (outdated)
java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
}
((IntBuffer)buffer).put((int) scaledDouble);
} else {
if (scaledDouble > Long.MAX_VALUE || scaledDouble < Long.MIN_VALUE) {
Contributor

This does not work. A double only has 53 bits of significand; the rest is used for the sign and exponent. Let's not try to scale the double ourselves; let's have BigDecimal do it for us and then extract the value. No need to reinvent the wheel here. Plus, this API is really only supposed to be for testing, so if the conversion is a little slow it is fine.
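The 53-bit limit referred to above can be verified in plain Java (illustrative snippet):

```java
public class DoubleBits {
    public static void main(String[] args) {
        // a double has 53 bits of significand, so above 2^53 whole numbers
        // start to collide -- scaling a decimal into the long range via
        // double arithmetic silently loses digits:
        double limit = 9007199254740992.0;           // 2^53
        System.out.println(limit == limit + 1.0);    // true: 2^53 + 1 is not representable

        long big = 123456789012345678L;              // needs more than 53 bits
        System.out.println((long) (double) big == big); // false: digits were lost
    }
}
```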

Contributor Author

I've re-implemented this method with BigDecimal's APIs.
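A minimal sketch of that approach, letting BigDecimal perform the rescaling; toUnscaledLong is hypothetical, not the PR's actual implementation, and it assumes cudf's scale is the negation of BigDecimal's:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DoubleRescale {
    // Let BigDecimal do the rescaling instead of multiplying the
    // double by a power of ten ourselves.
    static long toUnscaledLong(double value, int cudfScale) {
        return BigDecimal.valueOf(value)
                .setScale(-cudfScale, RoundingMode.HALF_UP)
                .unscaledValue()
                .longValueExact(); // throws on DECIMAL64 overflow instead of truncating
    }

    public static void main(String[] args) {
        System.out.println(toUnscaledLong(1.23, -2)); // 123
    }
}
```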

java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
@sperlingxx
Contributor Author

sperlingxx commented Nov 4, 2020

Hi @jlowe, I've eliminated the implicit support for appending single ints to DECIMAL64 ColumnVectors.
And I've replaced the assertion of scale equality with safe rescaling. The corresponding change is in the method fromDecimals, in which we build the DecimalType from the rescaled precision and the minimum scale.

Member

@jlowe jlowe left a comment

Minor nit about an assert but otherwise looks good. This needs to be upmerged to resolve the merge conflict.

java/src/main/java/ai/rapids/cudf/HostColumnVector.java (outdated)
Labels
Java Affects Java cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants