
Support storing precision of decimal types in Schema class #17176

Merged
8 commits merged into rapidsai:branch-24.12 on Oct 29, 2024

Conversation

@ttnghia (Contributor) commented Oct 24, 2024

In Spark, the DecimalType has a specific number of digits (precision) to represent its numbers. However, when creating a data Schema, only the type and name of each column are stored, so that precision information is lost. As a result, it is difficult to reconstruct the original decimal types from a cudf Schema instance.

This PR adds a precision member variable to the Schema class in cudf Java, allowing it to store the precision of the original decimal column.
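For illustration, a minimal sketch of what a precision-carrying schema column could look like. The names here (SchemaColumn, getPrecision) are hypothetical, not the actual cudf Java API:

```java
// Hypothetical sketch of a schema column that carries decimal precision
// as side metadata. Names are illustrative, not the cudf Java API.
public final class SchemaColumn {
    private final String name;
    private final int scale;      // decimal scale, which cudf does store
    private final int precision;  // Spark-side precision, metadata only

    public SchemaColumn(String name, int scale, int precision) {
        this.name = name;
        this.scale = scale;
        this.precision = precision;
    }

    // The precision is never passed to libcudf; it exists only so the
    // JNI layer can reconstruct the original Spark DecimalType.
    public int getPrecision() {
        return precision;
    }
}
```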

Partially contributes to NVIDIA/spark-rapids#11560.

@ttnghia ttnghia added labels: feature request (New feature or request), 3 - Ready for Review (Ready for review by team), Java (Affects Java cuDF API), Spark (Functionality that helps Spark RAPIDS), non-breaking (Non-breaking change) Oct 24, 2024
@ttnghia ttnghia self-assigned this Oct 24, 2024
@ttnghia ttnghia requested a review from a team as a code owner October 24, 2024 23:24
@ttnghia ttnghia changed the title Add precision variable for DType class in DType.java Support storing precision of decimal type in DType and Schema classes Oct 24, 2024
@ttnghia ttnghia requested a review from revans2 October 25, 2024 05:03
@revans2 (Contributor) left a comment

I don't think that this is going to be good from a design standpoint. I also don't think that this solves the issue you are describing.

CUDF does not store precision with its decimal type, so if we round-trip the type through CUDF and back (say, in a LIST of DECIMALs), the precision will be lost. That is totally unexpected for a user. CUDF also will not enforce this precision in any way, or propagate it when doing computation. This precision is just metadata that is going to be thrown away/ignored by CUDF. This violates the principle of least surprise.

We also have ways to include precision for the few places where CUDF uses it (writing Parquet/ORC).

I don't see any value in doing this unless CUDF is going to truly support precision.

@ttnghia (Contributor, Author) commented Oct 25, 2024

We need to convert a Spark schema into a cudf schema. When reading JSON, we also need to convert strings to decimals using the precision from the Spark DecimalType. Without storing precision, we would need to create a separate "schema" that stores only column precision values and pass it along to cudf JNI to do the conversion. Storing precision inside DType would provide more information about the type, although it is not meant to be used by libcudf.
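On the Spark side, the precision and scale are readily available on DecimalType; the question is only where to carry them. A minimal sketch, assuming a Spark dependency (the wrapper class below is illustrative, not part of any library):

```java
// Pulling precision/scale out of a Spark DecimalType so they can be
// carried alongside a cudf schema. The wrapper type is illustrative.
import org.apache.spark.sql.types.DecimalType;

final class DecimalTypeInfo {
    final int precision;
    final int scale;

    DecimalTypeInfo(DecimalType t) {
        this.precision = t.precision(); // e.g. 10 for DECIMAL(10, 2)
        this.scale = t.scale();         // e.g. 2
    }
}
```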

@revans2 (Contributor) commented Oct 25, 2024

> We need to convert a Spark schema into a cudf schema. When reading JSON, we also need to convert strings to decimals using the precision from the Spark DecimalType. [...]

The issue I have with this is that it violates the principle of least surprise. I get that there are use cases where the code will be simpler/cleaner if we can put the precision in with the DType. It would be a lot cleaner if we could have the precision in the DType when we want to write a Parquet or ORC file. But, in my opinion, those benefits don't outweigh the harm caused when someone expects the precision to be properly reflected everywhere and in reality it is not. For example, when:

  • they read a decimal value from Parquet/ORC and the precision stored in those files does not show up in the DType;
  • they do some kind of mathematical operation on a decimal value, the precision is ignored, and the resulting precision is not technically correct;
  • they create a nested column with a given precision and that precision is lost when they ask for it back;
  • they write a Parquet file and the precision is ignored, because you have to provide it through a different API;
  • there are no errors/warnings if the precision violates what the underlying data type can actually hold.

Also, technically, a precision of 0 is valid (at least in Spark). It can only ever hold the value 0 or null, so it is close to useless. But it is valid.

@ttnghia (Contributor, Author) commented Oct 25, 2024

Alright, then I'll close this to avoid producing more "surprise". Thanks Bobby.

@ttnghia ttnghia closed this Oct 25, 2024
@ttnghia (Contributor, Author) commented Oct 25, 2024

Reopening, as this can be implemented with changes only in Schema.

@ttnghia ttnghia reopened this Oct 25, 2024
@ttnghia ttnghia changed the title Support storing precision of decimal type in DType and Schema classes Support storing precision of decimal types Schema class Oct 25, 2024
@ttnghia ttnghia changed the title Support storing precision of decimal types Schema class Support storing precision of decimal types in Schema class Oct 25, 2024
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@revans2 (Contributor) commented Oct 28, 2024

This is still fundamentally the same issue as before. There are no APIs in CUDF that take a Schema which will use the precision. Schema is used by readJSON and readCSV. If I ask them to return a column with a precision of 5 for a DECIMAL32 with a scale of 0, what would you expect them to do in that case? Would you expect them to ignore the precision request? I wouldn't. But with that said, CUDF ignores the DECIMAL part of it anyway and just uses the types as suggestions. I know that with pruning some of that is supposed to change.
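For reference, the call path being described looks roughly like this. This is a sketch only: the exact ai.rapids.cudf signatures, including JSONOptions.DEFAULT and this readJSON overload, are assumptions and may differ between versions:

```java
// Rough sketch of building a Schema and handing it to readJSON. The
// option setup and method signatures are assumptions, not verified API.
import ai.rapids.cudf.DType;
import ai.rapids.cudf.JSONOptions;
import ai.rapids.cudf.Schema;
import ai.rapids.cudf.Table;

import java.io.File;

public class ReadJsonSketch {
    public static void main(String[] args) {
        Schema schema = Schema.builder()
            // cudf decimals carry only a scale; scale -2 means the
            // stored integer is interpreted as unscaled * 10^-2.
            // There is no slot here for a precision like 5.
            .column(DType.create(DType.DTypeEnum.DECIMAL32, -2), "price")
            .column(DType.STRING, "name")
            .build();
        try (Table t = Table.readJSON(schema, JSONOptions.DEFAULT,
                                      new File("input.json"))) {
            System.out.println("rows read: " + t.getRowCount());
        }
    }
}
```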

This is better because the schema here is not going to be used to round-trip information to CUDF and back. But it still is fundamentally broken. We are making a change to CUDF for something that CUDF does not and probably never will support. It is here so that some other library, spark-rapids-jni, can provide a simpler API for functionality that goes beyond what CUDF supports. I am not going to fight this any more. This does not break things too horribly. But at a minimum we have to document that precision is completely and totally ignored if it is set.

Signed-off-by: Nghia Truong <[email protected]>
@ttnghia (Contributor, Author) commented Oct 28, 2024

Thanks Bobby. Yes, I understand that this is not a good design, but in the meantime we don't seem to have a better solution. The only workaround I can think of is to keep a separate flattened array of precisions for all columns alongside the nested schema, but that is more error-prone (see the sketch below).
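To make that trade-off concrete, here is a minimal sketch of the flattened-array workaround. All names (Col, flatten) are illustrative, not real cudf types:

```java
// Sketch of the workaround: a pre-order flattened array of precisions
// kept next to the nested schema. All names here are illustrative.
import java.util.List;

final class PrecisionFlattener {
    // Toy nested column descriptor, for illustration only.
    static final class Col {
        final int precision;    // e.g. -1 for non-decimal columns
        final List<Col> children;
        Col(int precision, List<Col> children) {
            this.precision = precision;
            this.children = children;
        }
    }

    // The JNI side must walk the schema in exactly this order to match
    // precisions back to columns, which is why a separate array is more
    // error-prone than storing precision inside the schema itself.
    static void flatten(Col c, List<Integer> out) {
        out.add(c.precision);
        for (Col child : c.children) {
            flatten(child, out);
        }
    }
}
```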

Update: I've added the docs, clearly stating that precision is stored only for the JNI layer to do its work (a43b58d).

@revans2 (Contributor) left a comment

I still don't like it, but like I said before I am done fighting it.

@ttnghia (Contributor, Author) commented Oct 29, 2024

/merge

@rapids-bot rapids-bot bot merged commit ddfb284 into rapidsai:branch-24.12 Oct 29, 2024
85 checks passed
@ttnghia ttnghia deleted the add_precision branch October 29, 2024 16:51