
Support storing precision of decimal types in Schema class #17176

Merged
8 commits merged into rapidsai:branch-24.12 on Oct 29, 2024

Conversation

@ttnghia (Contributor) commented Oct 24, 2024

In Spark, the DecimalType has a specific number of digits (precision) to represent its numbers. However, when creating a data Schema, only the type and name of each column are stored, so that precision information is lost. As a result, it is difficult to reconstruct the original decimal types from a cudf Schema instance.

This PR adds a precision member variable to the Schema class in cudf Java, allowing it to store the precision of the original decimal column.
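For illustration, a minimal sketch of what a precision-carrying schema column could look like. The names here (SchemaColumn, getPrecision) are hypothetical, not the actual cudf Java API:

```java
// Hypothetical sketch of a schema column that carries decimal precision
// as side metadata. Names are illustrative, not the cudf Java API.
public final class SchemaColumn {
    private final String name;
    private final int scale;      // decimal scale, which cudf does store
    private final int precision;  // Spark-side precision, metadata only

    public SchemaColumn(String name, int scale, int precision) {
        this.name = name;
        this.scale = scale;
        this.precision = precision;
    }

    // The precision is never passed to libcudf; it exists only so the
    // JNI layer can reconstruct the original Spark DecimalType.
    public int getPrecision() {
        return precision;
    }
}
```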

Partially contributes to NVIDIA/spark-rapids#11560.

@ttnghia ttnghia added labels: feature request (New feature or request), 3 - Ready for Review (Ready for review by team), Java (Affects Java cuDF API), Spark (Functionality that helps Spark RAPIDS), non-breaking (Non-breaking change) Oct 24, 2024
@ttnghia ttnghia self-assigned this Oct 24, 2024
@ttnghia ttnghia requested a review from a team as a code owner October 24, 2024 23:24
@ttnghia ttnghia changed the title Add precision variable for DType class in DType.java Support storing precision of decimal type in DType and Schema classes Oct 24, 2024
@ttnghia ttnghia requested a review from revans2 October 25, 2024 05:03
@revans2 (Contributor) left a comment

I don't think that this is going to be good from a design standpoint. I also don't think that this solves the issue you are describing.

CUDF does not store precision with its decimal type, so if we round-trip the type through CUDF and back (say, in a LIST of DECIMALs), the precision will be lost. That is totally unexpected for a user. CUDF also will not enforce this precision in any way, or propagate it when doing computation. This precision is just metadata that is going to be thrown away/ignored by CUDF. This violates the principle of least surprise.

We also have ways to include precision for the few places where CUDF uses it (writing Parquet/ORC).

I don't see any value in doing this unless CUDF is going to truly support precision.

@ttnghia (Contributor, Author) commented Oct 25, 2024

We need to convert a Spark schema into a cudf schema. When reading JSON, we also need to convert strings to decimals using the precision from the Spark DecimalType. Without storing precision, we would need to create a separate "schema" that stores only column precision values and pass it along to cudf JNI to do the conversion. Storing precision inside DType would provide more information about the type, although it is not meant to be used by libcudf.
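On the Spark side, the precision and scale are readily available on DecimalType; the question is only where to carry them. A minimal sketch, assuming a Spark dependency (the wrapper class below is illustrative, not part of any library):

```java
// Pulling precision/scale out of a Spark DecimalType so they can be
// carried alongside a cudf schema. The wrapper type is illustrative.
import org.apache.spark.sql.types.DecimalType;

final class DecimalTypeInfo {
    final int precision;
    final int scale;

    DecimalTypeInfo(DecimalType t) {
        this.precision = t.precision(); // e.g. 10 for DECIMAL(10, 2)
        this.scale = t.scale();         // e.g. 2
    }
}
```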

@revans2 (Contributor) commented Oct 25, 2024

> We need to convert a Spark schema into a cudf schema. When reading JSON, we also need to convert strings to decimals using the precision from the Spark DecimalType. [...]

The issue I have with this is that it violates the principle of least surprise. I get that there are use cases where the code will be simpler/cleaner if we can put the precision in with the DType. It would be a lot cleaner if we could have the precision in the DType when we want to write a Parquet or ORC file. But, in my opinion, those benefits don't outweigh the harm caused when someone expects the precision to be properly reflected everywhere and in reality it is not. For example, when:

  • they read a decimal value from Parquet/ORC and the precision stored in those files does not show up in the DType;
  • they do some kind of mathematical operation on a decimal value, the precision is ignored, and the resulting precision is not technically correct;
  • they create a nested column with a given precision and that precision is lost when they ask for it back;
  • they write a Parquet file and the precision is ignored, because you have to provide it through a different API;
  • there are no errors/warnings if the precision violates what the underlying data type can actually hold.

Also, technically, a precision of 0 is valid (at least in Spark). It can only ever hold the value 0 or null, so it is close to useless. But it is valid.

@ttnghia (Contributor, Author) commented Oct 25, 2024

Alright, then I'll close this to avoid producing more "surprise". Thanks Bobby.

@ttnghia ttnghia closed this Oct 25, 2024
@ttnghia (Contributor, Author) commented Oct 25, 2024

Reopening, as this can be implemented with changes only in Schema.

@ttnghia ttnghia reopened this Oct 25, 2024
@ttnghia ttnghia changed the title Support storing precision of decimal type in DType and Schema classes Support storing precision of decimal types Schema class Oct 25, 2024
@ttnghia ttnghia changed the title Support storing precision of decimal types Schema class Support storing precision of decimal types in Schema class Oct 25, 2024
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@revans2 (Contributor) commented Oct 28, 2024

This is still fundamentally the same issue as before. There are no APIs in CUDF that take a Schema which will use the precision. Schema is used by readJSON and readCSV. If I ask them to return a column with a precision of 5 for a DECIMAL32 with a scale of 0, what would you expect them to do in that case? Would you expect them to ignore the precision request? I wouldn't. But with that said, CUDF ignores the DECIMAL part of it anyway and just uses the types as suggestions. I know that with pruning some of that is supposed to change.
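For reference, the call path being described looks roughly like this. This is a sketch only: the exact ai.rapids.cudf signatures, including JSONOptions.DEFAULT and this readJSON overload, are assumptions and may differ between versions:

```java
// Rough sketch of building a Schema and handing it to readJSON. The
// option setup and method signatures are assumptions, not verified API.
import ai.rapids.cudf.DType;
import ai.rapids.cudf.JSONOptions;
import ai.rapids.cudf.Schema;
import ai.rapids.cudf.Table;

import java.io.File;

public class ReadJsonSketch {
    public static void main(String[] args) {
        Schema schema = Schema.builder()
            // cudf decimals carry only a scale; scale -2 means the
            // stored integer is interpreted as unscaled * 10^-2.
            // There is no slot here for a precision like 5.
            .column(DType.create(DType.DTypeEnum.DECIMAL32, -2), "price")
            .column(DType.STRING, "name")
            .build();
        try (Table t = Table.readJSON(schema, JSONOptions.DEFAULT,
                                      new File("input.json"))) {
            System.out.println("rows read: " + t.getRowCount());
        }
    }
}
```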

This is better because the schema here is not going to be used to round-trip information to CUDF and back. But it still is fundamentally broken. We are making a change to CUDF for something that CUDF does not and probably never will support. It is here so that some other library, spark-rapids-jni, can provide a simpler API for functionality that goes beyond what CUDF supports. I am not going to fight this any more. This does not break things too horribly. But at a minimum we have to document that precision is completely and totally ignored if it is set.

Signed-off-by: Nghia Truong <[email protected]>
@ttnghia (Contributor, Author) commented Oct 28, 2024

Thanks Bobby. Yes, I understand that this is not a good design, but in the meantime we don't seem to have a better solution. The only workaround I can think of is to keep a separate flattened array of precisions for all columns alongside the nested schema, but that is more error-prone (see the sketch below).
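To make that trade-off concrete, here is a minimal sketch of the flattened-array workaround. All names (Col, flatten) are illustrative, not real cudf types:

```java
// Sketch of the workaround: a pre-order flattened array of precisions
// kept next to the nested schema. All names here are illustrative.
import java.util.List;

final class PrecisionFlattener {
    // Toy nested column descriptor, for illustration only.
    static final class Col {
        final int precision;    // e.g. -1 for non-decimal columns
        final List<Col> children;
        Col(int precision, List<Col> children) {
            this.precision = precision;
            this.children = children;
        }
    }

    // The JNI side must walk the schema in exactly this order to match
    // precisions back to columns, which is why a separate array is more
    // error-prone than storing precision inside the schema itself.
    static void flatten(Col c, List<Integer> out) {
        out.add(c.precision);
        for (Col child : c.children) {
            flatten(child, out);
        }
    }
}
```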

Update: I've added the docs, clearly stating that precision is stored only for the JNI layer to do its work (a43b58d).

@revans2 (Contributor) left a comment

I still don't like it, but like I said before I am done fighting it.

@ttnghia (Contributor, Author) commented Oct 29, 2024

/merge

@rapids-bot rapids-bot bot merged commit ddfb284 into rapidsai:branch-24.12 Oct 29, 2024
85 checks passed
@ttnghia ttnghia deleted the add_precision branch October 29, 2024 16:51