Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decimal128 support for Parquet #4362

Merged
merged 12 commits into from
Jan 5, 2022
Merged

Conversation

kuhushukla
Copy link
Collaborator

This is not ready for complete review as we need more tests and support for uint64.

Signed-off-by: Kuhu Shukla <[email protected]>
@sperlingxx
Copy link
Collaborator

It would be nice to add some tests for nested types of decimal128.

@kuhushukla
Copy link
Collaborator Author

It would be nice to add some tests for nested types of decimal128.

Oops. I think I missed a commit. I'm in the process of rebasing. Will update tomorrow with more tests

@sameerz sameerz added the feature request New feature or request label Dec 15, 2021
@sameerz sameerz added this to the Dec 13 - Jan 7 milestone Dec 15, 2021
@sameerz sameerz linked an issue Dec 15, 2021 that may be closed by this pull request
@revans2
Copy link
Collaborator

revans2 commented Dec 16, 2021

Looks good is there any way to get some parquet files with unsigned longs in them to test that code?

sperlingxx
sperlingxx previously approved these changes Dec 20, 2021
Copy link
Collaborator

@sperlingxx sperlingxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good generally.
NIT: Can you add some comments about unsigned 64 conversion?

@@ -796,7 +797,8 @@ trait ParquetPartitionReaderBase extends Logging with Arm with ScanWithMetrics
}

def needDecimalCast(cv: ColumnView, dt: DataType): Boolean = {
cv.getType.isDecimalType && !GpuColumnVector.getNonNestedRapidsType(dt).equals(cv.getType())
cv.getType.isDecimalType && !GpuColumnVector.getNonNestedRapidsType(dt).equals(cv.getType()) ||
cv.getType.equals(DType.UINT64)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comments to elaborate why UINT64 needs to be casted to decimal here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I can :) Coming up shortly.

@kuhushukla
Copy link
Collaborator Author

Looks good is there any way to get some parquet files with unsigned longs in them to test that code?

Yes that would be nice, isn't it? I will try and make that work.

revans2
revans2 previously approved these changes Dec 21, 2021
@kuhushukla
Copy link
Collaborator Author

I cant find a simple way to get uint64s on the fly for testing. Can use any ideas I might have missed? I see we had to check in files earlier as well possibly for the same limitation.

@kuhushukla kuhushukla marked this pull request as ready for review December 21, 2021 15:13
@kuhushukla kuhushukla changed the title WIP. Decimal128 support for Parquet Decimal128 support for Parquet Dec 21, 2021
@kuhushukla
Copy link
Collaborator Author

build

@revans2
Copy link
Collaborator

revans2 commented Dec 21, 2021

FYI:

[2021-12-21T16:18:06.489Z] error file=/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3604/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala message=File line length exceeds 100 characters line=827

@kuhushukla
Copy link
Collaborator Author

@revans2 Pushed commit to fix the missing piece from verify.

@kuhushukla
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Dec 21, 2021
@sperlingxx
Copy link
Collaborator

Hi @kuhushukla, I found this PR also addressed #3475. So, could you add a test case to verify that we are already able to read UINT_64 column from parquet files ?

@kuhushukla
Copy link
Collaborator Author

Adding a parquet file to test UINT64.

== Physical Plan ==
GpuColumnarToRow false
+- GpuShuffleCoalesce 2147483647
   +- GpuColumnarExchange gpusinglepartitioning$(), REPARTITION_BY_NUM, [id=#112465]
      +- GpuFileGpuScan parquet [simple_uint64#39706,arr_uint64#39707] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/kuhus/Reps/rapids-plugin-4-spark/tests/target/spark320/test-classes/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<simple_uint64:decimal(20,0),arr_uint64:array<decimal(20,0)>>

I created the file using pyarrow.

@kuhushukla
Copy link
Collaborator Author

@sperlingxx Please take a look.

Signed-off-by: Kuhu Shukla <[email protected]>
@kuhushukla kuhushukla changed the title Decimal128 support for Parquet [databricks] Decimal128 support for Parquet Dec 22, 2021
@kuhushukla
Copy link
Collaborator Author

build

@kuhushukla
Copy link
Collaborator Author

Looking at the build failure with 32x shims now.

@jlowe
Copy link
Member

jlowe commented Dec 22, 2021

The build failure should be fixed by #4427.

@sperlingxx
Copy link
Collaborator

LGTM

@kuhushukla
Copy link
Collaborator Author

build

@kuhushukla
Copy link
Collaborator Author

build

@kuhushukla
Copy link
Collaborator Author

build

@kuhushukla kuhushukla changed the title [databricks] Decimal128 support for Parquet Decimal128 support for Parquet Jan 4, 2022
@kuhushukla
Copy link
Collaborator Author

Running the normal - non databricks premerge build for sanity. Also pinging @sperlingxx for the additional look over and approval.

@kuhushukla
Copy link
Collaborator Author

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
5 participants