[BUG] libcudf fails to recognized malformed dictionary during Parquet read #13656

Closed
jlowe opened this issue Jul 3, 2023 · 2 comments · Fixed by #14237
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Member

jlowe commented Jul 3, 2023

Describe the bug
Using libcudf to load a Parquet file with a malformed dictionary does not result in an error.

Steps/Code to reproduce bug
Read the nation.dict-malformed.parquet file from apache/parquet-testing with libcudf and note that it does not error.

Expected behavior
Loading this from Apache Spark or via the parquet-cli tool results in a decode error as expected:

java -cp '/home/jlowe/src/parquet-mr/parquet-cli/target/parquet-cli-1.13.1.jar:/home/jlowe/src/parquet-mr/parquet-cli/target/dependency/*' org.apache.parquet.cli.Main cat parquet-testing/data/nation.dict-malformed.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
	at org.apache.parquet.cli.Main.run(Main.java:163)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:193)
Caused by: java.lang.RuntimeException: Failed while reading Parquet file: /home/jlowe/src/spark-rapids/thirdparty/parquet-testing/data/nation.dict-malformed.parquet
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:367)
	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
	... 3 more
Caused by: java.io.EOFException
	at org.apache.parquet.bytes.SingleBufferInputStream.sliceBuffers(SingleBufferInputStream.java:134)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAsBytesInput(ParquetFileReader.java:1688)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1604)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1547)
	at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1157)
	at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:993)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1082)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
	... 6 more
@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Jul 3, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Jul 5, 2023
Contributor

etseidl commented Jul 17, 2023

It appears that it's not so much that the dictionary is corrupt, but that the file metadata in the footer does not agree with the sizes in the page headers. For the two columns that use a dictionary, the file metadata gives a size for the column chunk that is 15 bytes smaller than the sum of the sizes from the page headers. One place where this could be detected is in gpuDecodePageHeaders, where it would be possible to detect incrementing the bs->cur pointer beyond bs->end. Communicating that back to the host then becomes the challenge.
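The check described above can be illustrated with a minimal host-side sketch in Python. It is not libcudf code: `PageHeader`, `find_overrun`, and the byte counts are hypothetical, chosen only to mirror the idea of a cursor (`bs->cur`) advancing past the end of the column chunk (`bs->end`) when the footer's chunk size is smaller than the sum of the sizes in the page headers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageHeader:
    """Hypothetical stand-in for a decoded Parquet page header."""
    header_size: int           # bytes occupied by the serialized header itself
    compressed_page_size: int  # payload size declared inside the header


def find_overrun(chunk_size: int, pages: List[PageHeader]) -> Optional[int]:
    """Return the index of the first page that ends past the chunk, else None.

    chunk_size plays the role of the column chunk size from the file footer;
    the running cursor plays the role of bs->cur and chunk_size of bs->end.
    """
    cur = 0
    for index, page in enumerate(pages):
        cur += page.header_size + page.compressed_page_size
        if cur > chunk_size:
            return index
    return None


# Illustrative numbers only: the headers describe 340 bytes in total,
# while the footer claims a chunk that is 15 bytes smaller.
pages = [PageHeader(20, 100), PageHeader(20, 200)]
consistent = find_overrun(340, pages)   # footer agrees with the headers
malformed = find_overrun(325, pages)    # footer is 15 bytes short
```

With these made-up sizes, the consistent chunk yields `None` and the short chunk is flagged at the second page. On the GPU the same comparison would happen inside the header-decoding kernel, which is why reporting the result back to the host is the harder part.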

Contributor

etseidl commented Sep 27, 2023

#14167 will help with this one

rapids-bot bot pushed a commit that referenced this issue Oct 20, 2023
Fixes #13656.  Uses the error reporting introduced in #14167 to report errors in header parsing.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #14237
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023