[BUG] libcudf fails to recognize malformed dictionary during Parquet read #13656

jlowe · 2023-07-03T18:09:32Z

Describe the bug
Using libcudf to load a Parquet file with a malformed dictionary does not result in an error.

Steps/Code to reproduce bug
Read the nation.dict-malformed.parquet file from apache/parquet-testing with libcudf and note that it does not error.

Expected behavior
Loading this from Apache Spark or via the parquet-cli tool results in a decode error as expected:

java -cp '/home/jlowe/src/parquet-mr/parquet-cli/target/parquet-cli-1.13.1.jar:/home/jlowe/src/parquet-mr/parquet-cli/target/dependency/*' org.apache.parquet.cli.Main cat parquet-testing/data/nation.dict-malformed.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:86)
	at org.apache.parquet.cli.Main.run(Main.java:163)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.parquet.cli.Main.main(Main.java:193)
Caused by: java.lang.RuntimeException: Failed while reading Parquet file: /home/jlowe/src/spark-rapids/thirdparty/parquet-testing/data/nation.dict-malformed.parquet
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:367)
	at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:344)
	at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:342)
	at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:73)
	... 3 more
Caused by: java.io.EOFException
	at org.apache.parquet.bytes.SingleBufferInputStream.sliceBuffers(SingleBufferInputStream.java:134)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAsBytesInput(ParquetFileReader.java:1688)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1604)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1547)
	at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1157)
	at org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:993)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:940)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:1082)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:130)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
	at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:363)
	... 6 more


etseidl · 2023-07-17T23:35:42Z

It appears that it's not so much that the dictionary is corrupt, but that the file metadata in the footer does not agree with the sizes in the page headers. For the two columns that use a dictionary, the file metadata gives a size for the column chunk that is 15 bytes smaller than the sum of the sizes from the page headers. One place where this could be detected is in gpuDecodePageHeaders, where it would be possible to detect incrementing the bs->cur pointer beyond bs->end. Communicating that back to the host then becomes the challenge.
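The check described above can be sketched on the host side as follows. This is only an illustrative model, not libcudf code: `PageHeader`, `MalformedChunkError`, and `walk_pages` are hypothetical names, and the sketch assumes each page contributes its encoded header size plus the compressed payload size declared in that header. The cursor plays the role of `bs->cur` and the footer's column chunk size plays the role of `bs->end`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageHeader:
    """Hypothetical, simplified view of one decoded Parquet page header."""
    header_size: int           # bytes occupied by the encoded header itself
    compressed_page_size: int  # payload size declared inside the header


class MalformedChunkError(ValueError):
    """Raised when page headers claim more bytes than the footer allows."""


def walk_pages(chunk_size: int, headers: List[PageHeader]) -> int:
    """Advance a cursor over the pages of one column chunk.

    chunk_size is the column chunk size taken from the file footer.
    Returns the number of bytes consumed, or raises MalformedChunkError
    as soon as the cursor would move past the end of the chunk.
    """
    cur = 0
    end = chunk_size
    for index, header in enumerate(headers):
        cur += header.header_size + header.compressed_page_size
        if cur > end:
            raise MalformedChunkError(
                f"page {index} ends at byte {cur}, "
                f"but the footer says the chunk is {end} bytes"
            )
    return cur
```

With a footer size that is 15 bytes smaller than the sum taken from the page headers, as in the two dictionary columns of this file, `walk_pages` raises instead of silently reading past the end of the chunk.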

etseidl · 2023-09-27T15:16:22Z

#14167 will help with this one

Fixes #13656. Uses the error reporting introduced in #14167 to report errors in header parsing.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)

URL: #14237

jlowe added the bug, Needs Triage, libcudf, and Spark labels Jul 3, 2023

jlowe mentioned this issue Jul 3, 2023

[BUG] Parquet file with malformed dictionary does not error when loaded NVIDIA/spark-rapids#8644

Closed

GregoryKimball added this to libcudf Jul 5, 2023

GregoryKimball added the 0 - Backlog and cuIO labels and removed the Needs Triage label Jul 5, 2023

This was referenced Sep 27, 2023

Propagate errors from Parquet reader kernels back to host #14167

Merged

Detect and report errors in Parquet header parsing #14237

Merged

rapids-bot bot closed this as completed in #14237 Oct 20, 2023

jlowe mentioned this issue Oct 26, 2023

Enable malformed Parquet failure test NVIDIA/spark-rapids#9551

Merged

GregoryKimball removed this from libcudf Oct 26, 2023
