
[BUG] Failed to read Parquet file generated by GPU-enabled Spark. #347

Closed
lazykyama opened this issue Jul 13, 2020 · 5 comments
Labels: bug

lazykyama commented Jul 13, 2020

Describe the bug

The error below occurred when I tried to read a parquet file:
org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
The parquet file was generated by GPU-enabled Spark. When I read it with GPU-enabled Spark, there was no error; the failure only occurs when reading it without the GPU.

Related logs, including the stack trace, are below.

20/07/10 03:42:41 WARN BlockManager: Putting block rdd_5_0 failed due to exception org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream.
20/07/10 03:42:41 WARN BlockManager: Block rdd_5_0 could not be removed as it was not found on disk or in memory
20/07/10 03:42:41 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:728)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readInteger(VectorizedRleValuesReader.java:146)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:549)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:228)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:490)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$1.next(InMemoryRelation.scala:98)
        at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$1.next(InMemoryRelation.scala:90)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
        at org.apache.parquet.bytes.SingleBufferInputStream.read(SingleBufferInputStream.java:52)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readUnsignedVarInt(VectorizedRleValuesReader.java:647)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:701)
        ... 42 more

Steps/Code to reproduce bug

  • Data preparation
    • Download the Mortgage dataset from the RAPIDS official repo.
      • e.g.,
        • wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2001.tgz
        • tar xf mortgage_2000-2001.tgz
      • Note that this issue can be reproduced with at least 13k records; with 12k records, it did not occur.
      • The example commands I used to reduce the size of the whole dataset are below.
        • cd mortgage_2000-2001/perf
        • head -13000 Performance_2000Q1.txt > Performance_2000Q1_13k.txt
  • Converting from CSV to parquet
    • Launching spark-shell with GPU: spark-shell --master local[*] --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR} --conf spark.executor.memory=32g --conf spark.driver.memory=32g --conf spark.driver.memoryOverhead=4g --conf spark.executor.memoryOverhead=4g --num-executors 1 --conf spark.executor.cores=1 --conf spark.rapids.sql.concurrentGpuTasks=1 --conf spark.rapids.memory.pinnedPool.size=2G --conf spark.locality.wait=0s --conf spark.sql.files.maxPartitionBytes=512m --conf spark.sql.shuffle.partitions=10 --conf spark.plugins=com.nvidia.spark.SQLPlugin
    • Running scripts: https://gist.github.com/lazykyama/1b6831d4b7b6381ed4e2c9348f55aa5d#file-convert-scala
  • Reading the converted parquet file without the GPU (see the sketch after this list)
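
For reference, a minimal sketch of the conversion and read-back steps (the real conversion script is in the gist above; the delimiter, paths, and use of schema inference here are my assumptions, not the exact script):

```scala
// Hypothetical sketch of both steps; paths and options are assumptions.
// Step 1 (GPU-enabled spark-shell): convert the trimmed CSV to parquet.
// The mortgage performance files are pipe-delimited.
val perf = spark.read
  .option("sep", "|")
  .option("inferSchema", "true")
  .csv("mortgage_2000-2001/perf/Performance_2000Q1_13k.txt")
perf.write.mode("overwrite").parquet("/tmp/perf_13k.parquet")

// Step 2 (plain CPU-only spark-shell): read it back. The .cache() mirrors
// the InMemoryRelation frames in the stack trace above.
val readBack = spark.read.parquet("/tmp/perf_13k.parquet")
readBack.cache()
readBack.count() // fails with ParquetDecodingException on the GPU-written file
```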

Expected behavior

No error occurs when reading the converted parquet file.

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark configuration settings related to the issue
    • Spark binaries were downloaded via wget https://ftp.jaist.ac.jp/pub/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
    • Spark itself is running on RAPIDS docker container: nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04
    • echo ${SPARK_CUDF_JAR} => /ws/rapids/jars//cudf-0.14-cuda10-2.jar
    • echo ${SPARK_RAPIDS_PLUGIN_JAR} => /ws/rapids/jars//rapids-4-spark_2.12-0.1.0.jar
    • Other env info:
      • OS: Ubuntu 18.04.3
      • CPU: Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz
      • GPU: TITAN X (Pascal)

Additional context

N/A

lazykyama added the "? - Needs Triage" and "bug" labels on Jul 13, 2020
revans2 self-assigned this on Jul 13, 2020
revans2 (Collaborator) commented Jul 13, 2020

I was able to reproduce the error. I'll try to dig into what is happening now.

revans2 (Collaborator) commented Jul 13, 2020

When I use the official parquet-tools on the produced file, I get an error like:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 11809 in block 0 in file file:/home/roberte/src/rapids-plugin-4-spark/target/tmp.parquet/part-00000-9f03a80a-33f3-49ce-b796-69f9c90d5694-c000.snappy.parquet

Specifically, the problem appears to be in this column:

DOUBLE non_interest_bearing_upb 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 13000 *** 

This looks like a cudf-related bug, so I will file another issue there.
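
A quick way to confirm that the failure is isolated to this column (a sketch; the path is hypothetical, and Parquet's columnar layout means only the selected columns are decoded):

```scala
// Hypothetical check in a CPU-only spark-shell.
val df = spark.read.parquet("/tmp/perf_13k.parquet")
// Decoding every other column should succeed if only this one is corrupt.
df.drop("non_interest_bearing_upb").collect()
// Forcing a decode of the suspect column alone should reproduce the exception.
df.select("non_interest_bearing_upb").collect()
```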

revans2 (Collaborator) commented Jul 13, 2020

I filed an issue in cuDF for this. Once it is fixed, I will close this one.

revans2 (Collaborator) commented Jul 30, 2020

The underlying cuDF issue was fixed in 0.15. When I get a chance, I will try to verify that it is fixed for this use case; if so, I will close this issue for the 0.2.0 release, which will be based on cudf-0.15.
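
Verification would presumably mean regenerating the file with a plugin build based on cudf-0.15 and re-reading it without the GPU; a sketch (the path is hypothetical):

```scala
// Hypothetical re-check in a CPU-only spark-shell, after regenerating the
// parquet file with the cudf-0.15-based plugin build.
val regenerated = spark.read.parquet("/tmp/perf_13k_cudf015.parquet")
regenerated.collect() // should no longer throw ParquetDecodingException
```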

revans2 added this to the "Jul 20 - Jul 31" milestone on Jul 30, 2020
revans2 (Collaborator) commented Jul 30, 2020

I was able to manually verify that this is fixed now.

revans2 closed this as completed on Jul 30, 2020
sameerz removed the "? - Needs Triage" label on Aug 2, 2020
pxLi pushed a commit to pxLi/spark-rapids that referenced this issue May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023