
[BUG] Failed to read Parquet file generated by GPU-enabled Spark. #347

Closed
lazykyama opened this issue Jul 13, 2020 · 5 comments
Labels: bug

lazykyama commented Jul 13, 2020

Describe the bug

The error below occurred when I tried to read a parquet file:
org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
The parquet file was generated by GPU-enabled Spark. When I read it with GPU-enabled Spark, there was no error; the failure only occurs when reading it without the GPU.

Related logs, including the stack trace, are below.

20/07/10 03:42:41 WARN BlockManager: Putting block rdd_5_0 failed due to exception org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream.
20/07/10 03:42:41 WARN BlockManager: Block rdd_5_0 could not be removed as it was not found on disk or in memory
20/07/10 03:42:41 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:728)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readInteger(VectorizedRleValuesReader.java:146)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readIntegers(VectorizedRleValuesReader.java:549)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:228)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:490)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$1.next(InMemoryRelation.scala:98)
        at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anon$1.next(InMemoryRelation.scala:90)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
        at org.apache.parquet.bytes.SingleBufferInputStream.read(SingleBufferInputStream.java:52)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readUnsignedVarInt(VectorizedRleValuesReader.java:647)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readNextGroup(VectorizedRleValuesReader.java:701)
        ... 42 more

Steps/Code to reproduce bug

  • Data preparation
    • Download the Mortgage dataset from the RAPIDS official repo.
      • e.g.,
        • wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2001.tgz
        • tar xf mortgage_2000-2001.tgz
      • Note that this issue can be reproduced with at least 13k records; with 12k records, it did not occur.
      • The example commands I used to reduce the size of the whole dataset are below.
        • cd mortgage_2000-2001/perf
        • head -13000 Performance_2000Q1.txt > Performance_2000Q1_13k.txt
  • Converting from CSV to parquet
    • Launching spark-shell with GPU: spark-shell --master local[*] --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR} --conf spark.executor.memory=32g --conf spark.driver.memory=32g --conf spark.driver.memoryOverhead=4g --conf spark.executor.memoryOverhead=4g --num-executors 1 --conf spark.executor.cores=1 --conf spark.rapids.sql.concurrentGpuTasks=1 --conf spark.rapids.memory.pinnedPool.size=2G --conf spark.locality.wait=0s --conf spark.sql.files.maxPartitionBytes=512m --conf spark.sql.shuffle.partitions=10 --conf spark.plugins=com.nvidia.spark.SQLPlugin
    • Running scripts: https://gist.github.com/lazykyama/1b6831d4b7b6381ed4e2c9348f55aa5d#file-convert-scala
  • Reading the converted parquet file without the GPU (see the sketch after this list)
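
For reference, a minimal sketch of the conversion and read-back steps (the real conversion script is in the gist above; the delimiter, paths, and use of schema inference here are my assumptions, not the exact script):

```scala
// Hypothetical sketch of both steps; paths and options are assumptions.
// Step 1 (GPU-enabled spark-shell): convert the trimmed CSV to parquet.
// The mortgage performance files are pipe-delimited.
val perf = spark.read
  .option("sep", "|")
  .option("inferSchema", "true")
  .csv("mortgage_2000-2001/perf/Performance_2000Q1_13k.txt")
perf.write.mode("overwrite").parquet("/tmp/perf_13k.parquet")

// Step 2 (plain CPU-only spark-shell): read it back. The .cache() mirrors
// the InMemoryRelation frames in the stack trace above.
val readBack = spark.read.parquet("/tmp/perf_13k.parquet")
readBack.cache()
readBack.count() // fails with ParquetDecodingException on the GPU-written file
```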

Expected behavior

No error occurs when reading the converted parquet file.

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark configuration settings related to the issue
    • Spark binaries were downloaded via wget https://ftp.jaist.ac.jp/pub/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
    • Spark itself is running on RAPIDS docker container: nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04
    • echo ${SPARK_CUDF_JAR} => /ws/rapids/jars//cudf-0.14-cuda10-2.jar
    • echo ${SPARK_RAPIDS_PLUGIN_JAR} => /ws/rapids/jars//rapids-4-spark_2.12-0.1.0.jar
    • Other env info:
      • OS: Ubuntu 18.04.3
      • CPU: Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz
      • GPU: TITAN X (Pascal)

Additional context

N/A

lazykyama added the "? - Needs Triage" and "bug" labels on Jul 13, 2020
revans2 self-assigned this on Jul 13, 2020
revans2 (Collaborator) commented Jul 13, 2020

I was able to reproduce the error. I'll try to dig into what is happening now.

revans2 (Collaborator) commented Jul 13, 2020

When I use the official parquet-tools on the produced file, I get an error like:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 11809 in block 0 in file file:/home/roberte/src/rapids-plugin-4-spark/target/tmp.parquet/part-00000-9f03a80a-33f3-49ce-b796-69f9c90d5694-c000.snappy.parquet

Specifically, the problem appears to be in this column:

DOUBLE non_interest_bearing_upb 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 13000 *** 

This looks like a cudf-related bug, so I will file another issue there.
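
A quick way to confirm that the failure is isolated to this column (a sketch; the path is hypothetical, and Parquet's columnar layout means only the selected columns are decoded):

```scala
// Hypothetical check in a CPU-only spark-shell.
val df = spark.read.parquet("/tmp/perf_13k.parquet")
// Decoding every other column should succeed if only this one is corrupt.
df.drop("non_interest_bearing_upb").collect()
// Forcing a decode of the suspect column alone should reproduce the exception.
df.select("non_interest_bearing_upb").collect()
```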

revans2 (Collaborator) commented Jul 13, 2020

I filed an issue in cuDF for this. Once it is fixed, I will close this one.

revans2 (Collaborator) commented Jul 30, 2020

The underlying cuDF issue was fixed in 0.15. When I get a chance, I will try to verify that it is fixed for this use case; if so, I will close this issue for the 0.2.0 release, which will be based on cudf-0.15.
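
Verification would presumably mean regenerating the file with a plugin build based on cudf-0.15 and re-reading it without the GPU; a sketch (the path is hypothetical):

```scala
// Hypothetical re-check in a CPU-only spark-shell, after regenerating the
// parquet file with the cudf-0.15-based plugin build.
val regenerated = spark.read.parquet("/tmp/perf_13k_cudf015.parquet")
regenerated.collect() // should no longer throw ParquetDecodingException
```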

revans2 added this to the "Jul 20 - Jul 31" milestone on Jul 30, 2020
revans2 (Collaborator) commented Jul 30, 2020

I was able to manually verify that this is fixed now.

revans2 closed this as completed on Jul 30, 2020
sameerz removed the "? - Needs Triage" label on Aug 2, 2020
pxLi pushed a commit to pxLi/spark-rapids that referenced this issue May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023