[BUG] Failed to read Parquet file generated by GPU-enabled Spark. #347
Comments
I was able to reproduce the error. I'll try to dig into what is happening now.
So when I use the official parquet tools on the file produced, I get an error like
Specifically, it appears to be with
This looks like a cudf-related bug, so I will file another issue there.
I filed the above issue in CUDF for this. Once it is fixed, I will close this one.
The underlying CUDF issue was fixed in 0.15. When I get a chance, I will try to verify that it is fixed for this use case, and if so I will close this issue for the 0.2.0 release, which will be based on cudf-0.15.
I was able to manually verify that this is fixed now.
Describe the bug
The error below occurred when I tried to read a Parquet file:
org.apache.parquet.io.ParquetDecodingException: Failed to read from input stream
The Parquet file was generated by GPU-enabled Spark, and I tried to read it without the GPU; when I read it with GPU-enabled Spark, there was no error. Related logs, including the stack trace, are below.
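For reference, a minimal sketch of the read that triggers the exception, assuming a hypothetical output path for the GPU-written file (the actual path is not given in the report):

// In the CPU-only spark-shell (no RAPIDS plugin on the classpath),
// reading the GPU-written file is where ParquetDecodingException is raised.
val df = spark.read.parquet("/data/mortgage_parquet/perf_13k")
df.show(5)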
Steps/Code to reproduce bug
wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2001.tgz
tar xf mortgage_2000-2001.tgz
cd mortgage_2000-2001/perf
head -13000 Performance_2000Q1.txt > Performance_2000Q1_13k.txt
spark-shell with GPU:
spark-shell --master local[*] --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR} --conf spark.executor.memory=32g --conf spark.driver.memory=32g --conf spark.driver.memoryOverhead=4g --conf spark.executor.memoryOverhead=4g --num-executors 1 --conf spark.executor.cores=1 --conf spark.rapids.sql.concurrentGpuTasks=1 --conf spark.rapids.memory.pinnedPool.size=2G --conf spark.locality.wait=0s --conf spark.sql.files.maxPartitionBytes=512m --conf spark.sql.shuffle.partitions=10 --conf spark.plugins=com.nvidia.spark.SQLPlugin
spark-shell without GPU:
spark-shell --master local[*] --conf spark.executor.memory=32g --conf spark.driver.memory=32g --conf spark.driver.memoryOverhead=4g --conf spark.executor.memoryOverhead=4g
In the GPU shell, convert the trimmed text file to Parquet; in the CPU shell, read that Parquet file back, as sketched below.
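The exact shell input is not included in the report; a minimal sketch, assuming pipe-delimited performance data, placeholder paths, and schema inference in place of the real mortgage schema, might look like:

// GPU-enabled spark-shell: convert the trimmed text file to Parquet.
// The mortgage performance data is assumed pipe-delimited; paths are placeholders.
val perf = spark.read
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .csv("/data/mortgage_2000-2001/perf/Performance_2000Q1_13k.txt")
perf.write.mode("overwrite").parquet("/data/mortgage_parquet/perf_13k")

// CPU-only spark-shell: read the file back; this is the read that
// failed with ParquetDecodingException before the fix.
val readBack = spark.read.parquet("/data/mortgage_parquet/perf_13k")
readBack.count()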
Expected behavior
No error occurs when reading the converted Parquet file.
Environment details (please complete the following information)
Spark: spark-3.0.0-bin-hadoop3.2 (wget https://ftp.jaist.ac.jp/pub/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz)
Container image: nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04
echo ${SPARK_CUDF_JAR} => /ws/rapids/jars//cudf-0.14-cuda10-2.jar
echo ${SPARK_RAPIDS_PLUGIN_JAR} => /ws/rapids/jars//rapids-4-spark_2.12-0.1.0.jar
OS: Ubuntu 18.04.3
CPU: Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz
GPU: TITAN X (Pascal)
Additional context
N/A