
[BUG] The ORC output data of a query is not readable #1550

Closed
wjxiz1992 opened this issue Jan 19, 2021 · 6 comments · Fixed by #2084
Labels: bug (Something isn't working), P0 (Must have for release)

@wjxiz1992 (Collaborator)

Describe the bug
When reading the ORC output produced by the plugin (using some DataFrame APIs to operate on the ORC data), there's an error:

Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 5 kind LENGTH position: 23 length: 23 range: 0 offset: 2762004 limit: 2762004 range 0 = 0 to 23 uncompressed: 20 to 20
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:1778)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:1758)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1500)
        at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.startStripe(TreeReaderFactory.java:2090)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1254)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1289)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:286)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:669)
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
        at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$1(OrcFileFormat.scala:216)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Steps/Code to reproduce bug
The output is produced by an LHA query, but I think it's enough to just point to where it lives on our egx machines: spark-egx-02:/home/allxu/q0_out_gpu.

$SPARK_HOME/bin/spark-shell

scala> val df = spark.read.orc("q0_out_gpu")
df: org.apache.spark.sql.DataFrame = [lx_id: bigint, lx_name: string ... 6 more fields]

scala> df.write.parquet("q0_convert_parquet")
21/01/20 00:03:45 ERROR Executor: Exception in task 6.0 in stage 0.0 (TID 6)
java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 5 kind LENGTH position: 23 length: 23 range: 0 offset: 2744101 limit: 2744101 range 0 = 0 to 23 uncompressed: 20 to 20
...
...

Expected behavior
No error should be seen.
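
To make "no error" concrete, here is a small sketch of the round trip that should succeed (same paths as in the repro above; the count check is illustrative and not part of the original report):

// The GPU-written ORC output should be readable by the stock CPU reader
// and convertible to Parquet without any EOFException.
val df = spark.read.orc("q0_out_gpu")
df.write.parquet("q0_convert_parquet")
println(s"rows read back: ${df.count()}")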

Environment details (please complete the following information)

  • Environment location: Standalone

Additional context
It's a query from LHA; please reach out to me if you need more information about it.

wjxiz1992 added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jan 19, 2021
wjxiz1992 changed the title from "[BUG] The ORC output data of a query is not 100% readable" to "[BUG] The ORC output data of a query is not readable" on Jan 19, 2021
@jlowe (Member) commented Jan 19, 2021

@wjxiz1992 is it possible to produce the same output but in a Parquet file? It would be good to know what data is supposed to be in the corrupted ORC file and see if we can reproduce the bad ORC file when loading the equivalent Parquet file then writing it to an ORC file just with libcudf.

@wjxiz1992 (Collaborator, Author) commented Jan 20, 2021

> @wjxiz1992 is it possible to produce the same output but in a Parquet file? It would be good to know what data is supposed to be in the corrupted ORC file and see if we can reproduce the bad ORC file when loading the equivalent Parquet file then writing it to an ORC file just with libcudf.

Yes, I've put the Parquet output (also produced by the GPU) at spark-egx-02:/home/allxu/q0_out_gpu_parquet.
One more question about "just with libcudf": to me, libcudf is just the Java API for cuDF. So you mean I could create a Java project and use those APIs to do the job, so that it's "just with libcudf"?

@jlowe (Member) commented Jan 20, 2021

By "just with libcudf" I meant isolating the issue by removing Spark from the equation. For example, just using the cudf APIs directly, e.g.: something like this from the Spark shell REPL:

// Round-trip through the cudf Java APIs directly, with no Spark in the path.
val t = ai.rapids.cudf.Table.readParquet(new java.io.File("/tmp/data.parquet"))
t.writeORC(new java.io.File("/tmp/data.orc"))
t.close()  // Table holds GPU memory and is AutoCloseable

and verify that the ORC file can be read by Spark CPU and looks correct relative to the Parquet file.
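
A minimal sketch of such a check from the Spark shell (paths hypothetical; run with the plugin disabled so the CPU readers are exercised):

// Read both files back with the CPU readers and compare contents.
// Note: except() is a set difference, so exact-duplicate rows are not
// distinguished; it is still a useful first-pass equality check.
val orc = spark.read.orc("/tmp/data.orc")
val parquet = spark.read.parquet("/tmp/data.parquet")
assert(orc.count() == parquet.count())
assert(orc.except(parquet).isEmpty && parquet.except(orc).isEmpty)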

I've checked this, and the Parquet file does not reproduce the corrupted ORC file, either when writing with Spark on the GPU or when using the cudf APIs directly. So either the corruption problem is sensitive to the ordering of the data (the ORC and Parquet files are ordered quite differently) or it's some other issue (e.g. a race condition).

I noticed that in the bad ORC file one column in particular, a string column, is unreadable due to the corruption. The other columns are all readable by Spark CPU; however, the data in another string column isn't completely correct. The first row has corrupted data relative to the Parquet file, but many other rows are correct. So the corruption isn't completely isolated to just the one string column.
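
For anyone digging into a file like this, a hedged sketch of dumping the footer metadata with the ORC reader classes that ship with Spark (the part-file name below is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Open the suspect file directly with the Java ORC reader and print the
// footer metadata (schema, row count, stripe layout).
val reader = OrcFile.createReader(
  new Path("q0_out_gpu/part-00000.orc"),  // hypothetical part-file name
  OrcFile.readerOptions(new Configuration()))
println(s"schema:  ${reader.getSchema}")
println(s"rows:    ${reader.getNumberOfRows}")
println(s"stripes: ${reader.getStripes.size}")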

Does this issue happen every time when the query is run in this cluster?

@wjxiz1992 (Collaborator, Author)

By "just with libcudf" I meant isolating the issue by removing Spark from the equation. For example, just using the cudf APIs directly, e.g.: something like this from the Spark shell REPL:

val t = ai.rapids.cudf.Table.readParquet(new java.io.File("/tmp/data.parquet"))
t.writeORC(new java.io.File("/tmp/data.orc"))

and verify that the ORC file can be read by Spark CPU and looks correct relative to the Parquet file.

I've checked this, and the Parquet file does not replicate the corrupted ORC file either when writing with Spark GPU nor when using the cudf APIs directly. So either the corruption problem is sensitive to the ordering of the data (the ORC and Parquet files are ordered quite differently) or it's some other issue (e.g.: a race condition).

I noticed in the bad ORC file that one column in particular, a string column, is unreadable due to the corruption. The other columns are all readable by Spark CPU, however the data in another string column isn't completely correct. The first row has corrupted data relative to the Parquet file but many other rows are correct. So the nature of the corruption isn't completely isolated to just the one string column.

Does this issue happen every time when the query is run in this cluster?

Thanks for the explanation!
Yes, this issue happens every single time, on both my local PC and an NGC node. Both cases use standalone mode.
The query is run with a config of only one executor.

@revans2 (Collaborator) commented Feb 12, 2021

cudf 0.18 has already shipped, so we cannot fix this in the 0.4 release. I am moving this to 0.5 and have filed #1722 to mitigate the issue in the 0.4 release.

@wjxiz1992 (Collaborator, Author)

@revans2 This has been fixed by rapidsai/cudf#7565; I tested with the latest 0.5 plugin jar and the 0.19 cuDF jar.
Shall we update the config docs and turn the ORC write switch back on?
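
For context, a sketch of what flipping that switch looks like from the Spark shell; the key spark.rapids.sql.format.orc.write.enabled is my reading of the plugin's configuration docs, so treat the exact name as an assumption:

// Re-enable the plugin's GPU ORC write path now that the cudf fix is in.
// (ORC write had been disabled by default as the #1722 mitigation.)
spark.conf.set("spark.rapids.sql.format.orc.write.enabled", "true")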

wjxiz1992 added a commit to wjxiz1992/spark-rapids that referenced this issue on Apr 6, 2021: "Re-enable the orc write since NVIDIA#1550 has been fixed"
@sameerz sameerz added this to the Mar 30 - Apr 9 milestone on Apr 6, 2021
jlowe pushed a commit that referenced this issue on Apr 7, 2021: "Re-enable the orc write since #1550 has been fixed"
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue on Jun 9, 2021: "Re-enable the orc write since NVIDIA#1550 has been fixed"
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue on Jun 9, 2021: "Re-enable the orc write since NVIDIA#1550 has been fixed"