Describe the bug
Support for reading binary columns was recently added to cuDF and Spark (#6161), but the tests did not cover nested types, and the code appears to crash when a binary column nested inside another type is read.
Steps/Code to reproduce bug
spark.range(100).selectExpr("CAST(id AS String) as s").selectExpr("CAST(S AS BINARY) as b").selectExpr("struct(b) as st").write.mode("overwrite").parquet("./target/TEST")
spark.read.parquet("./target/TEST").show()
This results in a crash:
...
Caused by: java.lang.AssertionError: Type conversion is not allowed from Table{columns=[ColumnVector{rows=100, type=STRUCT, nullCount=Optional.empty, offHeap=(ID: 14 7fa1c0dbe310)}], cudfTable=140332701702640, rows=100} to [StructType(StructField(b,BinaryType,true))] columns 0 to 1
at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:776)
at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:657)
at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$4(GpuMultiFileReader.scala:787)
at scala.Option.map(Option.scala:230)
at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$1(GpuMultiFileReader.scala:787)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
With asserts disabled, the outcome might be even worse...
Expected behavior
No crash and no data corruption. We either fall back to the CPU or we do the right thing.
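For triage, it may help to identify which columns are affected from the schema alone. The following is a minimal sketch of such a check, assuming only the standard Spark SQL type classes. `NestedBinaryCheck`, `containsBinary`, and `nestedBinaryColumns` are hypothetical names made up for illustration. This is not how the plugin decides on a CPU fallback.

```scala
import org.apache.spark.sql.types._

// Hypothetical triage helper; not part of the plugin.
object NestedBinaryCheck {
  // True if BinaryType appears anywhere inside the given type.
  def containsBinary(dt: DataType): Boolean = dt match {
    case BinaryType => true
    case ArrayType(elementType, _) => containsBinary(elementType)
    case MapType(keyType, valueType, _) =>
      containsBinary(keyType) || containsBinary(valueType)
    case StructType(fields) => fields.exists(f => containsBinary(f.dataType))
    case _ => false
  }

  // Names of top-level columns that hold binary data inside a nested type.
  // A plain top-level binary column is deliberately not reported.
  def nestedBinaryColumns(schema: StructType): Seq[String] =
    schema.fields.toSeq.collect {
      case StructField(name, dt, _, _) if dt != BinaryType && containsBinary(dt) => name
    }
}
```

Given the schema of the struct reproduction above (`st: struct<b: binary>`), `NestedBinaryCheck.nestedBinaryColumns` would report `st`. A flat `b: binary` column would not be reported.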
Environment details (please complete the following information)
This appears to happen on both 22.08 and 22.10.
spark.range(100).selectExpr("CAST(id AS String) as s").selectExpr("CAST(S AS BINARY) as b").selectExpr("array(b) as ab").write.mode("overwrite").parquet("/tmp/test.parquet")
spark.read.parquet("/tmp/test.parquet").show()
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at com.nvidia.spark.rapids.RapidsHostColumnVectorCore.getBinary(RapidsHostColumnVectorCore.java:188)
at org.apache.spark.sql.vectorized.ColumnarArray.getBinary(ColumnarArray.java:153)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)