[BUG] NPE on array_max of transformed empty array #5140

Closed
jlowe opened this issue Apr 4, 2022 · 3 comments · Fixed by #5438
Labels: bug (Something isn't working), P0 (Must have for release), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments


jlowe commented Apr 4, 2022

Describe the bug
The following query produces an NPE stack trace when the RAPIDS Accelerator is enabled.

sql("SELECT ARRAY_MAX(TRANSFORM(ARRAY_REPEAT(STRUCT(1, 2), 0), s -> s.col2))").collect

Steps/Code to reproduce bug
Execute the query above with the RAPIDS Accelerator enabled, which results in the following stack trace:

java.lang.NullPointerException
	at ai.rapids.cudf.HostColumnVectorCore.getInt(HostColumnVectorCore.java:257)
	at com.nvidia.spark.rapids.RapidsHostColumnVectorCore.getInt(RapidsHostColumnVectorCore.java:109)
	at org.apache.spark.sql.vectorized.ColumnarBatchRow.getInt(ColumnarBatch.java:202)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Expected behavior
The query should not crash and should produce the same result as on the CPU, e.g.:

scala> sql("SELECT ARRAY_MAX(TRANSFORM(ARRAY_REPEAT(STRUCT(1, 2), 0), s -> s.col2))").collect
res56: Array[org.apache.spark.sql.Row] = Array([null])
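
For reference, the NULL result is simply Spark's CPU semantics for ARRAY_MAX over an empty array, independent of TRANSFORM. A minimal spark-shell sketch (the CAST is my addition, not part of the original repro, used only to give the empty array literal an orderable element type):

// Sketch: ARRAY_MAX of an empty (but typed) array yields NULL on the CPU.
sql("SELECT ARRAY_MAX(CAST(ARRAY() AS ARRAY<INT>))").collect
// expected: Array([null])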

Environment details
Spark 3.2.1

jlowe added the "bug" and "? - Needs Triage" labels on Apr 4, 2022
mattahrens added the "P1 - Nice to have for release" label and removed "? - Needs Triage" on Apr 5, 2022

gerashegalov commented Apr 9, 2022

It seems to boil down to incorrect handling of empty arrays in the array aggregation:

from pyspark.sql.functions import *
from pyspark.sql.types import *
schema = StructType(
    [
        StructField('c1', ArrayType(IntegerType(), containsNull=True))
    ]
)
df = spark.createDataFrame(
    [
        [[]]
    ],
    schema
)
df.select(array_max('c1')).collect()

22/04/09 05:45:54 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> array_max(c1#0) AS array_max(c1)#2 will run on GPU
    *Expression <ArrayMax> array_max(c1#0) will run on GPU
  ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
    @Expression <AttributeReference> c1#0 could run on GPU

22/04/09 05:45:55 ERROR Executor: Exception in task 15.0 in stage 0.0 (TID 15)
Caused by: java.lang.AssertionError: index is out of range 0 <= 0 < 0
 at ai.rapids.cudf.HostColumnVectorCore.isNull(HostColumnVectorCore.java:451)
 at com.nvidia.spark.rapids.RapidsHostColumnVectorCore.isNullAt(RapidsHostColumnVectorCore.java:89)
 at org.apache.spark.sql.vectorized.ColumnarBatchRow.isNullAt(ColumnarBatch.java:190)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
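
The assertion message makes the access pattern visible: the generated projection probes the child column vector at row 0 even though that child holds zero rows. A minimal sketch of the pattern (hypothetical names, not the actual cuDF API):

// Illustrative sketch only: the child vector behind an empty list row has
// zero rows, yet the generated projection still asks it about row 0.
class ChildVectorSketch(rowCount: Int) {
  def isNullAt(rowId: Int): Boolean = {
    assert(0 <= rowId && rowId < rowCount,
      s"index is out of range 0 <= $rowId < $rowCount")
    false
  }
}
val emptyChild = new ChildVectorSketch(0)
emptyChild.isNullAt(0) // java.lang.AssertionError: index is out of range 0 <= 0 < 0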

revans2 added the "P0 - Must have for release" and "reliability" labels and removed "P1 - Nice to have for release" on Apr 12, 2022
sameerz added this to the May 2 - May 20 milestone on Apr 29, 2022
gerashegalov commented

This issue can be (re)solved in cuDF via rapidsai/cudf#10779

rapids-bot (bot) pushed a commit to rapidsai/cudf that referenced this issue on May 6, 2022
…#10779)

This PR suggests a 3VL way of interpreting `isNull` for a `rowId` out of bounds. Such a value is unknown, and therefore `isNull` should be `true`.

NVIDIA/spark-rapids#5140 shows that `SpecificUnsafeProjection` may probe child columns for NULL even though the parent column row is also NULL. 

However, there are no rows in the child CV when the parent row is NULL, leading to an assert violation if asserts are enabled or an NPE if they are disabled.

Signed-off-by: Gera Shegalov <[email protected]>

Authors:
  - Gera Shegalov (https://github.com/gerashegalov)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #10779
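
In other words, the fix reads an out-of-bounds rowId as an unknown value under three-valued logic. A minimal sketch of that behavior (hypothetical class, not the actual cuDF implementation):

// Sketch of the 3VL reading of isNull described above (illustrative only):
// an out-of-bounds rowId denotes an unknown value, so report it as null
// instead of asserting (with -ea) or dereferencing a missing buffer (NPE).
class HostColumnSketch(rowCount: Long, validity: Option[Array[Boolean]]) {
  def isNull(rowId: Long): Boolean =
    if (rowId < 0 || rowId >= rowCount) true // unknown => null (3VL)
    else validity.exists(v => !v(rowId.toInt)) // no validity vector => all rows valid
}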
gerashegalov linked pull request #5438 on May 11, 2022 that will close this issue
gerashegalov commented

Closed by rapidsai/cudf#10779 and #5438
