[BUG] NPE on array_max of transformed empty array #5140

Closed
jlowe opened this issue Apr 4, 2022 · 3 comments · Fixed by #5438
Labels: bug (Something isn't working), P0 (Must have for release), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments


jlowe commented Apr 4, 2022

Describe the bug
The following query produces an NPE stack trace when the RAPIDS Accelerator is enabled.

sql("SELECT ARRAY_MAX(TRANSFORM(ARRAY_REPEAT(STRUCT(1, 2), 0), s -> s.col2))").collect

Steps/Code to reproduce bug
Execute the query above with the RAPIDS Accelerator enabled, which results in the following stack trace:

java.lang.NullPointerException
	at ai.rapids.cudf.HostColumnVectorCore.getInt(HostColumnVectorCore.java:257)
	at com.nvidia.spark.rapids.RapidsHostColumnVectorCore.getInt(RapidsHostColumnVectorCore.java:109)
	at org.apache.spark.sql.vectorized.ColumnarBatchRow.getInt(ColumnarBatch.java:202)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Expected behavior
The query should not crash and should produce the same result as on the CPU, e.g.:

scala> sql("SELECT ARRAY_MAX(TRANSFORM(ARRAY_REPEAT(STRUCT(1, 2), 0), s -> s.col2))").collect
res56: Array[org.apache.spark.sql.Row] = Array([null])
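
For reference, the NULL result is simply Spark's CPU semantics for ARRAY_MAX over an empty array, independent of TRANSFORM. A minimal spark-shell sketch (the CAST is my addition, not part of the original repro, used only to give the empty array literal an orderable element type):

// Sketch: ARRAY_MAX of an empty (but typed) array yields NULL on the CPU.
sql("SELECT ARRAY_MAX(CAST(ARRAY() AS ARRAY<INT>))").collect
// expected: Array([null])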

Environment details
Spark 3.2.1

jlowe added the "bug" and "? - Needs Triage" labels on Apr 4, 2022
mattahrens added the "P1 - Nice to have for release" label and removed "? - Needs Triage" on Apr 5, 2022

gerashegalov commented Apr 9, 2022

It seems to boil down to incorrect handling of empty arrays in the array aggregation:

from pyspark.sql.functions import *
from pyspark.sql.types import *
schema = StructType(
    [
        StructField('c1', ArrayType(IntegerType(), containsNull=True))
    ]
)
df = spark.createDataFrame(
    [
        [[]]
    ],
    schema
)
df.select(array_max('c1')).collect()

22/04/09 05:45:54 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> array_max(c1#0) AS array_max(c1)#2 will run on GPU
    *Expression <ArrayMax> array_max(c1#0) will run on GPU
  ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
    @Expression <AttributeReference> c1#0 could run on GPU

22/04/09 05:45:55 ERROR Executor: Exception in task 15.0 in stage 0.0 (TID 15)
Caused by: java.lang.AssertionError: index is out of range 0 <= 0 < 0
 at ai.rapids.cudf.HostColumnVectorCore.isNull(HostColumnVectorCore.java:451)
 at com.nvidia.spark.rapids.RapidsHostColumnVectorCore.isNullAt(RapidsHostColumnVectorCore.java:89)
 at org.apache.spark.sql.vectorized.ColumnarBatchRow.isNullAt(ColumnarBatch.java:190)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
 at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:350)
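
The assertion message makes the access pattern visible: the generated projection probes the child column vector at row 0 even though that child holds zero rows. A minimal sketch of the pattern (hypothetical names, not the actual cuDF API):

// Illustrative sketch only: the child vector behind an empty list row has
// zero rows, yet the generated projection still asks it about row 0.
class ChildVectorSketch(rowCount: Int) {
  def isNullAt(rowId: Int): Boolean = {
    assert(0 <= rowId && rowId < rowCount,
      s"index is out of range 0 <= $rowId < $rowCount")
    false
  }
}
val emptyChild = new ChildVectorSketch(0)
emptyChild.isNullAt(0) // java.lang.AssertionError: index is out of range 0 <= 0 < 0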

revans2 added the "P0 - Must have for release" and "reliability" labels and removed "P1 - Nice to have for release" on Apr 12, 2022
sameerz added this to the May 2 - May 20 milestone on Apr 29, 2022
gerashegalov commented

This issue can be (re)solved in cuDF via rapidsai/cudf#10779

rapids-bot (bot) pushed a commit to rapidsai/cudf that referenced this issue on May 6, 2022
…#10779)

This PR suggests a 3VL way of interpreting `isNull` for a `rowId` out of bounds. Such a value is unknown, and therefore `isNull` should be `true`.

NVIDIA/spark-rapids#5140 shows that `SpecificUnsafeProjection` may probe child columns for NULL even though the parent column row is also NULL. 

However, there are no rows in the child CV when the parent row is NULL, leading to an assert violation if asserts are enabled or an NPE if they are disabled.

Signed-off-by: Gera Shegalov <[email protected]>

Authors:
  - Gera Shegalov (https://github.com/gerashegalov)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #10779
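
In other words, the fix reads an out-of-bounds rowId as an unknown value under three-valued logic. A minimal sketch of that behavior (hypothetical class, not the actual cuDF implementation):

// Sketch of the 3VL reading of isNull described above (illustrative only):
// an out-of-bounds rowId denotes an unknown value, so report it as null
// instead of asserting (with -ea) or dereferencing a missing buffer (NPE).
class HostColumnSketch(rowCount: Long, validity: Option[Array[Boolean]]) {
  def isNull(rowId: Long): Boolean =
    if (rowId < 0 || rowId >= rowCount) true // unknown => null (3VL)
    else validity.exists(v => !v(rowId.toInt)) // no validity vector => all rows valid
}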
gerashegalov linked pull request #5438 on May 11, 2022 that will close this issue
gerashegalov commented

Closed by rapidsai/cudf#10779 and #5438
