
[BUG] Max() can return wrong results when column consists of NaNs #87

Closed
kuhushukla opened this issue Jun 1, 2020 · 2 comments
Labels
bug (Something isn't working) · P2 (Not required for release) · SQL (part of the SQL/Dataframe plugin)

Comments

@kuhushukla (Collaborator)

Describe the bug
Max() on a column containing NaN values, with or without a grouping key that contains NaNs, generates wrong results on the GPU.
Steps/Code to reproduce bug

// GPU plugin on
val rdd = sc.parallelize(Seq(Float.NaN, 1.0, 2.0), 2) // Float.NaN widens to Double here
val df = rdd.toDF("c0")
val res = df.agg(max("c0"))
res.collect
res.explain

Output:

res1: Array[org.apache.spark.sql.Row] = Array([2.0])

== Physical Plan ==
*(2) GpuColumnarToRow false
+- GpuHashAggregate(keys=[], functions=[gpumax(c0#17)]), filters=List(None))
   +- GpuCoalesceBatches TargetSize(2147483647)
      +- GpuColumnarExchange gpusinglepartitioning(), true, [id=#75]
         +- GpuHashAggregate(keys=[], functions=[partial_gpumax(c0#17)]), filters=List(None))
            +- !GpuProject [value#14 AS c0#17]
               +- GpuRowToColumnar TargetSize(2147483647)
                  +- *(1) SerializeFromObject [input[0, double, false] AS value#14]
                     +- Scan[obj#13]
// GPU plugin off
spark.conf.set("spark.rapids.sql.enabled", "false")
val res = df.agg(max("c0"))
res.collect
res.explain

Output:

res5: Array[org.apache.spark.sql.Row] = Array([NaN])

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[max(c0#17)])
+- Exchange SinglePartition, true, [id=#106]
   +- *(1) HashAggregate(keys=[], functions=[partial_max(c0#17)])
      +- *(1) Project [value#14 AS c0#17]
         +- *(1) SerializeFromObject [input[0, double, false] AS value#14]
            +- Scan[obj#13]

Expected behavior
CPU and GPU answers should match.
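
For context (a note not in the original report): Spark SQL follows java.lang.Double.compare semantics for floating point, under which NaN is greater than every other value, so the CPU's answer of NaN is the correct maximum. A minimal sketch of that ordering in the Scala REPL:

// Sketch: why NaN is the expected maximum. Spark SQL uses
// java.lang.Double.compare ordering, where NaN is greater than
// any other double, including positive infinity.
java.lang.Double.compare(Double.NaN, 2.0) > 0                      // true
java.lang.Double.compare(Double.NaN, Double.PositiveInfinity) > 0  // true
Seq(Double.NaN, 1.0, 2.0).max.isNaN                                // true: plain Scala agrees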

Additional context
rapidsai/cudf#4753 is the related issue on the cudf side.

@revans2 (Collaborator)

revans2 commented Jun 2, 2020

This looks like an issue that we cannot work around without help from cudf. Because floating-point aggregations are off by default, I think in the short term we can live with this if we just update the documentation, but we should file a cudf-specific issue to address it properly.
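
For readers unfamiliar with the default revans2 mentions, this is roughly how the relevant knobs looked (a sketch; the config names are taken from the spark-rapids configuration docs of that era and are not quoted in this issue):

// Sketch, assuming spark-rapids config names of that era:
// floating-point aggregations stay on the CPU unless explicitly opted in.
spark.conf.set("spark.rapids.sql.enabled", "true")                   // plugin on
spark.conf.set("spark.rapids.sql.variableFloatAgg.enabled", "false") // default: float aggs run on CPU
// Opting in (setting it to "true") is what exposes the NaN max() bug described above.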

revans2 added the SQL (part of the SQL/Dataframe plugin) label and removed the ? - Needs Triage (Need team to review and classify) label on Jun 2, 2020
sameerz added the P2 (Not required for release) label on Aug 25, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
@jlowe (Member)

jlowe commented Feb 14, 2024

This has been fixed.
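
A quick way to confirm the fix on a current build (a sketch based on the original repro; the output shown is the expected result, not a captured one):

// Re-run the original repro with the plugin enabled; CPU and GPU should now agree.
spark.conf.set("spark.rapids.sql.enabled", "true")
val df = sc.parallelize(Seq(Double.NaN, 1.0, 2.0), 2).toDF("c0")
df.agg(max("c0")).collect // expected: Array([NaN]), matching the CPU answer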

@jlowe jlowe closed this as completed Feb 14, 2024