
[BUG] Mismatched number of columns while performing GpuSort #2368

Closed
abellina opened this issue May 7, 2021 · 16 comments
Assignees: revans2
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

abellina (Collaborator) commented May 7, 2021

We are seeing an exception in 9 tests of the integration suite when executing in a distributed fashion (a box with 4 T4s) with UCX enabled. I haven't run without UCX in this environment, so I am not entirely sure whether it is UCX specific or just the distributed part.

The tests that failed are:

05:18:07  FAILED ../../src/main/python/repart_test.py::test_round_robin_sort_fallback[[('a', Struct(('a_1', Struct(('a_1_1', Integer)))))]][IGNORE_ORDER({'local': True}), ALLOW_NON_GPU(ShuffleExchangeExec,RoundRobinPartitioning)]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a ASC NULLS FIRST'>0-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-True-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a ASC NULLS FIRST'>0-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-False-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a ASC NULLS FIRST'>1-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-True-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a ASC NULLS FIRST'>1-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-False-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a DESC NULLS LAST'>0-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-True-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a DESC NULLS LAST'>0-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-False-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a DESC NULLS LAST'>1-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-True-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a DESC NULLS LAST'>1-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-False-200]

The exception occurred in GpuSort, inside upperBound (a short sketch of what upperBound computes follows the stack trace):

Caused by: ai.rapids.cudf.CudfException: cuDF failure at: row_operators.cuh:361: Mismatched number of columns.
    at ai.rapids.cudf.Table.bound(Native Method)
    at ai.rapids.cudf.Table.upperBound(Table.java:1409)
    at ai.rapids.cudf.Table.upperBound(Table.java:1434)
    at com.nvidia.spark.rapids.GpuSorter.$anonfun$upperBound$2(SortUtils.scala:168)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuSorter.withResource(SortUtils.scala:65)
    at com.nvidia.spark.rapids.GpuSorter.$anonfun$upperBound$1(SortUtils.scala:167)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuSorter.withResource(SortUtils.scala:65)
    at com.nvidia.spark.rapids.GpuSorter.upperBound(SortUtils.scala:166)
    at com.nvidia.spark.rapids.GpuRangePartitioner.$anonfun$computeBoundsAndClose$4(GpuRangePartitioner.scala:187)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuRangePartitioner.withResource(GpuRangePartitioner.scala:167)
    at com.nvidia.spark.rapids.GpuRangePartitioner.$anonfun$computeBoundsAndClose$3(GpuRangePartitioner.scala:186)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuRangePartitioner.withResource(GpuRangePartitioner.scala:167)
    at com.nvidia.spark.rapids.GpuRangePartitioner.$anonfun$computeBoundsAndClose$2(GpuRangePartitioner.scala:184)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuRangePartitioner.withResource(GpuRangePartitioner.scala:167)
    at com.nvidia.spark.rapids.GpuRangePartitioner.$anonfun$computeBoundsAndClose$1(GpuRangePartitioner.scala:182)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuRangePartitioner.withResource(GpuRangePartitioner.scala:167)
    at com.nvidia.spark.rapids.GpuRangePartitioner.computeBoundsAndClose(GpuRangePartitioner.scala:180)
    at com.nvidia.spark.rapids.GpuRangePartitioner.columnarEval(GpuRangePartitioner.scala:201)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.$anonfun$prepareBatchShuffleDependency$3(GpuShuffleExchangeExec.scala:205)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:226)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:237)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at org.apache.spark.sql.rapids.RapidsCachingWriter.write(RapidsShuffleInternalManagerBase.scala:97)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
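
For context, cudf's upper_bound is a row-wise binary search: for each row of the "find" (needles) table it returns the first index in the sorted "find in" (haystack) table whose row compares strictly greater, i.e. where the needle could be inserted while keeping the data sorted. A minimal single-column sketch in Scala (an illustration only, not the plugin's code; the real call compares whole rows across all sort-key columns):

// Sketch only: the index of the first element strictly greater than the needle.
def upperBound(sortedHaystack: IndexedSeq[Int], needle: Int): Int = {
  var lo = 0
  var hi = sortedHaystack.length
  while (lo < hi) {
    val mid = (lo + hi) >>> 1
    if (sortedHaystack(mid) <= needle) lo = mid + 1 else hi = mid
  }
  lo
}

upperBound(IndexedSeq(1, 3, 3, 7), 3)  // 3: lands just past the existing 3s
upperBound(IndexedSeq(1, 3, 3, 7), 0)  // 0: before everything

The range-partitioning path calls this through GpuSorter.upperBound, which is where the failure above is raised.
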
abellina added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P0 (Must have for release) labels on May 7, 2021
revans2 self-assigned this on May 7, 2021
revans2 (Collaborator) commented May 7, 2021

I was able to reproduce this locally and it is starting to look like a bug in cudf. I went in and started to print out the types for partitions that failed and for partitions that passed.

GOOD: FIND IN TBL TABLE: {{STRUCT: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, STRING, BOOL8, TIMESTAMP_DAYS, TIMESTAMP_MICROSECONDS, INT8}}
GOOD: FIND TBL TABLE: {{STRUCT: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, STRING, BOOL8, TIMESTAMP_DAYS, TIMESTAMP_MICROSECONDS, INT8}}
GOOD: ORDERING WrappedArray(ORDER BY 0 ASC NULL SMALLEST)
ERR: FIND IN TBL TABLE: {{STRUCT: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, STRING, BOOL8, TIMESTAMP_DAYS, TIMESTAMP_MICROSECONDS, INT8}}
ERR: FIND TBL TABLE: {{STRUCT: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, STRING, BOOL8, TIMESTAMP_DAYS, TIMESTAMP_MICROSECONDS, INT8}}
ERR: ORDERING WrappedArray(ORDER BY 0 ASC NULL SMALLEST)

And the schema is the same for both cases. I'll try to see if there is something special about the data in the failure cases too.

jlowe (Member) commented May 7, 2021

I think you're right, @revans2. There's an early check in cudf::upper_bound that the column counts match, and we're not failing there so the column counts are getting off somewhere within the implementation.

I suspect it has to do with the nullability or presence of nulls in the struct column, as it is flattening the struct columns and may or may not generate a BOOL8 column for the parent struct's validity. I think that could lead to a column count difference if one side has null structs and the other does not.
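
To make that hypothesis concrete, here is a minimal Scala model (a sketch only, not cudf's implementation) of how independently flattening the two struct columns could yield different column counts depending on whether a side contains null struct rows:

sealed trait DType
case object INT32 extends DType
case object BOOL8 extends DType

// A struct column with flat children; hasNullStructs records whether any
// top-level struct row is null.
case class StructCol(children: Seq[DType], hasNullStructs: Boolean)

// Flattening emits an extra BOOL8 validity column only when the struct has null rows.
def flatten(col: StructCol): Seq[DType] =
  (if (col.hasNullStructs) Seq(BOOL8) else Nil) ++ col.children

flatten(StructCol(Seq(INT32), hasNullStructs = false))  // List(INT32):        1 column
flatten(StructCol(Seq(INT32), hasNullStructs = true))   // List(BOOL8, INT32): 2 columns

If the haystack and needles tables are flattened independently like this, the row comparator sees one column on one side and two on the other, which matches the "Mismatched number of columns" check in row_operators.cuh.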

revans2 (Collaborator) commented May 7, 2021

I have it down to a very small use case. My guess is that in one table we have a null struct and in another table we don't. Be aware that I saw some other scary errors while trying to debug this, so I will be digging a bit deeper.

DEBUG ERR: FIND IN TBL TABLE: {{STRUCT: INT32}} Table{columns=[ColumnVector{rows=8, type=STRUCT, nullCount=Optional.empty, offHeap=(ID: 59 7fd3ec18c530)}], cudfTable=140548175888736, rows=8}
COLUMN 0 - STRUCT
COLUMN 0:CHILD_0 - INT32
0 -2087175000
1 -1028677693
2 173016769
3 887174525
4 1578841898
5 1591161832
6 1859007115
7 2008070098
DEBUG ERR: FIND TBL TABLE: {{STRUCT: INT32}} Table{columns=[ColumnVector{rows=31, type=STRUCT, nullCount=Optional.empty, offHeap=(ID: 68 7fd3ec18fcf0)}], cudfTable=140548175876800, rows=31}
COLUMN 0 - STRUCT
0 NULL
COLUMN 0:CHILD_0 - INT32
0 NULL
1 NULL
2 -2087175000
3 -1345486411
4 -1028677693
5 -341142443
6 -282729822
7 -236270331
8 -100798596
9 -100515324
10 20129910
11 173016769
12 324918642
13 326215455
14 558978567
15 775624428
16 790204970
17 848988934
18 887174525
19 994238642
20 1259821658
21 1265224248
22 1338435109
23 1578841898
24 1591161832
25 1618216427
26 1727289611
27 1859007115
28 1931725428
29 2008070098
30 2063803297
ERR: ORDERING WrappedArray(ORDER BY 0 ASC NULL SMALLEST)

revans2 (Collaborator) commented May 7, 2021

Yup, they flatten the struct columns independently of each other, so if the null structure differs between the two tables we get the wrong number of columns, and possibly the wrong types too.

revans2 (Collaborator) commented May 7, 2021

I filed rapidsai/cudf#8187 for this in cudf. But I am going to keep digging, because while trying to get a very small repro case I saw some other, scarier behavior: we were getting the wrong number of columns out instead of an exception.

sameerz removed the ? - Needs Triage (Need team to review and classify) label on May 11, 2021
sameerz (Collaborator) commented May 11, 2021

@abellina have we seen this fail since the cudf fix?

abellina (Collaborator, Author) commented:

This can be closed. It is fixed by @ttnghia's change.

gerashegalov (Collaborator) commented:

This issue can be reproduced locally with #2507

TEST_PARALLEL=1 \
SPARK_HOME=~/dist/spark-3.1.1-bin-hadoop3.2 \
./integration_tests/run_pyspark_from_build.sh -k 'test_large_orderby and Struct'

gerashegalov reopened this on May 26, 2021
revans2 (Collaborator) commented May 26, 2021

This issue can be reproduced locally with #2507

TEST_PARALLEL=1 \
SPARK_HOME=~/dist/spark-3.1.1-bin-hadoop3.2 \
./integration_tests/run_pyspark_from_build.sh -k 'test_large_orderby and Struct'

The other important part is being able to match the number of tasks. How many tasks did this run with? Typically it is the number of CPU threads your processor has.

abellina (Collaborator, Author) commented:

I ran @gerashegalov's branch locally with 12 tasks (that's the number of cores I have), for test_large_orderby[Struct(('child0', Long))]:

../../src/main/python/sort_test.py::test_large_orderby[Struct(('child0', Long))] 21/05/26 13:22:36 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 66)
ai.rapids.cudf.CudfException: cuDF failure at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-324-cuda11/cpp/include/cudf/table/row_operators.cuh:361: Mismatched number of columns.
	at ai.rapids.cudf.Table.bound(Native Method)
	at ai.rapids.cudf.Table.upperBound(Table.java:1409)
	at ai.rapids.cudf.Table.upperBound(Table.java:1434)
	at com.nvidia.spark.rapids.GpuSorter.$anonfun$upperBound$2(SortUtils.scala:168)

These other tests passed:

../../src/main/python/sort_test.py::test_large_orderby_stable[Struct(('child0', Long))] PASSED [ 66%]
../../src/main/python/sort_test.py::test_large_orderby_nested_ridealong[Struct(('child1', Byte))] PASSED [100%]

revans2 (Collaborator) commented May 26, 2021

Someone needs to get a reproducible use case filed against cudf. Also at this point the code freeze is today.

@sameerz the only option right now without a fix from cudf is to disable sorting for struct columns entirely. So we either need to do that or push for cudf to not freeze while we fix this.
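
If a stopgap is needed on the user side, the coarse option would be to let sorts fall back to the CPU. A hedged sketch only: the config key below is assumed from the plugin's usual spark.rapids.sql.exec.<Exec> pattern rather than confirmed in this thread, and it is broader than the per-type disable described above.

// Hypothetical user-side workaround (e.g. in spark-shell): fall back to CPU for sorts entirely,
// rather than disabling struct sorting only.
spark.conf.set("spark.rapids.sql.exec.SortExec", "false")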

revans2 (Collaborator) commented May 26, 2021

I will get a repro case for cudf and file an issue.

abellina (Collaborator, Author) commented:

@revans2 sorry, I was already looking into it. We synced; I'll take over and file an issue.

abellina (Collaborator, Author) commented:

Seeing it here: https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SortUtils.scala#L168, where both findInTable and findTable are struct columns and have 0 nulls. Getting a cuDF C++ build going to add a test.

One interesting thing is that the single row in findTbl is not in findInTbl.

FIND IN TBL => Table{columns=[ColumnVector{rows=1024, type=STRUCT, nullCount=Optional[0], offHeap=(ID: 1646 7fa9bc0d8440)}], cudfTable=140366980782704, rows=1024}
FIND TBL: Table{columns=[ColumnVector{rows=1, type=STRUCT, nullCount=Optional[0], offHeap=(ID: 1651 7fa9bc0d8340)}], cudfTable=140366981198240, rows=1}

gerashegalov (Collaborator) commented:

@revans2 I ran out of time yesterday to look for an even smaller, potentially cudf-only repro for this issue. I first encountered it with NUM_LOCAL_EXECS=2 CORES_PER_EXEC=6, but then consistently reproduced it in single-JVM local mode with the command provided. My desktop has 12 cores.

abellina (Collaborator, Author) commented:

Merged @gerashegalov's change with the added test, which is now passing CI.
