
[BUG] Mismatched number of columns while performing GpuSort #2368

Closed
abellina opened this issue May 7, 2021 · 16 comments
Assignees: revans2
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

abellina (Collaborator) commented May 7, 2021

We are seeing an exception in 9 tests of the integration suite when executing in a distributed fashion (a box with 4 T4s) with UCX enabled. I haven't run without UCX in this environment, so I am not entirely sure whether it is UCX specific or just the distributed part.

The tests that failed are:

05:18:07  FAILED ../../src/main/python/repart_test.py::test_round_robin_sort_fallback[[('a', Struct(('a_1', Struct(('a_1_1', Integer)))))]][IGNORE_ORDER({'local': True}), ALLOW_NON_GPU(ShuffleExchangeExec,RoundRobinPartitioning)]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a ASC NULLS FIRST'>0-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-True-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a ASC NULLS FIRST'>0-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-False-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a ASC NULLS FIRST'>1-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-True-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a ASC NULLS FIRST'>1-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-False-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a DESC NULLS LAST'>0-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-True-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a DESC NULLS LAST'>0-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-False-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a DESC NULLS LAST'>1-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-True-200]
05:18:07  FAILED ../../src/main/python/sort_test.py::test_single_nested_orderby_plain[Column<b'a DESC NULLS LAST'>1-Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null])-False-200]

The exception occurred in GpuSort, inside upperBound (a short sketch of what upperBound computes follows the stack trace):

Caused by: ai.rapids.cudf.CudfException: cuDF failure at: row_operators.cuh:361: Mismatched number of columns.
    at ai.rapids.cudf.Table.bound(Native Method)
    at ai.rapids.cudf.Table.upperBound(Table.java:1409)
    at ai.rapids.cudf.Table.upperBound(Table.java:1434)
    at com.nvidia.spark.rapids.GpuSorter.$anonfun$upperBound$2(SortUtils.scala:168)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuSorter.withResource(SortUtils.scala:65)
    at com.nvidia.spark.rapids.GpuSorter.$anonfun$upperBound$1(SortUtils.scala:167)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuSorter.withResource(SortUtils.scala:65)
    at com.nvidia.spark.rapids.GpuSorter.upperBound(SortUtils.scala:166)
    at com.nvidia.spark.rapids.GpuRangePartitioner.$anonfun$computeBoundsAndClose$4(GpuRangePartitioner.scala:187)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuRangePartitioner.withResource(GpuRangePartitioner.scala:167)
    at com.nvidia.spark.rapids.GpuRangePartitioner.$anonfun$computeBoundsAndClose$3(GpuRangePartitioner.scala:186)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuRangePartitioner.withResource(GpuRangePartitioner.scala:167)
    at com.nvidia.spark.rapids.GpuRangePartitioner.$anonfun$computeBoundsAndClose$2(GpuRangePartitioner.scala:184)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuRangePartitioner.withResource(GpuRangePartitioner.scala:167)
    at com.nvidia.spark.rapids.GpuRangePartitioner.$anonfun$computeBoundsAndClose$1(GpuRangePartitioner.scala:182)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuRangePartitioner.withResource(GpuRangePartitioner.scala:167)
    at com.nvidia.spark.rapids.GpuRangePartitioner.computeBoundsAndClose(GpuRangePartitioner.scala:180)
    at com.nvidia.spark.rapids.GpuRangePartitioner.columnarEval(GpuRangePartitioner.scala:201)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$.$anonfun$prepareBatchShuffleDependency$3(GpuShuffleExchangeExec.scala:205)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.partNextBatch(GpuShuffleExchangeExec.scala:226)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExec$$anon$1.hasNext(GpuShuffleExchangeExec.scala:237)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at org.apache.spark.sql.rapids.RapidsCachingWriter.write(RapidsShuffleInternalManagerBase.scala:97)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
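
For context, cudf's upper_bound is a row-wise binary search: for each row of the "find" (needles) table it returns the first index in the sorted "find in" (haystack) table whose row compares strictly greater, i.e. where the needle could be inserted while keeping the data sorted. A minimal single-column sketch in Scala (an illustration only, not the plugin's code; the real call compares whole rows across all sort-key columns):

// Sketch only: the index of the first element strictly greater than the needle.
def upperBound(sortedHaystack: IndexedSeq[Int], needle: Int): Int = {
  var lo = 0
  var hi = sortedHaystack.length
  while (lo < hi) {
    val mid = (lo + hi) >>> 1
    if (sortedHaystack(mid) <= needle) lo = mid + 1 else hi = mid
  }
  lo
}

upperBound(IndexedSeq(1, 3, 3, 7), 3)  // 3: lands just past the existing 3s
upperBound(IndexedSeq(1, 3, 3, 7), 0)  // 0: before everything

The range-partitioning path calls this through GpuSorter.upperBound, which is where the failure above is raised.
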
abellina added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P0 (Must have for release) labels on May 7, 2021
revans2 self-assigned this on May 7, 2021
revans2 (Collaborator) commented May 7, 2021

I was able to reproduce this locally and it is starting to look like a bug in cudf. I went in and started to print out the types for partitions that failed and for partitions that passed.

GOOD: FIND IN TBL TABLE: {{STRUCT: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, STRING, BOOL8, TIMESTAMP_DAYS, TIMESTAMP_MICROSECONDS, INT8}}
GOOD: FIND TBL TABLE: {{STRUCT: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, STRING, BOOL8, TIMESTAMP_DAYS, TIMESTAMP_MICROSECONDS, INT8}}
GOOD: ORDERING WrappedArray(ORDER BY 0 ASC NULL SMALLEST)
ERR: FIND IN TBL TABLE: {{STRUCT: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, STRING, BOOL8, TIMESTAMP_DAYS, TIMESTAMP_MICROSECONDS, INT8}}
ERR: FIND TBL TABLE: {{STRUCT: INT8, INT16, INT32, INT64, FLOAT32, FLOAT64, STRING, BOOL8, TIMESTAMP_DAYS, TIMESTAMP_MICROSECONDS, INT8}}
ERR: ORDERING WrappedArray(ORDER BY 0 ASC NULL SMALLEST)

And the schema is the same for both cases. I'll try to see if there is something special about the data in the failure cases too.

jlowe (Member) commented May 7, 2021

I think you're right, @revans2. There's an early check in cudf::upper_bound that the column counts match, and we're not failing there so the column counts are getting off somewhere within the implementation.

I suspect it has to do with the nullability or presence of nulls in the struct column, as it is flattening the struct columns and may or may not generate a BOOL8 column for the parent struct's validity. I think that could lead to a column count difference if one side has null structs and the other does not.
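
To make that hypothesis concrete, here is a minimal Scala model (a sketch only, not cudf's implementation) of how independently flattening the two struct columns could yield different column counts depending on whether a side contains null struct rows:

sealed trait DType
case object INT32 extends DType
case object BOOL8 extends DType

// A struct column with flat children; hasNullStructs records whether any
// top-level struct row is null.
case class StructCol(children: Seq[DType], hasNullStructs: Boolean)

// Flattening emits an extra BOOL8 validity column only when the struct has null rows.
def flatten(col: StructCol): Seq[DType] =
  (if (col.hasNullStructs) Seq(BOOL8) else Nil) ++ col.children

flatten(StructCol(Seq(INT32), hasNullStructs = false))  // List(INT32):        1 column
flatten(StructCol(Seq(INT32), hasNullStructs = true))   // List(BOOL8, INT32): 2 columns

If the haystack and needles tables are flattened independently like this, the row comparator sees one column on one side and two on the other, which matches the "Mismatched number of columns" check in row_operators.cuh.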

revans2 (Collaborator) commented May 7, 2021

I have it down to a very small use case. My guess is that in one table we have a null struct and in another table we don't. Be aware that I saw some other scary errors while trying to debug this, so I will be digging a bit deeper.

DEBUG ERR: FIND IN TBL TABLE: {{STRUCT: INT32}} Table{columns=[ColumnVector{rows=8, type=STRUCT, nullCount=Optional.empty, offHeap=(ID: 59 7fd3ec18c530)}], cudfTable=140548175888736, rows=8}
COLUMN 0 - STRUCT
COLUMN 0:CHILD_0 - INT32
0 -2087175000
1 -1028677693
2 173016769
3 887174525
4 1578841898
5 1591161832
6 1859007115
7 2008070098
DEBUG ERR: FIND TBL TABLE: {{STRUCT: INT32}} Table{columns=[ColumnVector{rows=31, type=STRUCT, nullCount=Optional.empty, offHeap=(ID: 68 7fd3ec18fcf0)}], cudfTable=140548175876800, rows=31}
COLUMN 0 - STRUCT
0 NULL
COLUMN 0:CHILD_0 - INT32
0 NULL
1 NULL
2 -2087175000
3 -1345486411
4 -1028677693
5 -341142443
6 -282729822
7 -236270331
8 -100798596
9 -100515324
10 20129910
11 173016769
12 324918642
13 326215455
14 558978567
15 775624428
16 790204970
17 848988934
18 887174525
19 994238642
20 1259821658
21 1265224248
22 1338435109
23 1578841898
24 1591161832
25 1618216427
26 1727289611
27 1859007115
28 1931725428
29 2008070098
30 2063803297
ERR: ORDERING WrappedArray(ORDER BY 0 ASC NULL SMALLEST)

revans2 (Collaborator) commented May 7, 2021

Yup, they flatten the struct columns independently of each other, so if the null structure differs between the two tables we get the wrong number of columns, and possibly the wrong types too.

revans2 (Collaborator) commented May 7, 2021

I filed rapidsai/cudf#8187 for this in cudf. But I am going to keep digging, because while trying to get a very small repro case I saw some other, scarier behavior: we were getting the wrong number of columns out instead of an exception.

sameerz removed the ? - Needs Triage (Need team to review and classify) label on May 11, 2021
sameerz (Collaborator) commented May 11, 2021

@abellina have we seen this fail since the cudf fix?

abellina (Collaborator, Author) commented:

This can be closed. It is fixed by @ttnghia's change.

gerashegalov (Collaborator) commented:

This issue can be reproduced locally with #2507

TEST_PARALLEL=1 \
SPARK_HOME=~/dist/spark-3.1.1-bin-hadoop3.2 \
./integration_tests/run_pyspark_from_build.sh -k 'test_large_orderby and Struct'

gerashegalov reopened this on May 26, 2021
revans2 (Collaborator) commented May 26, 2021

This issue can be reproduced locally with #2507

TEST_PARALLEL=1 \
SPARK_HOME=~/dist/spark-3.1.1-bin-hadoop3.2 \
./integration_tests/run_pyspark_from_build.sh -k 'test_large_orderby and Struct'

The other important part is being able to match the number of tasks. How many tasks did this run with? Typically it is the number of CPU threads your processor has.

abellina (Collaborator, Author) commented:

I ran @gerashegalov's branch locally with 12 tasks (that's the number of cores I have), for test_large_orderby[Struct(('child0', Long))]:

../../src/main/python/sort_test.py::test_large_orderby[Struct(('child0', Long))] 21/05/26 13:22:36 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 66)
ai.rapids.cudf.CudfException: cuDF failure at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-324-cuda11/cpp/include/cudf/table/row_operators.cuh:361: Mismatched number of columns.
	at ai.rapids.cudf.Table.bound(Native Method)
	at ai.rapids.cudf.Table.upperBound(Table.java:1409)
	at ai.rapids.cudf.Table.upperBound(Table.java:1434)
	at com.nvidia.spark.rapids.GpuSorter.$anonfun$upperBound$2(SortUtils.scala:168)

These other tests passed:

../../src/main/python/sort_test.py::test_large_orderby_stable[Struct(('child0', Long))] PASSED [ 66%]
../../src/main/python/sort_test.py::test_large_orderby_nested_ridealong[Struct(('child1', Byte))] PASSED [100%]

revans2 (Collaborator) commented May 26, 2021

Someone needs to get a reproducible use case filed against cudf. Also at this point the code freeze is today.

@sameerz the only option right now without a fix from cudf is to disable sorting for struct columns entirely. So we either need to do that or push for cudf to not freeze while we fix this.
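
If a stopgap is needed on the user side, the coarse option would be to let sorts fall back to the CPU. A hedged sketch only: the config key below is assumed from the plugin's usual spark.rapids.sql.exec.<Exec> pattern rather than confirmed in this thread, and it is broader than the per-type disable described above.

// Hypothetical user-side workaround (e.g. in spark-shell): fall back to CPU for sorts entirely,
// rather than disabling struct sorting only.
spark.conf.set("spark.rapids.sql.exec.SortExec", "false")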

revans2 (Collaborator) commented May 26, 2021

I will get a repro case for cudf and file an issue.

abellina (Collaborator, Author) commented:

@revans2 sorry, I was already looking into it. We synced; I'll take over and file an issue.

abellina (Collaborator, Author) commented:

Seeing it here: https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SortUtils.scala#L168, where both findInTable and findTable are struct columns and have 0 nulls. Getting a cuDF C++ build going to add a test.

One interesting thing is that the single row in findTbl is not in findInTbl.

FIND IN TBL => Table{columns=[ColumnVector{rows=1024, type=STRUCT, nullCount=Optional[0], offHeap=(ID: 1646 7fa9bc0d8440)}], cudfTable=140366980782704, rows=1024}
FIND TBL: Table{columns=[ColumnVector{rows=1, type=STRUCT, nullCount=Optional[0], offHeap=(ID: 1651 7fa9bc0d8340)}], cudfTable=140366981198240, rows=1}

gerashegalov (Collaborator) commented:

@revans2 I ran out of time yesterday to look for an even smaller, potentially cudf-only repro for this issue. I first encountered it with NUM_LOCAL_EXECS=2 CORES_PER_EXEC=6, but then consistently reproduced it in single-JVM local mode with the command provided. My desktop has 12 cores.

abellina (Collaborator, Author) commented:

Merged @gerashegalov's change with the added test, which is now passing CI.
