[BUG] Mismatched number of columns while performing GpuSort
#2368
Comments
I was able to reproduce this locally and it is starting to look like a bug in cudf. I went in and started to print out the types for partitions that failed and for partitions that passed.
And the schema is the same for both cases. I'll try to see if there is something special about the data in the failure cases too. |
I think you're right, @revans2. There's an early check in I suspect it has to do with the nullability or presence of nulls in the struct column, as it is flattening the struct columns and may or may not generate a |
I have it down to a very small use case. My guess is that in one table we have a null struct and in another table we don't. Be aware that I saw some other scary errors while trying to debug this, so I will be digging a bit deeper.
|
Yup, they flatten the struct columns independently of each other, so if there is a mismatch we get the wrong number of columns, and possibly the wrong types too. |
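To make the mismatch described above concrete, here is a minimal Python sketch. It does not use cudf; `flatten_struct` and its `force_validity` flag are hypothetical names, and the rule that a validity column is emitted only when the struct column contains a null is an assumption taken from the discussion in this thread, not from the cudf source.

```python
from typing import List, Optional, Tuple

# A struct row with two children; None models a null struct row.
Row = Optional[Tuple[int, str]]


def flatten_struct(column: List[Row], force_validity: bool = False) -> List[list]:
    """Flatten one struct column into plain child columns.

    Assumption for illustration: a leading validity column is emitted only
    when the struct column has at least one null row (or when forced).
    """
    has_nulls = any(row is None for row in column)
    flat: List[list] = []
    if has_nulls or force_validity:
        # Validity column: True where the struct row is present.
        flat.append([row is not None for row in column])
    for child in range(2):
        flat.append([None if row is None else row[child] for row in column])
    return flat


sorted_table = [(1, "a"), None, (3, "c")]  # contains a null struct
probe_table = [(2, "b")]                   # contains no nulls

# Flattened independently, the two tables disagree on column count.
flat_sorted = flatten_struct(sorted_table)
flat_probe = flatten_struct(probe_table)
print(len(flat_sorted), len(flat_probe))  # 3 2

# Deciding on the validity column once, for both tables, keeps them aligned.
need_validity = any(r is None for r in sorted_table + probe_table)
aligned_sorted = flatten_struct(sorted_table, force_validity=need_validity)
aligned_probe = flatten_struct(probe_table, force_validity=need_validity)
print(len(aligned_sorted), len(aligned_probe))  # 3 3
```

The second half shows one possible way to keep the layouts consistent; it is not necessarily how the cudf fix was implemented.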
I filed rapidsai/cudf#8187 for this in cudf. But I am going to keep digging, because I saw some other, scarier behavior: while I was trying to get a very small repro case, we were getting the wrong number of columns out instead of an exception. |
@abellina have we seen this fail since the cudf fix? |
This can be closed. It is fixed by @ttnghia's change. |
This issue can be reproduced locally with #2507
|
The other important part is to be able to match the same number of tasks. How many tasks did this run with? Typically it is the number of CPU threads that your processor has. |
I ran with @gerashegalov's branch locally with 12 tasks (that's the number of cores I have), for
These other tests passed:
|
Someone needs to get a reproducible use case filed against cudf. Also at this point the code freeze is today. @sameerz the only option right now without a fix from cudf is to disable sorting for struct columns entirely. So we either need to do that or push for cudf to not freeze while we fix this. |
I will get a repro case for cudf and file an issue |
@revans2 sorry, I was looking into it, we synced, I'll take over and file an issue. |
Seeing it here https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SortUtils.scala#L168, where both findInTable and findTable are struct columns, and have 0 nulls. Getting CUDF C++ build going to add a test. One interesting thing is that the single row in
|
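For readers unfamiliar with the operation that fails here, the following is a minimal Python sketch of an upper-bound search over rows. It assumes `upperBound` has the usual meaning (for each probe row, the last position in an already sorted table where that row could be inserted while keeping the order); the function and variable names are illustrative and are not the plugin's or cudf's API.

```python
from bisect import bisect_right
from typing import List, Tuple

Row = Tuple[int, str]


def upper_bound(sorted_rows: List[Row], probe_rows: List[Row]) -> List[int]:
    """For each probe row, return the index just past the last row in
    sorted_rows that compares less than or equal to it."""
    return [bisect_right(sorted_rows, probe) for probe in probe_rows]


# Both tables must share the same column layout for row comparison to work.
sorted_rows = [(1, "a"), (2, "b"), (2, "b"), (5, "e")]
probe_rows = [(2, "b"), (0, "z"), (9, "q")]

print(upper_bound(sorted_rows, probe_rows))  # [3, 0, 4]
```

Row-wise comparison only makes sense when both inputs have the same number and types of columns, which is why a layout mismatch between the two tables surfaces in this step.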
@revans2 I ran out of time yesterday to look deeper for an even smaller, potentially cudf-only repro for this issue. I first encountered it with |
Merged @gerashegalov's change with the added test, which is now passing CI. |
We are seeing an exception on 9 tests of the integration suite when executing in a distributed fashion (a box with 4 T4s), and we run with UCX. I haven't run without UCX in this environment, so I am not entirely sure if it's UCX specific or just the distributed part.
The tests that failed are:
The exception was in GpuSort while in upperBound: