[BUG] Illegal memory access while sorting lists #12201
Comments
If the bug shows up randomly, then it looks similar to the out-of-bound error we have experienced in the past: temporary
I'm able to reliably reproduce it if I run a large suite of hash aggregation pyspark tests, but it's not reliable if I try to cut the number of tests down to just the ones that were failing. I was also able to save off the input data of a list sort operation that failed as a Parquet file, but loading that Parquet data and executing the list sort does not reproduce the problem, even with compute-sanitizer running and the memory pool disabled. I'm currently doing some experiments to verify which libcudf commit introduced the issue.
Can you try disabling the RMM memory pool when testing your saved Parquet file with the cudf list sort?
Yes, that's what I already tried without success.
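For context, a minimal sketch of that kind of replay, assuming a hypothetical file name `repro.parquet` and that the captured column is the first column of the file; this is not the actual test harness from the plugin:

```cpp
// Replay sketch: disable the RMM pool, load the saved Parquet data, and run the
// list sort directly against libcudf. Run the binary under compute-sanitizer to
// look for illegal accesses.
#include <cudf/io/parquet.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cudf/lists/sorting.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  // Plain CUDA allocations instead of a pool, so stray writes are less likely to
  // land in previously pooled (and therefore still mapped) memory.
  rmm::mr::cuda_memory_resource mr;
  rmm::mr::set_current_device_resource(&mr);

  auto const source  = cudf::io::source_info{"repro.parquet"};  // hypothetical path
  auto const options = cudf::io::parquet_reader_options::builder(source).build();
  auto const result  = cudf::io::read_parquet(options);

  // Assumes column 0 is the LIST<BOOL8> column captured from the failing query.
  auto const lists  = cudf::lists_column_view{result.tbl->view().column(0)};
  auto const sorted = cudf::lists::sort_lists(lists, cudf::order::ASCENDING, cudf::null_order::AFTER);
  return 0;
}
```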
This appears to be triggered by 90f0a77 (#11969). Spark plugin tests pass on the commit just before this and reliably fail after pulling in just this commit. I still cannot explain why I cannot reproduce it in isolation. There must be some state (e.g.: values of unallocated memory, state of GPU cache, etc.) that is key to getting it to reliably fail.
@jlowe Do you know the data type of the sorted lists? Knowing it can help identify the source of the issue better.
In the case where I can reliably get it to fail as part of a larger suite, the list is of booleans (so BOOL8 in cudf). Here's the value from the debug dump of the data into Parquet just before the list sort call:
So it is bool type with nulls. Previously this would be sorted by the cub radix sort. Now it is sorted by `cub::DeviceSegmentedSort::SortPairs`.
There are no nulls in the sample data. Spark thinks the type is nullable, but in practice the data contains no nulls. I believe this is a bug in the fast sort algorithm. @davidwendt asked me to hack the code to disable the fast algorithm, and tests pass with that hack.
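As an aside, the distinction between "nullable per the schema" and "actually contains nulls" is visible directly on the column. A minimal sketch, assuming the captured data is already loaded as a cudf LIST column (the helper name is made up for illustration):

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/lists/lists_column_view.hpp>
#include <cstdio>

// Hypothetical helper: a column can carry a validity mask (nullable schema)
// while null_count() is still zero, i.e. the data itself has no nulls.
void report_nulls(cudf::column_view const& col)
{
  auto const lists = cudf::lists_column_view{col};
  std::printf("parent nullable=%d null_count=%d child null_count=%d\n",
              col.nullable() ? 1 : 0,
              static_cast<int>(col.null_count()),
              static_cast<int>(lists.child().null_count()));
}
```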
@davidwendt From the old code:
So there are 13 params. In the new code:
And the number of params is just 11. Could this be the issue? If not, should we test switching back to the old radix sort path?
Looks like there's a bug in cub::DeviceSegmentedSort::SortPairs, at least for boolean values. I added printing of the indices before and after the cub calls, and here's the debug output. The output shows the incoming column's offsets and data, along with the indices before and after. Note that one of the indices towards the end becomes negative, which explains the bad behavior of the gather call later.
I also tried this debug build without Spark, using the data I captured as a Parquet file (same data as in the log). There, the same output index slot that became negative becomes zero instead and duplicates another entry in the gather map that is zero (i.e., it's a bad but not invalid gather map). So we should be able to construct a test using the input values shown in the log above and show that the cub call is generating a resulting index/gather map that contains duplicated or invalid keys when the input did not have any duplicates. @davidwendt volunteered to write the C++ repro and see if that is able to show the issue.
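A minimal sketch of the kind of check described above, assuming the sort output has already been copied back to the host as a vector of row indices:

```cpp
#include <cstdio>
#include <unordered_set>
#include <vector>

// Hypothetical check: a gather map produced by sorting a column of num_rows rows
// should be a permutation of [0, num_rows): no negative, out-of-range, or
// duplicated indices.
bool is_valid_gather_map(std::vector<int> const& indices, int num_rows)
{
  if (static_cast<int>(indices.size()) != num_rows) { return false; }
  std::unordered_set<int> seen;
  for (int v : indices) {
    if (v < 0 || v >= num_rows) { std::printf("invalid index %d\n", v); return false; }
    if (!seen.insert(v).second) { std::printf("duplicated index %d\n", v); return false; }
  }
  return true;
}
```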
Error appears to occur with
Output:
Garbage value 32724 at position 33
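For anyone following along, here is a hedged sketch of the shape such a standalone repro can take: call cub::DeviceSegmentedSort::SortPairs directly on bool keys with an identity index column as the values, then inspect the output indices. The segment layout and key values below are illustrative only, not the data from the failing test:

```cpp
#include <cub/device/device_segmented_sort.cuh>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
  // A single segment of bool keys; sizes and values are illustrative.
  int const num_items    = 64;
  int const num_segments = 1;
  std::vector<int> const h_offsets{0, num_items};

  thrust::device_vector<bool> keys_in(num_items, false);
  thrust::device_vector<bool> keys_out(num_items);
  thrust::device_vector<int>  values_in(num_items);
  thrust::device_vector<int>  values_out(num_items);
  thrust::device_vector<int>  offsets(h_offsets);
  thrust::sequence(values_in.begin(), values_in.end());  // 0,1,2,... acts as the gather map

  auto keys_in_ptr    = thrust::raw_pointer_cast(keys_in.data());
  auto keys_out_ptr   = thrust::raw_pointer_cast(keys_out.data());
  auto values_in_ptr  = thrust::raw_pointer_cast(values_in.data());
  auto values_out_ptr = thrust::raw_pointer_cast(values_out.data());
  auto offsets_ptr    = thrust::raw_pointer_cast(offsets.data());

  // First call only computes the temporary storage size; the second call sorts.
  std::size_t temp_bytes = 0;
  cub::DeviceSegmentedSort::SortPairs(nullptr, temp_bytes,
                                      keys_in_ptr, keys_out_ptr, values_in_ptr, values_out_ptr,
                                      num_items, num_segments, offsets_ptr, offsets_ptr + 1);
  thrust::device_vector<char> temp(temp_bytes);
  cub::DeviceSegmentedSort::SortPairs(thrust::raw_pointer_cast(temp.data()), temp_bytes,
                                      keys_in_ptr, keys_out_ptr, values_in_ptr, values_out_ptr,
                                      num_items, num_segments, offsets_ptr, offsets_ptr + 1);

  // Any negative, out-of-range, or duplicated value here means the result is not
  // a valid gather map for the input.
  thrust::host_vector<int> h_out = values_out;
  for (int i = 0; i < num_items; ++i) { std::printf("%d ", h_out[i]); }
  std::printf("\n");
  return 0;
}
```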
I'm so glad that we can identify and catch the bug so quickly 👍. Thanks :)
Created CUB issue here: NVIDIA/cub#594 |
Verified that adding a
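For illustration only, a minimal sketch of one shape such a workaround could take, assuming the idea is to copy the bool keys into uint8_t storage so the CUB segmented sort never sees a bool key type; this is an assumption, not necessarily the change that was actually verified or merged:

```cpp
#include <cub/device/device_segmented_sort.cuh>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <cstddef>
#include <cstdint>

// Hypothetical workaround helper: sort bool keys as uint8_t (0/1) and return the
// sorted indices through `values`.
void segmented_sort_bool_as_uint8(thrust::device_vector<bool> const& keys,
                                  thrust::device_vector<int>& values,
                                  thrust::device_vector<int> const& offsets,
                                  int num_segments)
{
  int const num_items = static_cast<int>(keys.size());

  // Widen the bool keys to uint8_t without changing their order or values.
  thrust::device_vector<std::uint8_t> keys_u8(num_items);
  thrust::copy(keys.begin(), keys.end(), keys_u8.begin());

  thrust::device_vector<std::uint8_t> keys_out(num_items);
  thrust::device_vector<int> values_out(num_items);
  auto const offs = thrust::raw_pointer_cast(offsets.data());

  std::size_t temp_bytes = 0;
  cub::DeviceSegmentedSort::SortPairs(nullptr, temp_bytes,
                                      thrust::raw_pointer_cast(keys_u8.data()),
                                      thrust::raw_pointer_cast(keys_out.data()),
                                      thrust::raw_pointer_cast(values.data()),
                                      thrust::raw_pointer_cast(values_out.data()),
                                      num_items, num_segments, offs, offs + 1);
  thrust::device_vector<char> temp(temp_bytes);
  cub::DeviceSegmentedSort::SortPairs(thrust::raw_pointer_cast(temp.data()), temp_bytes,
                                      thrust::raw_pointer_cast(keys_u8.data()),
                                      thrust::raw_pointer_cast(keys_out.data()),
                                      thrust::raw_pointer_cast(values.data()),
                                      thrust::raw_pointer_cast(values_out.data()),
                                      num_items, num_segments, offs, offs + 1);
  values = values_out;
}
```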
The CUB issue is specifically due to an error in handling the |
Closed by #12217 |
Describe the bug
We have recently seen memory corruption and GPU illegal address crashes in the RAPIDS Accelerator for Apache Spark 22.12 nightly test runs since Nov 16. It appears to be related to sorting lists. See NVIDIA/spark-rapids#7092 (comment) for an illegal access captured by compute-sanitizer.
Steps/Code to reproduce bug
So far we can only sometimes reproduce this by running a large suite of RAPIDS Accelerator integration tests. I'm working on trying to narrow this down to something that only requires cudf to reproduce, but I wanted to file this early to raise awareness.
Expected behavior
Tests should pass
Environment overview (please complete the following information)
RAPIDS Accelerator for Apache Spark 22.12.0-SNAPSHOT on Apache Spark 3.1.2 using spark-rapids-jni-22.12.0-SNAPSHOT based on cudf 22.12.0