[BUG] Suspected memory corruption with regexp calls #11768
Comments
I'd like to request a C++ reproducer if possible.
I'm not sure how NVIDIA/spark-rapids#6431 is converted to libcudf calls. But converting 1 billion long values to a strings column would require almost 9 GB of character bytes. So casting the longs to strings is perhaps where the corruption is occurring.
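The "almost 9 GB" estimate can be checked with a little arithmetic. The sketch below assumes the longs are the non-negative values 0 through 999,999,999 printed in decimal with no sign or padding; the issue does not state the actual values, so this is only an illustration. The helper name `total_decimal_chars` is hypothetical and is not part of libcudf.

```cpp
#include <cstdint>

// Hypothetical helper (not a libcudf API): total number of decimal characters
// needed to print every integer in [0, n), one character per digit.
// Assumes n <= 10^18 so that the band boundary `hi` cannot overflow.
std::uint64_t total_decimal_chars(std::uint64_t n)
{
  std::uint64_t total = 0;
  std::uint64_t lo    = 0;   // first value in the current digit-count band
  std::uint64_t hi    = 10;  // one past the last value in the current band
  for (unsigned digits = 1; lo < n; ++digits) {
    std::uint64_t const end = hi < n ? hi : n;  // clip the band to [0, n)
    total += (end - lo) * digits;               // every value in the band has `digits` chars
    lo = hi;
    hi *= 10;
  }
  return total;
}
```

Under that assumption, `total_decimal_chars(1000000000)` gives 8,888,888,890 characters, which is in line with the figure quoted above. Negative or larger values would need more bytes per row.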
The data is split across six partitions in this case, with each partition containing ~166 million rows. The config allows two concurrent tasks on the GPU. Setting Here is debug output from one run:
Here is another run of the same query
We are working on a C++ repro case.
Fixes an out-of-bounds write error when a large number of strings requires a strided loop to meet an internal memory maximum.
For row counts that do not require strided loops, the row index never exceeds the size of the column, which prevents any out-of-bounds access. For large row counts, the CUDA thread index may be larger than the minimal count used for building the working-memory buffer. Since the kernel is launched with a thread count rounded up to a whole number of blocks of a specific block size, extra threads past the end of the minimal count are necessary to fill out the last block. These threads never contribute to the overall result but will attempt to access past the end of the working memory. Writing to this memory may corrupt memory for another kernel launched in parallel from another CPU thread.
This change adds logic to prevent the extra threads from doing any work.
Fixes #11768
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- MithunR (https://github.com/mythrocks)
- Nghia Truong (https://github.com/ttnghia)
- Mike Wilson (https://github.com/hyperbolic2346)
URL: #11797
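The failure mode described in the fix can be modeled on the host without a GPU. The following is a minimal sketch, not the libcudf kernel or the actual patch: `simulate_launch`, `LaunchResult`, and all parameter names are hypothetical, and the stride is assumed to equal the number of working-memory slots. It counts how many accesses the padding threads in the last block would make past the end of the working buffer, with and without a guard that makes them exit early.

```cpp
#include <cstddef>

// Hypothetical host-side model of a strided kernel launch (not libcudf code).
struct LaunchResult {
  std::size_t out_of_bounds_accesses;  // accesses to slots past the end of working memory
  std::size_t rows_processed;          // rows handled by threads that own a valid slot
};

// rows          : number of rows in the column
// working_slots : number of per-thread working-memory slots (the "minimal count")
// block_size    : threads per block; the launch is rounded up to whole blocks
// guard         : when true, threads with no working slot do no work
LaunchResult simulate_launch(std::size_t rows,
                             std::size_t working_slots,
                             std::size_t block_size,
                             bool guard)
{
  std::size_t const blocks   = (working_slots + block_size - 1) / block_size;
  std::size_t const launched = blocks * block_size;  // may exceed working_slots

  LaunchResult result{0, 0};
  for (std::size_t tid = 0; tid < launched; ++tid) {
    if (guard && tid >= working_slots) { continue; }  // the guard: padding threads exit
    // Strided loop: each thread visits rows tid, tid + working_slots, ...
    for (std::size_t row = tid; row < rows; row += working_slots) {
      if (tid < working_slots) {
        ++result.rows_processed;          // thread uses its own valid slot
      } else {
        ++result.out_of_bounds_accesses;  // slot index tid is past the buffer end
      }
    }
  }
  return result;
}
```

With 1000 rows, 100 slots, and a block size of 64, the model launches 128 threads; the 28 padding threads each run 9 loop iterations, so an unguarded launch records 252 out-of-bounds accesses while a guarded one records none, and both process all 1000 rows. When `rows == working_slots`, no strided loop is needed and the padding threads never enter the loop, which matches the observation that small columns are unaffected.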
Describe the bug
We are seeing some Spark queries involving regexp calls produce different results on each run, and have established that `matchesRe` is producing incorrect results on the GPU in a non-deterministic way. This only happens with large input columns.
Steps/Code to reproduce bug
The tracking issue is NVIDIA/spark-rapids#6431. I am working on a simple repro case that does not involve Spark but have not succeeded so far.
Expected behavior
Calls to `matchesRe` should produce the same results each time (given the same input).
Environment overview (please complete the following information)
Bare metal. Workstation with RTX 6000.
Additional context
None