[BUG] nightly ai.rapids.cudf.ReductionTest failed in cuda12 ENV after enable sanitizer #1349
A soft reminder, if it is not easy to fix in a short time, you can tag the failing tests with
This implies that either we are not leaving enough memory reserved for the sanitizer to run (e.g., by somehow using the ARENA allocator), or we are sharing a GPU and there is not enough free memory to run with the sanitizer present.
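To make the "not enough memory reserved" hypothesis concrete, here is a minimal sketch of sizing a memory pool so that a fixed reserve stays unallocated for the sanitizer. Everything in it is an assumption for illustration: the class name `SanitizerHeadroom`, the helper `poolSizeBytes`, the 16 GiB / 2 GiB figures, and the 256-byte alignment are hypothetical and are not taken from the cudf or RMM test setup. Note that, per a later comment in this thread, bumping up the memory pool did not resolve the failure, so this only illustrates the hypothesis and is not a fix.

```java
public final class SanitizerHeadroom {
    private SanitizerHeadroom() {}

    /**
     * Hypothetical helper: returns a pool size in bytes that leaves
     * reserveBytes of the currently free GPU memory unallocated,
     * rounded down to a multiple of alignment. Returns 0 when the
     * reserve already exceeds the free memory.
     */
    static long poolSizeBytes(long freeBytes, long reserveBytes, long alignment) {
        if (freeBytes < 0 || reserveBytes < 0 || alignment <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and alignment positive");
        }
        long usable = freeBytes - reserveBytes;
        if (usable <= 0) {
            return 0L; // nothing left for the pool once the reserve is set aside
        }
        return (usable / alignment) * alignment; // round down to the alignment
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        long free = 16 * gib;   // assumed free GPU memory, not a measured value
        long reserve = 2 * gib; // assumed headroom for the sanitizer
        System.out.println("pool bytes: " + poolSizeBytes(free, reserve, 256));
    }
}
```

In a real test setup the free-memory figure would come from the CUDA runtime rather than a constant, and the resulting value would be passed to whatever pool initialization the tests already use.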
Confirmed: this is always reproducible with CUDA 12 (CUDA toolkit 12.0.1 + driver 525.58) in our nightly CI. Please help check whether there is a quick fix (such as reducing the GPU memory cost of specific cases or adjusting the RMM pool size in the test), or just tag
I tried bumping up the memory pool, but it still fails with an unspecified launch error. I'll post a cudf PR to add the noSanitizer tag to the failing tests.
… CUDA 12 (#13904)
Relates to NVIDIA/spark-rapids-jni#1349. The Java ReductionTest unit tests are failing when run under CUDA 12's compute-sanitizer but pass when run with the CUDA 11 version. To unblock CI, marking the affected tests to be run without the sanitizer in the interim while this is being investigated.
Authors:
- Jason Lowe (https://github.com/jlowe)
Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Gera Shegalov (https://github.com/gerashegalov)
URL: #13904
Can you confirm what CCCL version was used in these builds? Is it the CCCL shipped with the CTK, or the CCCL (thrust/cub/libcudacxx) pinned by rapids-cmake?
Hi @bdice, the CTK in our CI came from the official nvidia/cuda:12.0.1-devel-centos7 image, and the libraries (thrust/cub/libcudacxx) should be pulled by rapids-cmake while building cudf.
Reproduced the Sanitizer error on CUDA 12 with customized code, the If I disabled the sanitizer running of The sanitizer error is:
Above error shows:
Seems the @bdice Do you have time to take a look?
@res-life It would be good if you could share the repro case and code.
My steps to reproduce are:
The
You can get the
I have asked the cuDF team for help investigating here since I may not have enough time to look at this during 23.10 burndown. If you can create a pure C++ reproducer and file a PR to libcudf with the failing test, that would be great.
Reproduced with C++ code.
Describe the bug
The same tests pass with the sanitizer on CUDA 11, but not on CUDA 12.
Pipeline: spark-rapids-jni_nightly-dev, build ID: 512
Attached sanitizer log: sanitizer_for_pid_20785.log