Improve cudf::cuda_error #10630
Signed-off-by: sperlingxx <[email protected]>
I don't know how to produce a sticky error in the unit test.
Codecov Report
@@             Coverage Diff              @@
##           branch-22.06   #10630   +/-   ##
================================================
+ Coverage       86.33%    86.36%   +0.02%
================================================
  Files             140       142       +2
  Lines           22289     22352      +63
================================================
+ Hits            19244     19305      +61
- Misses           3045      3047       +2
#define CUDF_CUDA_TRY(call) \
  do { \
    cudaError_t const status = (call); \
    if (cudaSuccess != status) { cudf::detail::throw_cuda_error(status, __FILE__, __LINE__); } \
The semicolon after the `while(0)` should be removed to ensure all uses of the `CUDF_CUDA_TRY` macro are terminated with a semicolon.
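To illustrate the point about the trailing semicolon, here is a minimal sketch that builds without the CUDA toolkit. `status_t`, `fake_call`, `TRY_SKETCH`, and `run` are hypothetical stand-ins (for `cudaError_t`, a CUDA runtime call, `CUDF_CUDA_TRY`, and a caller) and are not part of libcudf. Because the macro body ends in `while (0)` with no semicolon, each use has to supply its own, and the macro then behaves as a single statement inside an unbraced `if`/`else`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudaError_t so the sketch builds without CUDA.
enum class status_t { success = 0, failure = 1 };

// Hypothetical stand-in for a CUDA runtime call.
inline status_t fake_call(bool ok) { return ok ? status_t::success : status_t::failure; }

// No semicolon after `while (0)`: the caller writes it, so the macro
// expands to exactly one statement.
#define TRY_SKETCH(call)                                                   \
  do {                                                                     \
    status_t const status = (call);                                        \
    if (status_t::success != status) {                                     \
      throw std::runtime_error(std::string{"error at "} + __FILE__ + ":" + \
                               std::to_string(__LINE__));                  \
    }                                                                      \
  } while (0)

// If the semicolon were baked into the macro, `TRY_SKETCH(...);` would expand
// to two statements and the `else` below would no longer compile.
inline int run(bool check, bool ok)
{
  if (check)
    TRY_SKETCH(fake_call(ok));
  else
    return -1;
  return 0;
}
```

Calling `run(true, false)` throws `std::runtime_error`, while `run(true, true)` and `run(false, false)` return normally.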
Approving assuming the last minor issues are resolved.
Hm, I think calling |
Co-authored-by: Jake Hemstad <[email protected]>
@gpucibot merge
This PR is for NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870, and enables cuDF JNI to throw CUDA errors with a specific error code. It relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others. With this improvement, CUDA errors triggered by JVM APIs should be easier to track.
Authors:
- Alfred Xu (https://github.com/sperlingxx)
Approvers:
- Jason Lowe (https://github.com/jlowe)
URL: #10551
This PR is a follow-up to #10630 that improves the capture of fatal CUDA errors in libcudf and the cudf Java package.
1. libcudf: Removes the redundant call to `cudaGetLastError` in `throw_cuda_error`, since the call that returned the CUDA error can be deemed the first call.
2. JNI: Uses similar logic to discern fatal CUDA errors from caught exceptions. The check at the JNI level is necessary because fatal CUDA errors raised by rmm APIs cannot be distinguished.
3. Adds a C++ unit test for the capture of fatal CUDA errors.
4. Adds a Java unit test for the capture of fatal CUDA errors.
Authors:
- Alfred Xu (https://github.com/sperlingxx)
Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Jason Lowe (https://github.com/jlowe)
URL: #10884
Closes #10553
Improves `cudf::cuda_error` in two aspects:
1. Adds the error code to `cudf::cuda_error`, with a corresponding `error_code()` function that returns the error code.
2. Distinguishes `cudf::cuda_error` as `sticky_cuda_error` and `cudart_error`. `sticky_cuda_error` refers to a fatal error on the device.
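The two aspects above might look roughly like the following sketch. It is an illustration under stated assumptions, not the code in this PR: `error_code_t` is a hypothetical stand-in for `cudaError_t` so the example builds without CUDA, the `sketch` namespace and the `describe` helper are invented for the example, and deriving both error types from the base exception is assumed rather than confirmed by the description.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudaError_t so the sketch builds without CUDA.
using error_code_t = int;

namespace sketch {

// Base exception that carries the error code alongside the message.
struct cuda_error : public std::runtime_error {
  cuda_error(std::string const& message, error_code_t error)
    : std::runtime_error(message), _error{error}
  {
  }

  // Returns the error code that triggered the exception.
  error_code_t error_code() const { return _error; }

 private:
  error_code_t _error;
};

// Ordinary runtime error (assumed here to derive from cuda_error).
struct cudart_error : public cuda_error {
  using cuda_error::cuda_error;
};

// Fatal ("sticky") error on the device (assumed here to derive from cuda_error).
struct sticky_cuda_error : public cuda_error {
  using cuda_error::cuda_error;
};

}  // namespace sketch

// Hypothetical helper: callers can branch on the exception type and still
// read the error code through the shared base class.
inline std::string describe(bool fatal, error_code_t code)
{
  try {
    if (fatal) { throw sketch::sticky_cuda_error("fatal", code); }
    throw sketch::cudart_error("recoverable", code);
  } catch (sketch::sticky_cuda_error const& e) {
    return "sticky:" + std::to_string(e.error_code());
  } catch (sketch::cuda_error const& e) {
    return "cudart:" + std::to_string(e.error_code());
  }
}
```

Placing the more specific `sticky_cuda_error` handler first lets a caller treat fatal errors differently, while any other `cuda_error` still falls through to the general handler.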