Improve cudf::cuda_error #10630

sperlingxx · 2022-04-11T09:57:26Z

Closes #10553

Improves cudf::cuda_error in two respects:

1. Adds a cudaError_t member to cudf::cuda_error and a corresponding error_code() function that returns the error code.
2. Splits cudf::cuda_error into sticky_cuda_error and cudart_error. sticky_cuda_error represents a fatal error on the device.
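
A minimal sketch of the exception layout described above, not the code merged in this PR. The class names follow the description, while the `sketch` namespace, the `describe` helper, and the stand-in `cudaError_t` enum are assumptions added so the example compiles without the CUDA toolkit.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for the CUDA runtime's cudaError_t (assumption): real code would
// include <cuda_runtime_api.h> instead of declaring this enum.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidValue = 1, cudaErrorIllegalAddress = 700 };

namespace sketch {

// Base exception: carries the CUDA error code alongside the message.
struct cuda_error : public std::runtime_error {
  cuda_error(std::string const& message, cudaError_t error)
    : std::runtime_error(message), _error(error)
  {
  }

  // Returns the error code that triggered the exception.
  cudaError_t error_code() const noexcept { return _error; }

 private:
  cudaError_t _error;
};

// Error reported by a CUDA runtime call that leaves the device usable.
struct cudart_error : public cuda_error {
  using cuda_error::cuda_error;
};

// Fatal ("sticky") error on the device.
struct sticky_cuda_error : public cuda_error {
  using cuda_error::cuda_error;
};

// Hypothetical helper: shows how a caller can branch on the exception type
// and read the error code.
inline std::string describe(cuda_error const& e)
{
  auto const code = std::to_string(static_cast<int>(e.error_code()));
  if (dynamic_cast<sticky_cuda_error const*>(&e) != nullptr) { return "fatal: " + code; }
  return "recoverable: " + code;
}

}  // namespace sketch
```

Because both derived types share the `cuda_error` base, existing handlers that catch `cuda_error` keep working, and callers that care about fatal errors can catch the more specific type first.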

Signed-off-by: sperlingxx <[email protected]>

sperlingxx · 2022-04-11T09:59:23Z

I don't know how to produce a sticky error in the unit test.

codecov · 2022-04-11T12:52:41Z

Codecov Report

Merging #10630 (53156c9) into branch-22.06 (bf4ffc9) will increase coverage by 0.02%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10630      +/-   ##
================================================
+ Coverage         86.33%   86.36%   +0.02%     
================================================
  Files               140      142       +2     
  Lines             22289    22352      +63     
================================================
+ Hits              19244    19305      +61     
- Misses             3045     3047       +2

Impacted Files	Coverage Δ
python/cudf/cudf/core/frame.py	`93.67% <0.00%> (-1.09%)`	⬇️
python/dask_cudf/dask_cudf/tests/test_binops.py	`92.00% <0.00%> (-0.60%)`	⬇️
python/dask_cudf/dask_cudf/core.py	`73.36% <0.00%> (-0.27%)`	⬇️
python/cudf/cudf/core/series.py	`95.28% <0.00%> (ø)`
python/cudf/cudf/utils/ioutils.py	`79.60% <0.00%> (ø)`
python/dask_cudf/dask_cudf/backends.py	`86.44% <0.00%> (ø)`
python/dask_cudf/dask_cudf/tests/test_applymap.py	`100.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/tests/utils.py	`90.90% <0.00%> (ø)`
python/cudf/cudf/core/single_column_frame.py	`96.52% <0.00%> (+0.07%)`	⬆️
python/cudf/cudf/core/dataframe.py	`93.69% <0.00%> (+0.10%)`	⬆️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jrhemstad · 2022-04-13T14:52:59Z

cpp/include/cudf/utilities/error.hpp

+#define CUDF_CUDA_TRY(call)                                                                    \
+  do {                                                                                         \
+    cudaError_t const status = (call);                                                         \
+    if (cudaSuccess != status) { cudf::detail::throw_cuda_error(status, __FILE__, __LINE__); } \


The semicolon after the while(0) should be removed so that every use of the CUDF_CUDA_TRY macro has to be terminated with a semicolon.
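
To illustrate the point about the trailing semicolon, here is a small sketch using a hypothetical `SKETCH_TRY` macro with the same do/while(0) shape. `status_t`, `throw_error`, and `run` are stand-ins invented for the example, not libcudf names.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a status type such as cudaError_t.
enum status_t { status_ok = 0, status_failed = 1 };

// Hypothetical stand-in for a function that throws on a failed status.
[[noreturn]] inline void throw_error(status_t status, char const* file, int line)
{
  throw std::runtime_error(std::string{file} + ":" + std::to_string(line) + ": status " +
                           std::to_string(static_cast<int>(status)));
}

// No semicolon after `while (0)`: the caller supplies it, so the macro
// behaves like a single ordinary statement.
#define SKETCH_TRY(call)                                                  \
  do {                                                                    \
    status_t const status = (call);                                       \
    if (status_ok != status) { throw_error(status, __FILE__, __LINE__); } \
  } while (0)

inline int run(status_t first, status_t second, bool use_first)
{
  // If the macro ended in its own semicolon, `SKETCH_TRY(first);` would
  // expand to two statements and the `else` below would not compile.
  if (use_first)
    SKETCH_TRY(first);
  else
    SKETCH_TRY(second);
  return 0;
}
```

Leaving the semicolon out of the macro also means that a call site written without one fails to compile, which keeps usage uniform.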

jrhemstad

Approving assuming the last minor issues are resolved.

jrhemstad · 2022-04-13T22:59:23Z

> I don't know how to produce a sticky error in the unit test.

Hm, I think calling assert in device code should generate a sticky error.

Co-authored-by: Jake Hemstad <[email protected]>

sperlingxx · 2022-04-14T08:09:49Z

@gpucibot merge

This PR is for NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870, and enables cuDF JNI to throw CUDA errors with specific error codes. This PR relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others. With this improvement, it should be easier to track CUDA errors triggered by JVM APIs.

Authors:
- Alfred Xu (https://github.com/sperlingxx)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #10551

This PR is a follow-up to #10630 that improves the capture of fatal CUDA errors in libcudf and the cudf Java package.

1. libcudf: Removes the redundant call of `cudaGetLastError` in throw_cuda_error, since the call that returned the CUDA error can be deemed the first call.
2. JNI: Uses similar logic to discern fatal CUDA errors from caught exceptions. The check at the JNI level is necessary because fatal CUDA errors due to rmm APIs cannot be distinguished.
3. Adds a C++ unit test for the capture of fatal CUDA errors.
4. Adds a Java unit test for the capture of fatal CUDA errors.

Authors:
- Alfred Xu (https://github.com/sperlingxx)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Jason Lowe (https://github.com/jlowe)

URL: #10884
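
The fatal-error check mentioned for libcudf in the #10884 follow-up can be sketched roughly as below. This is not the libcudf implementation: `fake_error`, `fake_runtime`, `recoverable_error`, `fatal_error`, and `throw_classified` are all hypothetical stand-ins, and the fake runtime only models the assumption that a sticky error keeps being reported while an ordinary error is cleared once it has been read.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudaError_t.
enum fake_error { fake_success = 0, fake_invalid_value = 1, fake_illegal_address = 700 };

// Hypothetical model of the runtime behaviour the check relies on.
struct fake_runtime {
  fake_error sticky  = fake_success;  // set when the device context is corrupted
  fake_error pending = fake_success;  // last ordinary error, cleared on read

  // Plays the role of cudaGetLastError in this model.
  fake_error get_last_error()
  {
    if (sticky != fake_success) { return sticky; }
    fake_error const e = pending;
    pending            = fake_success;
    return e;
  }

  // Plays the role of a synchronizing call: only a sticky error persists.
  fake_error synchronize() const { return sticky; }
};

struct recoverable_error : std::runtime_error {
  using std::runtime_error::runtime_error;
};

struct fatal_error : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// `error` is the status returned by the failing call, which counts as the
// first observation, so no extra clearing call is made before the check.
[[noreturn]] inline void throw_classified(fake_runtime& rt, fake_error error)
{
  fake_error const last = rt.get_last_error();  // second observation
  std::string const msg = "error " + std::to_string(static_cast<int>(error));
  // The error is treated as fatal only if it is still reported after the read.
  if (error == last && last == rt.synchronize()) { throw fatal_error("fatal " + msg); }
  throw recoverable_error(msg);
}
```

The design choice here is to treat an error as fatal only when it survives being read, which is the property that separates a sticky error from an ordinary one in this model.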

sperlingxx added 2 commits April 11, 2022 17:29

enrich cuDF cuda_error

c8b9e6c

Signed-off-by: sperlingxx <[email protected]>

Merge remote-tracking branch 'origin/branch-22.06' into sticky_error

b85beee

sperlingxx requested a review from jrhemstad April 11, 2022 09:57

sperlingxx requested a review from a team as a code owner April 11, 2022 09:57

sperlingxx requested a review from davidwendt April 11, 2022 09:57

github-actions bot added the libcudf label Apr 11, 2022

sperlingxx added the cuda, libcudf, non-breaking, and improvement labels and removed the libcudf and cuda labels Apr 11, 2022

update year range

df095e3

karthikeyann approved these changes Apr 11, 2022

jrhemstad reviewed Apr 11, 2022

cpp/include/cudf/utilities/error.hpp (outdated)

jrhemstad reviewed Apr 11, 2022

cpp/include/cudf/utilities/error.hpp (outdated)

jrhemstad reviewed Apr 11, 2022

cpp/include/cudf/utilities/error.hpp (outdated)

jrhemstad reviewed Apr 11, 2022

cpp/include/cudf/utilities/error.hpp

sperlingxx added 2 commits April 12, 2022 14:32

update

9c61429

update

a4837fb

jrhemstad reviewed Apr 12, 2022

cpp/include/cudf/utilities/error.hpp (outdated)

sperlingxx added 2 commits April 13, 2022 16:53

add

51bf915

with JNI

4139239

sperlingxx requested a review from a team as a code owner April 13, 2022 09:12

github-actions bot added the Java label Apr 13, 2022

fix

3209291

sperlingxx removed the non-breaking label Apr 13, 2022

sperlingxx added the breaking label Apr 13, 2022

revert JNI

5a45016

sperlingxx added the non-breaking label and removed the breaking label Apr 13, 2022

revert JNI

56d00d7

github-actions bot removed the Java label Apr 13, 2022

sperlingxx mentioned this pull request Apr 13, 2022

JNI: throw CUDA errors more specifically #10551

Merged

jrhemstad reviewed Apr 13, 2022

cpp/include/cudf/utilities/error.hpp (outdated)

jrhemstad approved these changes Apr 13, 2022

sperlingxx and others added 2 commits April 14, 2022 09:55

Update cpp/include/cudf/utilities/error.hpp

2e5844c

Co-authored-by: Jake Hemstad <[email protected]>

fix

53156c9

rapids-bot bot merged commit 22a6679 into rapidsai:branch-22.06 Apr 14, 2022

sperlingxx deleted the sticky_error branch April 14, 2022 08:10

sperlingxx mentioned this pull request May 18, 2022

Improve the capture of fatal cuda error #10884

Merged
