Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve cudf::cuda_error #10630

Merged
merged 12 commits into from
Apr 14, 2022
Merged

Conversation

sperlingxx
Copy link
Contributor

Closes #10553

Improves cudf::cuda_error in two aspects:

  1. Add a cudaError_t member to cudf::cuda_error and corresponding error_code() function that returns the error code
  2. Breaks down cuda::cuda_error as sticky_cuda_error and cudart_error. sticky_cuda_error refers to fatal error on device.

@sperlingxx sperlingxx requested a review from jrhemstad April 11, 2022 09:57
@sperlingxx sperlingxx requested a review from a team as a code owner April 11, 2022 09:57
@sperlingxx sperlingxx requested a review from davidwendt April 11, 2022 09:57
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 11, 2022
@sperlingxx sperlingxx added cuda libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change improvement Improvement / enhancement to an existing function and removed libcudf Affects libcudf (C++/CUDA) code. cuda labels Apr 11, 2022
@sperlingxx
Copy link
Contributor Author

I don't know how to produce a sticky error in the unit test.

@codecov
Copy link

codecov bot commented Apr 11, 2022

Codecov Report

Merging #10630 (53156c9) into branch-22.06 (bf4ffc9) will increase coverage by 0.02%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10630      +/-   ##
================================================
+ Coverage         86.33%   86.36%   +0.02%     
================================================
  Files               140      142       +2     
  Lines             22289    22352      +63     
================================================
+ Hits              19244    19305      +61     
- Misses             3045     3047       +2     
Impacted Files Coverage Δ
python/cudf/cudf/core/frame.py 93.67% <0.00%> (-1.09%) ⬇️
python/dask_cudf/dask_cudf/tests/test_binops.py 92.00% <0.00%> (-0.60%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.36% <0.00%> (-0.27%) ⬇️
python/cudf/cudf/core/series.py 95.28% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 79.60% <0.00%> (ø)
python/dask_cudf/dask_cudf/backends.py 86.44% <0.00%> (ø)
python/dask_cudf/dask_cudf/tests/test_applymap.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/tests/utils.py 90.90% <0.00%> (ø)
python/cudf/cudf/core/single_column_frame.py 96.52% <0.00%> (+0.07%) ⬆️
python/cudf/cudf/core/dataframe.py 93.69% <0.00%> (+0.10%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf4ffc9...53156c9. Read the comment docs.

@sperlingxx sperlingxx requested a review from a team as a code owner April 13, 2022 09:12
@github-actions github-actions bot added the Java Affects Java cuDF API. label Apr 13, 2022
@sperlingxx sperlingxx removed the non-breaking Non-breaking change label Apr 13, 2022
@sperlingxx sperlingxx added the breaking Breaking change label Apr 13, 2022
@sperlingxx sperlingxx added non-breaking Non-breaking change and removed breaking Breaking change labels Apr 13, 2022
@github-actions github-actions bot removed the Java Affects Java cuDF API. label Apr 13, 2022
#define CUDF_CUDA_TRY(call) \
do { \
cudaError_t const status = (call); \
if (cudaSuccess != status) { cudf::detail::throw_cuda_error(status, __FILE__, __LINE__); } \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semicolon after the while(0) should be removed to ensure all uses of the CUDF_CUDA_TRY macro are terminated with a semi-colon.

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving assuming the last minor issues are resolved.

@jrhemstad
Copy link
Contributor

I don't know how to produce a sticky error in the unit test.

Hm, I think calling assert in device code should generate a sticky error.

@sperlingxx
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 22a6679 into rapidsai:branch-22.06 Apr 14, 2022
@sperlingxx sperlingxx deleted the sticky_error branch April 14, 2022 08:10
rapids-bot bot pushed a commit that referenced this pull request Apr 24, 2022
This PR is for NVIDIA/spark-rapids#5029  and NVIDIA/spark-rapids#1870, which enables cuDF JNI to throw CUDA errors with specific error code.  This PR relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others.

With this improvement, it is supposed to be easier to track CUDA errors triggered by JVM APIs.

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #10551
rapids-bot bot pushed a commit that referenced this pull request Jun 7, 2022
This PR is a follow-up PR of #10630, which is to improve the capture of fatal cuda errors in libcudf and cudf java package.

1. libcudf: Removes the redundent call of `cudaGetLastError` in throw_cuda_error, since the call returning the cuda error can be deemed as the first call.
2. JNI: Leverages similar logic to discern fatal cuda errors from catched exceptions. The check at the JNI level is necessary because fatal cuda errors due to rmm APIs can not be distinguished.
3. Add C++ unit test for the capture of fatal cuda error 
4. Add Java unit test for the capture of fatal cuda error

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Jason Lowe (https://github.com/jlowe)

URL: #10884
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Improvements to cudf::cuda_error
3 participants