JNI: throw CUDA errors more specifically #10551

sperlingxx · 2022-03-31T11:46:32Z

This PR addresses NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870 by enabling the cuDF JNI layer to throw CUDA errors that carry their specific error codes. It relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others.

With this improvement, it should be easier to track down CUDA errors triggered through the JVM APIs.

Signed-off-by: sperlingxx <[email protected]>

codecov · 2022-03-31T12:58:41Z

Codecov Report

Merging #10551 (50bfc2c) into branch-22.06 (65b1cbd) will increase coverage by 0.04%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10551      +/-   ##
================================================
+ Coverage         86.35%   86.39%   +0.04%     
================================================
  Files               142      142              
  Lines             22335    22302      -33     
================================================
- Hits              19287    19268      -19     
+ Misses             3048     3034      -14

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/column.py | `89.43% <0.00%> (-0.02%)` | ⬇️ |
| python/cudf/cudf/core/frame.py | `93.41% <0.00%> (ø)` | |
| python/cudf/cudf/core/series.py | `95.15% <0.00%> (ø)` | |
| python/cudf/cudf/core/dataframe.py | `93.75% <0.00%> (+<0.01%)` | ⬆️ |
| python/cudf/cudf/utils/utils.py | `90.35% <0.00%> (+0.06%)` | ⬆️ |
| python/cudf/cudf/core/column/string.py | `89.21% <0.00%> (+0.10%)` | ⬆️ |
| python/cudf/cudf/testing/_utils.py | `93.98% <0.00%> (+0.13%)` | ⬆️ |
| python/cudf/cudf/core/multiindex.py | `92.28% <0.00%> (+0.13%)` | ⬆️ |
| python/cudf/cudf/core/reshape.py | `90.00% <0.00%> (+0.17%)` | ⬆️ |
| python/cudf/cudf/core/column/categorical.py | `89.97% <0.00%> (+0.20%)` | ⬆️ |
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jrhemstad · 2022-03-31T13:23:26Z

java/src/main/native/include/jni_utils.hpp

      if (jt != NULL) {                                                                            \
        env->Throw(jt);                                                                            \
      }                                                                                            \
      return ret_val;                                                                              \
    }                                                                                              \
  }

+#define JNI_CUDA_CHECK(env, cuda_status)                                                           \


Best practice would be to wrap this in a `do { ... } while (0)`.

Changed. Thanks for the advice!
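
For readers unfamiliar with the idiom suggested above, here is a minimal sketch of why a multi-statement macro is usually wrapped in `do { ... } while (0)`. It is not the PR's actual `JNI_CUDA_CHECK`; the names `status_t`, `CHECK_STATUS`, and `run` are hypothetical, and a plain integer stands in for a CUDA status so the example builds without the CUDA toolkit.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a CUDA status code; 0 means success.
using status_t = int;

// Wrapping the body in do { ... } while (0) makes the macro expand to exactly
// one statement that still requires a trailing semicolon at the call site.
// A bare { ... } block followed by ';' would end the `if` statement early and
// leave the `else` in run() below without a matching `if`.
#define CHECK_STATUS(status)                                                   \
  do {                                                                         \
    status_t const s_ = (status);                                              \
    if (s_ != 0) {                                                             \
      throw std::runtime_error("status " + std::to_string(s_));                \
    }                                                                          \
  } while (0)

// Returns 1 when the check ran and passed, -1 when the check was skipped.
// Throws std::runtime_error when the check ran and the status was non-zero.
int run(bool enabled, status_t status) {
  if (enabled)
    CHECK_STATUS(status);  // expands to a single statement, so `else` binds here
  else
    return -1;
  return 1;
}
```

The macro also evaluates its argument only once by copying it into a local, which matters when the argument is a function call with side effects.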

jrhemstad · 2022-03-31T13:24:09Z

java/src/main/native/include/jni_utils.hpp

+  // Build the error message in the format of cudf::cuda_error, so that cudf::jni::CUDA_ERROR_CLASS
+  // can parse both of them.
+  std::string n_msg = "CUDA error encountered at: " + std::string{file} + ":" +


I wouldn't count on the contents of that exception message being stable.

I'd rather see cudf::cuda_error updated to allow the extraction of the CUDA error ID rather than relying on parsing. In general we should be moving away from string-scraping for error identification, not adding more instances of it.

Filed #10553
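
To illustrate the alternative to string-scraping requested above, here is a rough sketch of an exception type that carries its error code as data. This is an assumption about the general shape only, not the interface libcudf adopted; `coded_error`, `failing_operation`, and `code_from_failure` are hypothetical names, and a plain `int` stands in for the CUDA error type.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type (not libcudf's actual class): it stores the
// numeric error code next to the human-readable message.
class coded_error : public std::runtime_error {
 public:
  coded_error(std::string const& message, int code)
    : std::runtime_error(message), code_(code) {}

  // Structured accessor: callers never need to parse what().
  int error_code() const noexcept { return code_; }

 private:
  int code_;
};

// Hypothetical failing operation; it throws with whatever code it is given.
inline void failing_operation(int code) {
  throw coded_error("error encountered at: example.cpp:1", code);
}

// Recovers the code at a catch site, much as a JNI boundary might before it
// builds the Java exception. Returns 0 when nothing was thrown.
inline int code_from_failure(int code) {
  try {
    if (code != 0) { failing_operation(code); }
  } catch (coded_error const& e) {
    return e.error_code();
  }
  return 0;
}
```

With this shape, the wording of the message can change freely without breaking any caller that needs to identify the error.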

jrhemstad · 2022-03-31T13:26:22Z

java/src/main/java/ai/rapids/cudf/CudaException.java

+    private static final Set<CudaError> stickyErrors = new HashSet<CudaError>(){{
+      add(CudaError.cudaErrorIllegalAddress);
+      add(CudaError.cudaErrorLaunchTimeout);
+      add(CudaError.cudaErrorHardwareStackError);
+      add(CudaError.cudaErrorIllegalInstruction);
+      add(CudaError.cudaErrorMisalignedAddress);
+      add(CudaError.cudaErrorInvalidAddressSpace);
+      add(CudaError.cudaErrorInvalidPc);
+      add(CudaError.cudaErrorLaunchFailure);
+      add(CudaError.cudaErrorExternalDevice);
+      add(CudaError.cudaErrorUnknown);
+    }};


This isn't very robust. I described a more robust way to detect sticky errors here: #10200 (comment)

Soon I hope to have libcudf throw a separate exception type for sticky errors.

+1, it would be good to align this with what's going on in that other issue.

I totally agree with @jrhemstad as well. So, shall we put the JNI-side work on hold until CUDA error handling in libcudf has been enhanced?

The RAPIDS Accelerator is already addressing this in the short term in NVIDIA/spark-rapids#5118. Therefore I'd rather we take the time here to leverage a proper interface in libcudf than rush this in and then need to change it when libcudf refines its exception handling soon afterwards.

Hi @jlowe, I reworked the PR. For now, it pushes down the sticky error detection to libcudf.

However, I am stuck on how to trigger a fatal CUDA error through the unit test.

jlowe

As it is now, this needs to be marked as a breaking change, since we're either renaming or removing interfaces in `java/src/main/native/include/jni_utils.hpp`. That header is used in other projects, like cuspatial. It would be nice if we didn't change too much here and break things unnecessarily.

sperlingxx · 2022-04-24T01:53:00Z

@gpucibot merge

sperlingxx added 6 commits March 31, 2022 16:12

init

7d2f1b2

Signed-off-by: sperlingxx <[email protected]>

fix

2755b44

fix

d0d7f30

update

8cc58a4

fix

eced2c3

update

fa1ae8a

sperlingxx added the Java, 4 - Needs cuDF (Java) Reviewer, improvement, and non-breaking labels Mar 31, 2022

sperlingxx requested a review from jlowe March 31, 2022 11:46

sperlingxx requested a review from a team as a code owner March 31, 2022 11:46

jrhemstad reviewed Mar 31, 2022

jlowe reviewed Mar 31, 2022

sperlingxx added 2 commits April 1, 2022 12:00

refine

877a165

fix

3fb086f

sperlingxx added 3 commits April 12, 2022 16:19

update

71238e1

Signed-off-by: sperlingxx <[email protected]>

Merge branch 'break_down_catch_std' of https://github.com/sperlingxx/…

6b38d6f

…cudf into break_down_catch_std

update

c242002

sperlingxx changed the title ~~JNI: throw CUDA errors more specificially~~ JNI: throw CUDA errors more specifically Apr 13, 2022

jlowe reviewed Apr 13, 2022

sperlingxx added 5 commits April 14, 2022 16:59

update

2e625af

Merge remote-tracking branch 'origin/branch-22.06' into break_down_ca…

d35a11b

…tch_std

update

f62961f

fix

fc5672d

fix

1594453

sperlingxx added 2 commits April 14, 2022 19:54

fix

c11f8bb

update

5f991a1

jlowe reviewed Apr 15, 2022

sperlingxx added 2 commits April 18, 2022 13:49

refine

65633bc

update

5bc9e6e

jlowe reviewed Apr 18, 2022

update

0b54829

jlowe reviewed Apr 19, 2022

sperlingxx added 3 commits April 20, 2022 15:34

Merge remote-tracking branch 'origin/branch-22.06' into break_down_ca…

8ec4271

…tch_std

update

7c8a34a

update

5cb1977

jlowe reviewed Apr 20, 2022

jlowe reviewed Apr 20, 2022

sperlingxx added 2 commits April 21, 2022 10:52

update

6ba9190

update

16c5e70

jlowe reviewed Apr 21, 2022

sperlingxx and others added 3 commits April 22, 2022 09:54

Update java/src/main/native/include/jni_utils.hpp

907c67e

Co-authored-by: Jason Lowe <[email protected]>

Update java/src/main/native/include/jni_utils.hpp

e313a2a

Co-authored-by: Jason Lowe <[email protected]>

fix

50bfc2c

jlowe approved these changes Apr 22, 2022

rapids-bot bot merged commit ae7e979 into rapidsai:branch-22.06 Apr 24, 2022

sperlingxx deleted the break_down_catch_std branch April 24, 2022 01:53

vyasr added the 4 - Needs Review label and removed the 4 - Needs cuDF (Java) Reviewer label Feb 23, 2024
