
Improve the capture of fatal cuda error #10884

Merged
merged 12 commits into from
Jun 7, 2022

Conversation

sperlingxx
Contributor

This PR is a follow-up to #10630, improving the capture of fatal CUDA errors in libcudf and the cuDF Java package.

  1. libcudf: Removes the redundant call of cudaGetLastError in throw_cuda_error, since the call that returned the CUDA error can be regarded as the first call.
  2. JNI: Leverages similar logic to discern fatal CUDA errors from caught exceptions. The check at the JNI level is necessary because fatal CUDA errors raised through RMM APIs cannot otherwise be distinguished.
  3. Adds a C++ unit test for the capture of fatal CUDA errors.
  4. Adds a Java unit test for the capture of fatal CUDA errors.

Signed-off-by: sperlingxx <[email protected]>
@sperlingxx sperlingxx requested review from jlowe and jrhemstad May 18, 2022 06:27
@sperlingxx sperlingxx requested review from a team as code owners May 18, 2022 06:27
@sperlingxx sperlingxx requested a review from vyasr May 18, 2022 06:27
@github-actions github-actions bot added Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels May 18, 2022
@sperlingxx sperlingxx added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 18, 2022
@codecov

codecov bot commented May 18, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@ad00e44).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #10884   +/-   ##
===============================================
  Coverage                ?   86.32%           
===============================================
  Files                   ?      144           
  Lines                   ?    22696           
  Branches                ?        0           
===============================================
  Hits                    ?    19593           
  Misses                  ?     3103           
  Partials                ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 117 to 119
// Calls cudaGetLastError again. It is nearly certain that a fatal error occurred if the second
// call doesn't return with cudaSuccess.
cudaGetLastError();
auto const last = cudaGetLastError();
Contributor

Wait, what? The two calls are necessary to detect a fatal error vs a non-fatal. The first call clears any pending error state. If the second call still sees an error, then it's extremely likely that a sticky error has occurred.
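To make the two-call pattern under discussion concrete, here is a minimal, runnable sketch with the CUDA runtime mocked out so it needs no GPU. All names here (mock_cudaGetLastError, g_sticky, the exception types) are illustrative stand-ins, not the real libcudf or CUDA symbols; a sticky (fatal) error is modeled as one that a read of the last-error slot never clears.

```cpp
#include <cassert>
#include <stdexcept>

// Mocked stand-ins for the CUDA runtime, so the classification logic can be
// shown (and run) without a GPU.
enum cuda_error_t { cudaSuccess, cudaErrorIllegalAddress };

// Simulated per-thread error state: a sticky (fatal) error never clears.
static cuda_error_t g_last_error = cudaSuccess;
static bool g_sticky            = false;

cuda_error_t mock_cudaGetLastError()
{
  cuda_error_t const err = g_last_error;
  if (!g_sticky) { g_last_error = cudaSuccess; }  // non-sticky errors clear on read
  return err;
}

struct cuda_exception : std::runtime_error {
  using std::runtime_error::runtime_error;
};
struct fatal_cuda_error : cuda_exception {
  using cuda_exception::cuda_exception;
};

// The two-call idea: the first read clears any pending, recoverable error;
// if a second read still reports an error, the error is sticky and the
// context is presumed lost, so throw the fatal variant.
void throw_cuda_error()
{
  mock_cudaGetLastError();  // clears a pending, recoverable error
  cuda_error_t const last = mock_cudaGetLastError();
  if (last != cudaSuccess) { throw fatal_cuda_error{"fatal CUDA error"}; }
  throw cuda_exception{"recoverable CUDA error"};
}
```

Under this mock, a non-sticky error is consumed by the first read and classified as recoverable, while a sticky one survives both reads and is classified as fatal — which is exactly the distinction the two calls are meant to draw.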

Contributor Author

However, the fatal error test case only passes when I remove this line.

TEST(FatalCase, CudaFatalError)
{
  auto type = cudf::data_type{cudf::type_id::INT32};
  auto cv   = cudf::column_view(type, 256, (void*)256);
  cudf::binary_operation(cv, cv, cudf::binary_operator::ADD, type);
  EXPECT_THROW(CUDF_CUDA_TRY(cudaDeviceSynchronize()), cudf::fatal_cuda_error);
}

Contributor

Agree that this seems like a dubious change. What specifically fails with the test without this change? Does it throw too early, in cudf::binary_operation, or not throw at all? If the error is truly fatal, I don't see how removing a cudaGetLastError call is going to help this test pass. With a fatal error, we should be able to call cudaGetLastError as many times as we want, and it will never clear.

Contributor

Then that means the error being generated isn't a true sticky error, which is admittedly surprising. With the test you're doing here, I'd expect an illegal access error (which I'm pretty sure is sticky).

You could condense this down a bit by just launching a kernel like:

__global__ void fatal_kernel() {
    __assert_fail(nullptr,nullptr,0,nullptr);
}
...
TEST(FatalCase, CudaFatalError)
{
  fatal_kernel<<<1,1>>>();
  EXPECT_THROW(CUDF_CUDA_TRY(cudaDeviceSynchronize()), cudf::fatal_cuda_error);
}

Contributor

Though you'd need to do this in a death test because it will corrupt the process context and leave the GPU unusable. See

__global__ void assert_false_kernel() { cudf_assert(false && "this kernel should die"); }
__global__ void assert_true_kernel() { cudf_assert(true && "this kernel should live"); }

TEST(DebugAssertDeathTest, cudf_assert_false)
{
  testing::FLAGS_gtest_death_test_style = "threadsafe";
  auto call_kernel = []() {
    assert_false_kernel<<<1, 1>>>();
    // Kernel should fail with `cudaErrorAssert`
    // This error invalidates the current device context, so we need to kill
    // the current process. Running with EXPECT_DEATH spawns a new process for
    // each attempted kernel launch
    if (cudaErrorAssert == cudaDeviceSynchronize()) { std::abort(); }
    // If we reach this point, the cudf_assert didn't work, so we exit normally,
    // which will cause EXPECT_DEATH to fail.
  };
  EXPECT_DEATH(call_kernel(), "this kernel should die");
}
for example.

Contributor

> I tried with fatal_kernel, but it threw nothing.

I'm still not comfortable with removing a line of code without understanding why we're removing it. It may have helped your test case, but we need to understand how that was significant for that test (is it a problem with the test?) or how removing this will not create problems in other scenarios trying to detect fatal errors. If the CUDA error truly is fatal, it should not matter if we read the error an extra time. It should make it even more likely it truly is a fatal error if the error persists despite extra attempts at clearing it.

Contributor Author

@sperlingxx sperlingxx May 23, 2022

Hi @jlowe, according to the description in the CUDA doc ("Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess"), the cudaGetLastError API works like popping the top error off a CUDA error stack?

Contributor

> Hi @jlowe, according to the description in the CUDA doc ("Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess"), the cudaGetLastError API works like popping the top error off a CUDA error stack?

Yes, normally it clears the error, but there are categories of errors that are unclearable. These are the fatal errors we are trying to detect here. If you're finding that cudaGetLastError is able to clear an error then it seems that error is not actually a fatal error and we should not report it as such.

Contributor Author

@sperlingxx sperlingxx May 24, 2022

Hi @jrhemstad, thank you for the link. However, my simple test suggests that the second call of cudaGetLastError clears fatal errors as well, unless I'm misunderstanding something.

  // a valid CUDA call
  int* p0;
  EXPECT_EQ(cudaMalloc(&p0, 128), cudaSuccess);

  // produce an unrecoverable CUDA error: cudaErrorIllegalAddress
  auto type = cudf::data_type{cudf::type_id::INT32};
  auto cv   = cudf::column_view(type, 256, (void*)256);
  cudf::binary_operation(cv, cv, cudf::binary_operator::ADD, type);
  // wait for the illegal binary operation to finish, then capture the CUDA status
  EXPECT_EQ(cudaDeviceSynchronize(), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success

  // Any subsequent CUDA calls will fail, since the CUDA context has been corrupted.
  int* p1;
  EXPECT_EQ(cudaMalloc(&p1, 128), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success

  int* p2;
  EXPECT_EQ(cudaMalloc(&p2, 128), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success
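One way to read this experiment: the thread-local last-error slot itself is cleared by each read, but on a corrupted context every subsequent runtime call fails and re-arms the slot. A mocked, GPU-free sketch of that interpretation — all names here (mock_runtime_call, g_corrupted, ...) are hypothetical stand-ins, not real CUDA symbols:

```cpp
#include <cassert>

// Mocked CUDA-style error codes and state; no real CUDA is involved.
enum cuda_error_t { cudaSuccess, cudaErrorIllegalAddress };

// Once the context is "corrupted", every runtime call both fails and
// re-arms the thread-local last-error slot -- even though reading the
// slot always clears it.
static bool g_corrupted      = false;
static cuda_error_t g_last   = cudaSuccess;

cuda_error_t mock_runtime_call()  // stands in for cudaMalloc, etc.
{
  if (g_corrupted) {
    g_last = cudaErrorIllegalAddress;
    return cudaErrorIllegalAddress;
  }
  return cudaSuccess;
}

cuda_error_t mock_cudaGetLastError()
{
  cuda_error_t const err = g_last;
  g_last = cudaSuccess;  // the slot itself always clears on read
  return err;
}
```

Under this model, a second consecutive cudaGetLastError returning cudaSuccess (as observed above) is consistent with the context still being unusable, because the next runtime call fails and repopulates the slot again.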

@sperlingxx
Contributor Author

According to the result of @jrhemstad 's experiments, I used cudaFree(0) instead of cudaGetLastError to detect the fatal error.

- cudaGetLastError();
- auto const last = cudaGetLastError();
+ auto const last = cudaFree(0);
Contributor

@jlowe jlowe May 26, 2022

Does this end up doing a full device synchronize as normal cudaFree calls do? If it does, ideally we would want to find a CUDA call that can detect the error with minimal (ideally zero) synchronization with the device.

Contributor Author

Hi @jlowe, according to the CUDA doc, "If devPtr is 0, no operation is performed. cudaFree() returns cudaErrorValue in case of failure."

Contributor

If we're guaranteed this doesn't do anything slow like a synchronize it seems OK to me, but I'll defer to @jrhemstad's judgement on whether this is the best approach with the limited tools we have to detect this.

Contributor

> If devPtr is 0, no operation is performed.

lol, well that's just straight up a lie given that 99% of the world uses cudaFree(0) to force context initialization 🙃.

tbh, I've had my confidence shaken in the whole "sticky" error thing as a result of exploring this PR.

The right long term solution is that we'll need to file an RFE to get a deterministic, programmatic way to query when the context is borked.

In the meantime, cudaFree(0) seems about the least bad option available.
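The probe idea can be sketched with a mocked runtime — illustrative names only, under the assumption stated in this thread that cudaFree(0) reports an error once the context is corrupted, without consuming the thread-local last-error state the way cudaGetLastError does:

```cpp
#include <cassert>

// Mocked CUDA runtime state, so the probe idea can run without a GPU.
// In the real patch the probe is an actual cudaFree(0) call.
enum cuda_error_t { cudaSuccess, cudaErrorIllegalAddress };

static bool g_context_corrupted = false;

// cudaFree(nullptr) is documented as a no-op, but it still enters the
// driver, so (per the discussion above) it reports an error once the
// context is corrupted -- and, unlike cudaGetLastError, it does not
// clear any error state as a side effect.
cuda_error_t mock_cudaFree(void* devPtr)
{
  (void)devPtr;
  return g_context_corrupted ? cudaErrorIllegalAddress : cudaSuccess;
}

// The probe used as the fatal-error check: any error from the no-op free
// means the context is unusable.
bool is_fatal() { return mock_cudaFree(nullptr) != cudaSuccess; }
```

Because the probe has no side effect on the last-error slot in this model, it can be called any number of times without perturbing the state the rest of the error-handling code observes.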

Contributor Author

Shall we just merge this PR as a workaround for now?

Contributor

@jlowe jlowe left a comment

This needs to be retargeted to 22.08.


@sperlingxx sperlingxx requested review from a team as code owners May 30, 2022 05:35
@sperlingxx sperlingxx requested review from trxcllnt and removed request for a team May 30, 2022 05:35
@github-actions github-actions bot added CMake CMake build issue conda Python Affects Python cuDF API. labels May 30, 2022
@sperlingxx sperlingxx changed the base branch from branch-22.06 to branch-22.08 May 30, 2022 05:36
@sperlingxx sperlingxx requested a review from jrhemstad May 30, 2022 08:09
@sperlingxx
Contributor Author

Hi @jrhemstad, can you take another look at this PR? Thanks!

@ajschmidt8 ajschmidt8 removed the request for review from a team June 1, 2022 20:55
@ajschmidt8
Member

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@sperlingxx
Contributor Author

@gpucibot merge

@sperlingxx sperlingxx requested a review from jlowe June 7, 2022 02:00
@rapids-bot rapids-bot bot merged commit 4dfd684 into rapidsai:branch-22.08 Jun 7, 2022
rapids-bot bot pushed a commit that referenced this pull request Jun 16, 2022
#10884 added a test that generates a fatal CUDA error, requiring a separate JVM process to avoid the error leaking into subsequent tests. There are some CI scripts that select all tests and then deselect some, and this new test also needs to be excluded to avoid running it in the same JVM as other tests.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Thomas Graves (https://github.com/tgravescs)
  - Gera Shegalov (https://github.com/gerashegalov)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11083