
Improve the capture of fatal cuda error #10884

Merged
merged 12 commits into from
Jun 7, 2022

Conversation

sperlingxx
Contributor

This PR is a follow-up to #10630, improving the capture of fatal CUDA errors in libcudf and the cuDF Java package.

  1. libcudf: Removes the redundant call of cudaGetLastError in throw_cuda_error, since the call that returned the CUDA error can be regarded as the first call.
  2. JNI: Leverages similar logic to discern fatal CUDA errors from caught exceptions. The check at the JNI level is necessary because fatal CUDA errors raised through RMM APIs cannot otherwise be distinguished.
  3. Adds a C++ unit test for the capture of fatal CUDA errors.
  4. Adds a Java unit test for the capture of fatal CUDA errors.

Signed-off-by: sperlingxx <[email protected]>
@sperlingxx sperlingxx requested review from jlowe and jrhemstad May 18, 2022 06:27
@sperlingxx sperlingxx requested review from a team as code owners May 18, 2022 06:27
@sperlingxx sperlingxx requested a review from vyasr May 18, 2022 06:27
@github-actions github-actions bot added Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels May 18, 2022
@sperlingxx sperlingxx added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 18, 2022
@codecov

codecov bot commented May 18, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@ad00e44).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #10884   +/-   ##
===============================================
  Coverage                ?   86.32%           
===============================================
  Files                   ?      144           
  Lines                   ?    22696           
  Branches                ?        0           
===============================================
  Hits                    ?    19593           
  Misses                  ?     3103           
  Partials                ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 117 to 119
// Calls cudaGetLastError again. It is nearly certain that a fatal error occurred if the second
// call doesn't return with cudaSuccess.
cudaGetLastError();
auto const last = cudaGetLastError();
Contributor

Wait, what? The two calls are necessary to detect a fatal error vs a non-fatal. The first call clears any pending error state. If the second call still sees an error, then it's extremely likely that a sticky error has occurred.
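To make the two-call pattern under discussion concrete, here is a minimal, runnable sketch with the CUDA runtime mocked out so it needs no GPU. All names here (mock_cudaGetLastError, g_sticky, the exception types) are illustrative stand-ins, not the real libcudf or CUDA symbols; a sticky (fatal) error is modeled as one that a read of the last-error slot never clears.

```cpp
#include <cassert>
#include <stdexcept>

// Mocked stand-ins for the CUDA runtime, so the classification logic can be
// shown (and run) without a GPU.
enum cuda_error_t { cudaSuccess, cudaErrorIllegalAddress };

// Simulated per-thread error state: a sticky (fatal) error never clears.
static cuda_error_t g_last_error = cudaSuccess;
static bool g_sticky            = false;

cuda_error_t mock_cudaGetLastError()
{
  cuda_error_t const err = g_last_error;
  if (!g_sticky) { g_last_error = cudaSuccess; }  // non-sticky errors clear on read
  return err;
}

struct cuda_exception : std::runtime_error {
  using std::runtime_error::runtime_error;
};
struct fatal_cuda_error : cuda_exception {
  using cuda_exception::cuda_exception;
};

// The two-call idea: the first read clears any pending, recoverable error;
// if a second read still reports an error, the error is sticky and the
// context is presumed lost, so throw the fatal variant.
void throw_cuda_error()
{
  mock_cudaGetLastError();  // clears a pending, recoverable error
  cuda_error_t const last = mock_cudaGetLastError();
  if (last != cudaSuccess) { throw fatal_cuda_error{"fatal CUDA error"}; }
  throw cuda_exception{"recoverable CUDA error"};
}
```

Under this mock, a non-sticky error is consumed by the first read and classified as recoverable, while a sticky one survives both reads and is classified as fatal — which is exactly the distinction the two calls are meant to draw.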

Contributor Author

However, the fatal error test case only passes when I remove this line.

TEST(FatalCase, CudaFatalError)
{
  auto type = cudf::data_type{cudf::type_id::INT32};
  auto cv   = cudf::column_view(type, 256, (void*)256);
  cudf::binary_operation(cv, cv, cudf::binary_operator::ADD, type);
  EXPECT_THROW(CUDF_CUDA_TRY(cudaDeviceSynchronize()), cudf::fatal_cuda_error);
}

Contributor

Agree that this seems like a dubious change. What specifically fails with the test without this change? Does it throw too early, in cudf::binary_operation, or not throw at all? If the error is truly fatal, I don't see how removing a cudaGetLastError call is going to help this test pass. With a fatal error, we should be able to call cudaGetLastError as many times as we want, and it will never clear.

Contributor

Then that means the error being generated isn't a true sticky error, which is admittedly surprising. With the test you're doing here, I'd expect an illegal access error (which I'm pretty sure is sticky).

You could condense this down a bit by just launching a kernel like:

__global__ void fatal_kernel() {
    __assert_fail(nullptr,nullptr,0,nullptr);
}
...
TEST(FatalCase, CudaFatalError)
{
  fatal_kernel<<<1,1>>>();
  EXPECT_THROW(CUDF_CUDA_TRY(cudaDeviceSynchronize()), cudf::fatal_cuda_error);
}

Contributor

Though you'd need to do this in a death test because it will corrupt the process context and leave the GPU unusable. See

__global__ void assert_false_kernel() { cudf_assert(false && "this kernel should die"); }
__global__ void assert_true_kernel() { cudf_assert(true && "this kernel should live"); }

TEST(DebugAssertDeathTest, cudf_assert_false)
{
  testing::FLAGS_gtest_death_test_style = "threadsafe";
  auto call_kernel = []() {
    assert_false_kernel<<<1, 1>>>();
    // Kernel should fail with `cudaErrorAssert`
    // This error invalidates the current device context, so we need to kill
    // the current process. Running with EXPECT_DEATH spawns a new process for
    // each attempted kernel launch
    if (cudaErrorAssert == cudaDeviceSynchronize()) { std::abort(); }
    // If we reach this point, the cudf_assert didn't work, so we exit normally,
    // which will cause EXPECT_DEATH to fail.
  };
  EXPECT_DEATH(call_kernel(), "this kernel should die");
}
for example.

Contributor

> I tried with fatal_kernel, but it threw nothing.

I'm still not comfortable with removing a line of code without understanding why we're removing it. It may have helped your test case, but we need to understand how that was significant for that test (is it a problem with the test?) or how removing this will not create problems in other scenarios trying to detect fatal errors. If the CUDA error truly is fatal, it should not matter if we read the error an extra time. It should make it even more likely it truly is a fatal error if the error persists despite extra attempts at clearing it.

Contributor Author

@sperlingxx sperlingxx May 23, 2022

Hi @jlowe, according to the description in the CUDA doc ("Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess"), the cudaGetLastError API works like popping the top error off a CUDA error stack?

Contributor

> Hi @jlowe, according to the description in the CUDA doc ("Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess"), the cudaGetLastError API works like popping the top error off a CUDA error stack?

Yes, normally it clears the error, but there are categories of errors that are unclearable. These are the fatal errors we are trying to detect here. If you're finding that cudaGetLastError is able to clear an error then it seems that error is not actually a fatal error and we should not report it as such.

Contributor Author

@sperlingxx sperlingxx May 24, 2022

Hi @jrhemstad, thank you for the link. However, my simple test suggests that the second call of cudaGetLastError clears fatal errors as well, unless I'm misunderstanding something.

  // a valid CUDA call
  int* p0;
  EXPECT_EQ(cudaMalloc(&p0, 128), cudaSuccess);

  // produce an unrecoverable CUDA error: cudaErrorIllegalAddress
  auto type = cudf::data_type{cudf::type_id::INT32};
  auto cv   = cudf::column_view(type, 256, (void*)256);
  cudf::binary_operation(cv, cv, cudf::binary_operator::ADD, type);
  // wait for the illegal binary operation to finish, then capture the CUDA status
  EXPECT_EQ(cudaDeviceSynchronize(), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success

  // Any subsequent CUDA calls will fail, since the CUDA context has been corrupted.
  int* p1;
  EXPECT_EQ(cudaMalloc(&p1, 128), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success

  int* p2;
  EXPECT_EQ(cudaMalloc(&p2, 128), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
  EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success
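One way to read this experiment: the thread-local last-error slot itself is cleared by each read, but on a corrupted context every subsequent runtime call fails and re-arms the slot. A mocked, GPU-free sketch of that interpretation — all names here (mock_runtime_call, g_corrupted, ...) are hypothetical stand-ins, not real CUDA symbols:

```cpp
#include <cassert>

// Mocked CUDA-style error codes and state; no real CUDA is involved.
enum cuda_error_t { cudaSuccess, cudaErrorIllegalAddress };

// Once the context is "corrupted", every runtime call both fails and
// re-arms the thread-local last-error slot -- even though reading the
// slot always clears it.
static bool g_corrupted      = false;
static cuda_error_t g_last   = cudaSuccess;

cuda_error_t mock_runtime_call()  // stands in for cudaMalloc, etc.
{
  if (g_corrupted) {
    g_last = cudaErrorIllegalAddress;
    return cudaErrorIllegalAddress;
  }
  return cudaSuccess;
}

cuda_error_t mock_cudaGetLastError()
{
  cuda_error_t const err = g_last;
  g_last = cudaSuccess;  // the slot itself always clears on read
  return err;
}
```

Under this model, a second consecutive cudaGetLastError returning cudaSuccess (as observed above) is consistent with the context still being unusable, because the next runtime call fails and repopulates the slot again.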

@sperlingxx
Contributor Author

According to the result of @jrhemstad 's experiments, I used cudaFree(0) instead of cudaGetLastError to detect the fatal error.

- cudaGetLastError();
- auto const last = cudaGetLastError();
+ auto const last = cudaFree(0);
Contributor

@jlowe jlowe May 26, 2022

Does this end up doing a full device synchronize as normal cudaFree calls do? If it does, ideally we would want to find a CUDA call that can detect the error with minimal (ideally zero) synchronization with the device.

Contributor Author

Hi @jlowe, according to the CUDA doc, "If devPtr is 0, no operation is performed. cudaFree() returns cudaErrorValue in case of failure."

Contributor

If we're guaranteed this doesn't do anything slow like a synchronize it seems OK to me, but I'll defer to @jrhemstad's judgement on whether this is the best approach with the limited tools we have to detect this.

Contributor

> If devPtr is 0, no operation is performed.

lol, well that's just straight up a lie given that 99% of the world uses cudaFree(0) to force context initialization 🙃.

tbh, I've had my confidence shaken in the whole "sticky" error thing as a result of exploring this PR.

The right long term solution is that we'll need to file an RFE to get a deterministic, programmatic way to query when the context is borked.

In the meantime, cudaFree(0) seems about the least bad option available.
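The probe idea can be sketched with a mocked runtime — illustrative names only, under the assumption stated in this thread that cudaFree(0) reports an error once the context is corrupted, without consuming the thread-local last-error state the way cudaGetLastError does:

```cpp
#include <cassert>

// Mocked CUDA runtime state, so the probe idea can run without a GPU.
// In the real patch the probe is an actual cudaFree(0) call.
enum cuda_error_t { cudaSuccess, cudaErrorIllegalAddress };

static bool g_context_corrupted = false;

// cudaFree(nullptr) is documented as a no-op, but it still enters the
// driver, so (per the discussion above) it reports an error once the
// context is corrupted -- and, unlike cudaGetLastError, it does not
// clear any error state as a side effect.
cuda_error_t mock_cudaFree(void* devPtr)
{
  (void)devPtr;
  return g_context_corrupted ? cudaErrorIllegalAddress : cudaSuccess;
}

// The probe used as the fatal-error check: any error from the no-op free
// means the context is unusable.
bool is_fatal() { return mock_cudaFree(nullptr) != cudaSuccess; }
```

Because the probe has no side effect on the last-error slot in this model, it can be called any number of times without perturbing the state the rest of the error-handling code observes.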

Contributor Author

Shall we just merge this PR as a workaround for now?

Contributor

@jlowe jlowe left a comment

This needs to be retargeted to 22.08.


@sperlingxx sperlingxx requested review from a team as code owners May 30, 2022 05:35
@sperlingxx sperlingxx requested review from trxcllnt and removed request for a team May 30, 2022 05:35
@github-actions github-actions bot added CMake CMake build issue conda Python Affects Python cuDF API. labels May 30, 2022
@sperlingxx sperlingxx changed the base branch from branch-22.06 to branch-22.08 May 30, 2022 05:36
@sperlingxx sperlingxx requested a review from jrhemstad May 30, 2022 08:09
@sperlingxx
Contributor Author

Hi @jrhemstad, can you take another look at this PR? Thanks!

@ajschmidt8 ajschmidt8 removed the request for review from a team June 1, 2022 20:55
@ajschmidt8
Member

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@sperlingxx
Contributor Author

@gpucibot merge

@sperlingxx sperlingxx requested a review from jlowe June 7, 2022 02:00
@rapids-bot rapids-bot bot merged commit 4dfd684 into rapidsai:branch-22.08 Jun 7, 2022
rapids-bot bot pushed a commit that referenced this pull request Jun 16, 2022
#10884 added a test that generates a fatal CUDA error, requiring a separate JVM process to avoid the error leaking into subsequent tests. There are some CI scripts that select all tests and then deselect some, and this new test also needs to be excluded to avoid running it in the same JVM as other tests.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Thomas Graves (https://github.com/tgravescs)
  - Gera Shegalov (https://github.com/gerashegalov)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #11083