Improve the capture of fatal cuda error #10884
Conversation
Signed-off-by: sperlingxx <[email protected]>
Codecov Report
@@            Coverage Diff             @@
##           branch-22.08   #10884   +/- ##
===============================================
  Coverage            ?       86.32%
===============================================
  Files               ?          144
  Lines               ?        22696
  Branches            ?            0
===============================================
  Hits                ?        19593
  Misses              ?         3103
  Partials            ?            0
===============================================

Continue to review full report at Codecov.
cpp/include/cudf/utilities/error.hpp
Outdated
// Calls cudaGetLastError again. It is nearly certain that a fatal error occurred if the second
// call doesn't return with cudaSuccess.
cudaGetLastError();
auto const last = cudaGetLastError();
Wait, what? The two calls are necessary to distinguish a fatal error from a non-fatal one. The first call clears any pending error state. If the second call still sees an error, then it's extremely likely that a sticky error has occurred.
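For reference, a minimal, self-contained sketch of the two-call classification being described; the exception types and the function name below are illustrative stand-ins, not the actual definitions in error.hpp:

#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Stand-ins for cudf::cuda_error / cudf::fatal_cuda_error (illustration only).
struct cuda_error : std::runtime_error {
  using std::runtime_error::runtime_error;
};
struct fatal_cuda_error : cuda_error {
  using cuda_error::cuda_error;
};

// The pattern described above: the first cudaGetLastError() clears any recoverable
// pending error; if the second call still reports the same error, it could not be
// cleared, so it is treated as sticky (fatal).
[[noreturn]] void throw_cuda_error_sketch(cudaError_t error)
{
  cudaGetLastError();                    // clear recoverable error state
  auto const last = cudaGetLastError();  // still set? then the error is sticky
  auto const msg  = std::string{cudaGetErrorString(error)};
  if (error == last && last != cudaSuccess) { throw fatal_cuda_error{msg}; }
  throw cuda_error{msg};
}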
However, the fatal-error test case only passes when I remove this line:
TEST(FatalCase, CudaFatalError)
{
  auto type = cudf::data_type{cudf::type_id::INT32};
  // Build a column_view over a bogus device address so the kernel performs an illegal access.
  auto cv = cudf::column_view(type, 256, (void*)256);
  cudf::binary_operation(cv, cv, cudf::binary_operator::ADD, type);
  EXPECT_THROW(CUDF_CUDA_TRY(cudaDeviceSynchronize()), cudf::fatal_cuda_error);
}
Agree that this seems like a dubious change. What specifically fails with the test without this change? Does it throw too early, in cudf::binary_operation, or not throw at all? If the error is truly fatal, I don't see how removing a cudaGetLastError call is going to help this test pass. With a fatal error, we should be able to call cudaGetLastError as many times as we want, and it will never clear.
Then that means the error being generated isn't a true sticky error, which is admittedly surprising. With the test you're doing here, I'd expect an illegal access error (which I'm pretty sure is sticky).
You could condense this down a bit by just launching a kernel like:
__global__ void fatal_kernel() {
__assert_fail(nullptr,nullptr,0,nullptr);
}
...
TEST(FatalCase, CudaFatalError)
{
fatal_kernel<<<1,1>>>();
EXPECT_THROW(CUDF_CUDA_TRY(cudaDeviceSynchronize()), cudf::fatal_cuda_error);
}
Though you'd need to do this in a death test because it will corrupt the process context and leave the GPU unusable. See cudf/cpp/tests/error/error_handling_test.cu, lines 89 to 111 at 1db83e3:
__global__ void assert_false_kernel() { cudf_assert(false && "this kernel should die"); }

__global__ void assert_true_kernel() { cudf_assert(true && "this kernel should live"); }

TEST(DebugAssertDeathTest, cudf_assert_false)
{
  testing::FLAGS_gtest_death_test_style = "threadsafe";

  auto call_kernel = []() {
    assert_false_kernel<<<1, 1>>>();

    // Kernel should fail with `cudaErrorAssert`
    // This error invalidates the current device context, so we need to kill
    // the current process. Running with EXPECT_DEATH spawns a new process for
    // each attempted kernel launch
    if (cudaErrorAssert == cudaDeviceSynchronize()) { std::abort(); }

    // If we reach this point, the cudf_assert didn't work so we exit normally, which will cause
    // EXPECT_DEATH to fail.
  };

  EXPECT_DEATH(call_kernel(), "this kernel should die");
}
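Putting the two suggestions together, a hedged sketch of what the fatal-error test might look like as a death test; it assumes the fatal_kernel above plus the CUDF_CUDA_TRY macro and cudf::fatal_cuda_error type used elsewhere in this PR:

// Sketch only: wrap the suggested fatal_kernel launch in the same EXPECT_DEATH pattern,
// so the corrupted CUDA context dies with the spawned process instead of leaking into
// later tests.
TEST(FatalCudaErrorDeathTest, throws_fatal_cuda_error)
{
  testing::FLAGS_gtest_death_test_style = "threadsafe";

  auto call_kernel = []() {
    fatal_kernel<<<1, 1>>>();
    try {
      CUDF_CUDA_TRY(cudaDeviceSynchronize());
    } catch (cudf::fatal_cuda_error const&) {
      std::abort();  // expected path: the error was classified as fatal
    }
    // Reaching this point means no fatal_cuda_error was thrown; exiting normally makes
    // EXPECT_DEATH fail.
  };

  EXPECT_DEATH(call_kernel(), "");
}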
I tried with fatal_kernel, but it threw nothing.
I'm still not comfortable with removing a line of code without understanding why we're removing it. It may have helped your test case, but we need to understand how that was significant for that test (is it a problem with the test?) or how removing this will not create problems in other scenarios trying to detect fatal errors. If the CUDA error truly is fatal, it should not matter if we read the error an extra time. It should make it even more likely it truly is a fatal error if the error persists despite extra attempts at clearing it.
Hi @jlowe, according to the description in the CUDA doc ("Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess"), does the cudaGetLastError API work like popping the top error off a CUDA error stack?
Yes, normally it clears the error, but there are categories of errors that are unclearable. These are the fatal errors we are trying to detect here. If you're finding that cudaGetLastError is able to clear an error, then it seems that error is not actually a fatal error and we should not report it as such.
Hi @jrhemstad, thank you for the link. However, my simple test suggests that the second call of cudaGetLastError clears fatal errors as well, unless I'm misunderstanding something:
// a valid CUDA call
int* p0;
EXPECT_EQ(cudaMalloc(&p0, 128), cudaSuccess);
// produce an unrecoverable CUDA error: cudaErrorIllegalAddress
auto type = cudf::data_type{cudf::type_id::INT32};
auto cv = cudf::column_view(type, 256, (void*)256);
cudf::binary_operation(cv, cv, cudf::binary_operator::ADD, type);
// wait for the illegal binary operation to finish, then capture the CUDA status
EXPECT_EQ(cudaDeviceSynchronize(), cudaErrorIllegalAddress);
EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success
// Any subsequent CUDA calls will fail, since the CUDA context has been corrupted.
int* p1;
EXPECT_EQ(cudaMalloc(&p1, 128), cudaErrorIllegalAddress);
EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success
int* p2;
EXPECT_EQ(cudaMalloc(&p2, 128), cudaErrorIllegalAddress);
EXPECT_EQ(cudaGetLastError(), cudaErrorIllegalAddress);
EXPECT_EQ(cudaGetLastError(), cudaSuccess); // the second call returns success
Signed-off-by: sperlingxx <[email protected]>
According to the result of @jrhemstad's experiments, I used cudaFree(0) instead:
-  cudaGetLastError();
-  auto const last = cudaGetLastError();
+  auto const last = cudaFree(0);
Does this end up doing a full device synchronize as normal cudaFree calls do? If it does, ideally we would want to find a CUDA call that can detect the error with minimal (ideally zero) synchronization with the device.
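One way to sanity-check that concern outside of the PR is to time the call while a long-running kernel is in flight; if cudaFree(0) implicitly synchronizes, it will block until the kernel completes. A rough, hypothetical probe (not part of this change):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Busy-wait on the device long enough to make any implicit synchronization obvious.
__global__ void spin_kernel(long long cycles)
{
  long long start = clock64();
  while (clock64() - start < cycles) {}
}

int main()
{
  spin_kernel<<<1, 1>>>(2'000'000'000LL);  // keep the GPU busy for a while

  auto const t0  = std::chrono::steady_clock::now();
  auto const err = cudaFree(0);  // the probe under discussion
  auto const t1  = std::chrono::steady_clock::now();

  auto const us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
  std::printf("cudaFree(0) returned %d after %lld us\n", static_cast<int>(err),
              static_cast<long long>(us));

  cudaDeviceSynchronize();  // let the kernel finish before exiting
  return 0;
}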
Hi @jlowe, according to the CUDA doc, "If devPtr is 0, no operation is performed. cudaFree() returns cudaErrorValue in case of failure."
If we're guaranteed this doesn't do anything slow like a synchronize it seems OK to me, but I'll defer to @jrhemstad's judgement on whether this is the best approach with the limited tools we have to detect this.
"If devPtr is 0, no operation is performed."

lol, well that's just straight up a lie, given that 99% of the world uses cudaFree(0) to force context initialization 🙃.

tbh, I've had my confidence shaken in the whole "sticky" error thing as a result of exploring this for this PR.

The right long-term solution is that we'll need to file an RFE to get a deterministic, programmatic way to query when the context is borked.

In the meantime, cudaFree(0) seems about the least bad option available.
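For reference, a hedged sketch of how the cudaFree(0) probe slots into the earlier two-call sketch, reusing the stand-in exception types from that snippet (the actual change is just the small diff shown above):

// Same idea as the earlier sketch, with the cudaGetLastError() pair replaced by the
// cudaFree(0) probe discussed here: if the context has been corrupted by a sticky error,
// the unrelated cudaFree(0) call reports that same error; otherwise it returns cudaSuccess.
[[noreturn]] void throw_cuda_error_sketch_v2(cudaError_t error)
{
  auto const last = cudaFree(0);  // cheap, independent probe of the context state
  auto const msg  = std::string{cudaGetErrorString(error)};
  if (error == last && last != cudaSuccess) { throw fatal_cuda_error{msg}; }
  throw cuda_error{msg};
}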
Shall we just let the PR in, as a sort of workaround?
This needs to be retargeted to 22.08.
Hi @jrhemstad, can you take another look at this PR? Thanks!
Removing
@gpucibot merge
#10884 added a test that generates a CUDA fatal error, requiring a separate JVM process to avoid the error leaking into subsequent tests. There are some CI scripts that are selecting all tests and then deselecting some, and this new test also needs to be excluded to avoid running it in the same JVM as other tests.

Authors:
- Jason Lowe (https://github.com/jlowe)

Approvers:
- Thomas Graves (https://github.com/tgravescs)
- Gera Shegalov (https://github.com/gerashegalov)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #11083
This PR is a follow-up to #10630, which improved the capture of fatal CUDA errors in libcudf and the cudf Java package. On the libcudf side, it removes one of the cudaGetLastError calls in throw_cuda_error, since the call that returned the CUDA error can be deemed as the first call.
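For callers, the practical effect is being able to tell an unrecoverable context apart from an ordinary failure. A hedged usage sketch (fragment; assumes cudf::fatal_cuda_error derives from cudf::cuda_error as introduced by #10630, and that <cudf/utilities/error.hpp>, <cstdlib>, and <iostream> are included):

// Order the catch blocks from most to least specific.
try {
  CUDF_CUDA_TRY(cudaDeviceSynchronize());
} catch (cudf::fatal_cuda_error const& e) {
  // The CUDA context is corrupted: further CUDA calls in this process will keep failing,
  // so tear the process down (or restart it) rather than retrying.
  std::cerr << "fatal CUDA error: " << e.what() << std::endl;
  std::abort();
} catch (cudf::cuda_error const& e) {
  // Recoverable error: safe to report and keep using the device.
  std::cerr << "CUDA error: " << e.what() << std::endl;
}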