Halt Spark executor when encountering unrecoverable CUDA errors #5350
Conversation
Signed-off-by: sperlingxx <[email protected]>
build
private val CUDF_EXCEPTION: String = classOf[ai.rapids.cudf.CudfException].getName

def isCudaException(ef: ExceptionFailure): Boolean = {
  ef.description.contains(CUDA_EXCEPTION) || ef.toErrorString.contains(CUDA_EXCEPTION)
Are the cudf exceptions not serializable? Otherwise it looks like we should be able to use the exception() function to look at the actual exception type. I guess we can keep the string parsing if the exception is None. @jlowe for opinion
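For illustration, a minimal sketch of that suggestion, assuming Spark's ExceptionFailure.exception and cudf's CudaException; the helper name and the cause-chain walk are hypothetical, not this PR's code:

import org.apache.spark.ExceptionFailure
import ai.rapids.cudf.CudaException

// Hypothetical helper: prefer the deserialized exception when Spark provides it,
// and only fall back to string matching when it is None.
def isCudaError(ef: ExceptionFailure): Boolean = ef.exception match {
  case Some(t) =>
    // Walk the cause chain in case the CUDA error is wrapped by another exception.
    Iterator.iterate(t)(_.getCause).takeWhile(_ != null)
      .exists(_.isInstanceOf[CudaException])
  case None =>
    // The exception could not be deserialized; fall back to scraping the strings.
    ef.description.contains(classOf[CudaException].getName) ||
      ef.toErrorString.contains(classOf[CudaException].getName)
}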
Agree we should pursue a solution that does not rely on scraping description messages if possible.
Done.
Signed-off-by: sperlingxx <[email protected]>
build
@@ -328,6 +324,34 @@ object RapidsExecutorPlugin {
    zipped.forall { case (e, a) => e <= a }
  }
}

private val CUDA_EXCEPTION: String = classOf[ai.rapids.cudf.CudaException].getName
Curious, why not import the classes rather than fully qualify them everywhere?
case None =>
  ef.description.contains(CUDA_EXCEPTION) ||
    ef.toErrorString.contains(CUDA_EXCEPTION)
Is it necessary to scrape the description message now that we're catching it by the class type? Has the code been tested to verify we are not coming through the message-scraping code path?
Hi @jlowe, I ran into a problem: I couldn't find an easy way to trigger CUDA errors via the Spark APIs.
You can trigger an illegal address with some hackery through the cudf Java APIs. For example, this builds a column view that points to an invalid device address and triggers an illegal address exception:
import ai.rapids.cudf._

class BadDeviceBuffer extends BaseDeviceMemoryBuffer(256L, 256L, null.asInstanceOf[MemoryBuffer.MemoryBufferCleaner]) {
  override def slice(offset: Long, len: Long): MemoryBuffer = null
}

[...]

withResource(ColumnView.fromDeviceBuffer(new BadDeviceBuffer(), 0, DType.INT8, 256)) { v =>
  withResource(v.add(v)) { vsum =>
    Cuda.DEFAULT_STREAM.sync
  }
}
Note that it may be a bit tricky to do this as an automated test, as the GPU will become useless for any subsequent tests in the same process. Once the illegal address is triggered, most CUDA APIs will fail until the process is terminated.
Hi @jlowe, I tried the above approach, and it did create a fatal CUDA error. I also found that cuDF JNI couldn't identify (fatal) CUDA errors caused by RMM calls, because CUDA calls in rmm are wrapped with RMM_CUDA_TRY_ALLOC instead of CUDF_CUDA_TRY:
#define RMM_CUDA_TRY_ALLOC(_call)                                                                   \
  do {                                                                                              \
    cudaError_t const error = (_call);                                                              \
    if (cudaSuccess != error) {                                                                     \
      cudaGetLastError();                                                                           \
      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" + RMM_STRINGIFY(__LINE__) +  \
                       ": " + cudaGetErrorName(error) + " " + cudaGetErrorString(error);            \
      if (cudaErrorMemoryAllocation == error) {                                                     \
        throw rmm::out_of_memory{msg};                                                              \
      } else {                                                                                      \
        throw rmm::bad_alloc{msg};                                                                  \
      }                                                                                             \
    }                                                                                               \
  } while (0)
Shall we catch all rmm::bad_alloc as CudaFatalException in cuDF JNI?
Shall we catch all rmm::bad_alloc as CudaFatalException in cuDF JNI?
rmm::bad_alloc simply means "something else went wrong other than an allocation failure." It does not necessarily mean a CUDA fatal exception occurred.
However, this demonstrates that the CUDA errors are asynchronous, and we're not guaranteed that libcudf is the one processing the error directly (and thus translating it into a CUDA fatal exception). That means we really have no choice but to fall back to scraping the messages, looking for something that looks like a fatal CUDA error (i.e., something like Tom's original change). We can still leverage the cudf exception types, but I don't see a way to avoid scraping the messages in the worst case. 😞
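As a rough sketch only of that worst-case combination (the helper name, the list of sticky error names, and the way the two signals are wired together are assumptions, not this PR's code):

import ai.rapids.cudf.CudaFatalException

// Assumption: an illustrative, intentionally incomplete list of "sticky" CUDA error
// names that might appear in an error message produced by non-libcudf code.
val fatalCudaErrorNames = Seq("cudaErrorIllegalAddress")

// Hypothetical combination of both signals: the typed cudf exception when the
// Throwable chain is available, plus message scraping as the worst-case fallback.
def looksFatal(causeChain: Seq[Throwable], errorString: String): Boolean =
  causeChain.exists(_.isInstanceOf[CudaFatalException]) ||
    fatalCudaErrorNames.exists(errorString.contains)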
Yes, it seems there is no other way but to scrape the error messages, in order to make sure that we catch as many fatal errors as possible. I added back Tom's original approach as a double check on CUDA errors thrown by cuDF/RMM, since libcudf may not be able to catch all fatal errors.
I was reminded that we can avoid scraping the error messages by using the same type of CUDA fatal error detection algorithm as implemented by libcudf to decide when to throw a fatal CUDA error. The problem with scraping the error message is that there's no guarantee that the information needed to discern a fatal CUDA error is in the error message string (since the error could have been detected and thrown by different code than libcudf).
So we can still leverage the cudf specific fatal exceptions, but for any other type of exception we should double-check the CUDA error state, using the same algorithm libcudf uses, to detect fatal CUDA errors not caught by libcudf code.
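Purely as an illustration of that double check from the JVM side (not this PR's code, and not necessarily the exact algorithm libcudf uses): after catching a non-cudf exception, one could issue a cheap checked CUDA call through the cudf Java API and see whether it also fails, assuming the cudf JNI surfaces a sticky error as CudaFatalException as discussed below:

import ai.rapids.cudf.{Cuda, CudaException, CudaFatalException}

// Hypothetical probe: if the device is in a sticky (fatal) error state, any
// subsequent checked CUDA call should fail as well.
def deviceLooksFatal(): Boolean = {
  try {
    Cuda.DEFAULT_STREAM.sync()
    false
  } catch {
    case _: CudaFatalException => true  // sticky error reported by the cudf JNI
    case _: CudaException => false      // a recoverable CUDA error
  }
}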
Signed-off-by: sperlingxx <[email protected]>
build
build
Please retarget for 22.08
Signed-off-by: sperlingxx <[email protected]>
Hi @tgravescs, I retargeted to branch-22.08 and removed the message-scraping code used for the double check, since cuDF can fully take care of fatal error detection with rapidsai/cudf#10884.
Going to leave this for @jlowe for now since he is familiar with the cudf change.
build
Co-authored-by: Jason Lowe <[email protected]>
Signed-off-by: sperlingxx <[email protected]>
Co-authored-by: Jason Lowe <[email protected]>
build
Signed-off-by: sperlingxx <[email protected]>
Closes #5029
Following #5118, this PR detects unrecoverable (fatal) CUDA errors through the cuDF utility, which applies a more comprehensive check to determine whether a CUDA error is fatal or not.
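For context on what halting the executor can look like, here is a minimal, hedged sketch of an onTaskFailed handler in a Spark executor plugin; the exit code and helper logic are illustrative assumptions, not this PR's exact implementation:

import org.apache.spark.{ExceptionFailure, TaskFailedReason}
import ai.rapids.cudf.CudaFatalException

// Illustrative handler: stop the executor once a fatal CUDA error is observed,
// since the GPU is unusable for the rest of this process.
def onTaskFailed(reason: TaskFailedReason): Unit = reason match {
  case ef: ExceptionFailure =>
    val sawFatal = ef.exception.exists { t =>
      // Check the exception and its cause chain for the cudf fatal exception type.
      Iterator.iterate(t)(_.getCause).takeWhile(_ != null)
        .exists(_.isInstanceOf[CudaFatalException])
    }
    if (sawFatal) {
      System.exit(20)  // the exit code here is an arbitrary, illustrative choice
    }
  case _ => // not an exception failure; nothing to do
}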