[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors #5029
Comments
Hi @jlowe, I have a rough idea on this issue: failing fast through
Ideally we should update the cudf bindings to throw a different type of exception for these sticky errors, which will make them easier to classify in Java/Scala code. There is centralized code in the cudf Java bindings where this mapping can take place. As for which errors are "sticky", that should be driven primarily by the CUDA documentation on CUDA error codes: for example, an error would be considered "sticky" if its description says the process is left in an inconsistent state and any further CUDA work will return the same error.
I would also add
May also want to add cudaErrorECCUncorrectable to that list.
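A minimal sketch of the classification approach discussed above: a centralized error-mapping point in the bindings rethrows sticky errors as a distinct exception type that Java/Scala callers can match on. The class and method names here are hypothetical, and the sticky set only lists the error codes mentioned in this thread; this is not the actual cudf binding code.

```scala
// Hypothetical exception type for sticky (unrecoverable) CUDA errors; the real
// cudf bindings may name and shape this differently.
class CudaStickyException(message: String, val errorName: String)
  extends RuntimeException(s"$errorName: $message")

object CudaErrorMapping {
  // Error names treated as sticky, driven by the CUDA documentation; the two
  // entries here are the ones mentioned in this thread.
  private val stickyErrorNames = Set(
    "cudaErrorIllegalAddress",
    "cudaErrorECCUncorrectable")

  // Centralized mapping point: every CUDA error surfaced through JNI goes through
  // here, so sticky errors get a distinct type that is easy to classify downstream.
  def throwCudaError(errorName: String, message: String): Nothing = {
    if (stickyErrorNames.contains(errorName)) {
      throw new CudaStickyException(message, errorName)
    } else {
      throw new RuntimeException(s"$errorName: $message")
    }
  }
}
```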
A couple of examples of the Exceptions:
This PR is for NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870; it enables cuDF JNI to throw CUDA errors with their specific error codes. This PR relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others. With this improvement, it should be easier to track CUDA errors triggered by JVM APIs.
Authors:
- Alfred Xu (https://github.com/sperlingxx)
Approvers:
- Jason Lowe (https://github.com/jlowe)
URL: #10551
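A hedged sketch of how plugin-side code might react once the bindings surface a distinct exception for fatal CUDA errors. The class name ai.rapids.cudf.CudaFatalException, the wrapper, and the shutdown helper are assumptions for illustration, not the actual spark-rapids handling.

```scala
import ai.rapids.cudf.CudaFatalException

object GpuTaskRunner {
  // Wrap GPU work so a fatal ("sticky") CUDA error can be noticed in one place.
  def runOnGpu[T](body: => T): T = {
    try {
      body
    } catch {
      case fatal: CudaFatalException =>
        // The GPU is in an unrecoverable state: every later CUDA call from this
        // process will fail, so flag the executor for teardown and rethrow.
        markExecutorForShutdown(fatal)
        throw fatal
    }
  }

  // Hypothetical helper, included only to keep the sketch self-contained.
  private def markExecutorForShutdown(cause: Throwable): Unit = {
    System.err.println(
      s"Fatal CUDA error detected; executor should stop accepting tasks: ${cause.getMessage}")
  }
}
```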
Moving to 22.08 as there are cudf dependencies that will be in 22.08.
Closes #5029. Detects unrecoverable (fatal) CUDA errors through the cuDF utility, which applies a more comprehensive way to determine whether a CUDA error is fatal or not. Signed-off-by: sperlingxx <[email protected]> Co-authored-by: Jason Lowe <[email protected]>
Is your feature request related to a problem? Please describe.
Certain CUDA errors, like illegal memory access, are "sticky," meaning that all CUDA operations to the GPU after the error will continue to return the same error over and over. No GPU operations will succeed after that point.
Describe the solution you'd like
The RAPIDS Accelerator should take measures to prevent further task execution on the executor once these "sticky" errors are detected. Tearing down the executor process is probably the best option, at least in the short term. Without an external shuffle handler we will lose the shuffle output of tasks that have already completed, but this is probably a better way to "fail fast" than to let the executor keep accepting new tasks only to have them fail the first time they touch the GPU.
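A minimal sketch of the "tear down the executor" option described above, assuming sticky errors can be recognized (for example by exception type or by matching the CUDA error name). All names here are illustrative, not the actual RAPIDS Accelerator implementation; exiting the process deliberately trades the lost shuffle output for failing fast.

```scala
object StickyCudaErrorHandler {
  @volatile private var fatalErrorSeen = false

  def handlePossiblyStickyError(t: Throwable): Unit = {
    if (isSticky(t) && !fatalErrorSeen) {
      fatalErrorSeen = true
      // Exiting loses any shuffle data this executor produced, but it prevents the
      // executor from accepting new tasks that would fail on their first GPU call.
      System.err.println(s"Sticky CUDA error detected, terminating executor: ${t.getMessage}")
      System.exit(1)
    }
  }

  // Hypothetical classification; in practice this would key off the exception
  // type thrown by the cudf bindings rather than message matching.
  private def isSticky(t: Throwable): Boolean =
    Option(t.getMessage).exists { m =>
      m.contains("cudaErrorIllegalAddress") || m.contains("cudaErrorECCUncorrectable")
    }
}
```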