
Halt Spark executor when encountering unrecoverable CUDA errors #5350

Merged
sperlingxx merged 11 commits into NVIDIA:branch-22.08 from the recognize_cuda_error branch on Jun 13, 2022

Conversation

sperlingxx
Collaborator

Signed-off-by: sperlingxx [email protected]

Closes #5029

Following #5118, this PR detects unrecoverable (fatal) CUDA errors through the cuDF utility, which applies a more comprehensive check to determine whether a CUDA error is fatal or not.
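
At a high level, the executor-side behavior is: inspect each task failure and, if it carries an unrecoverable CUDA error, halt the executor so Spark stops scheduling further task attempts onto a broken GPU context. A rough sketch of that shape, as it might sit in the executor plugin (illustrative only, not the exact code in this PR; the detection helper and exit code are placeholders, and Spark's ExecutorPlugin.onTaskFailed hook is assumed):

override def onTaskFailed(failureReason: TaskFailedReason): Unit = failureReason match {
  case ef: ExceptionFailure if isFatalCudaError(ef) =>  // placeholder detection helper
    logError(s"Stopping the executor due to an unrecoverable CUDA error: ${ef.description}")
    System.exit(20)  // illustrative exit code
  case _ =>  // recoverable failures go through Spark's normal task retry handling
}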

@sperlingxx sperlingxx requested review from jlowe and tgravescs April 28, 2022 03:58
@sperlingxx
Collaborator Author

build

@sameerz sameerz added the feature request New feature or request label Apr 28, 2022
private val CUDF_EXCEPTION: String = classOf[ai.rapids.cudf.CudfException].getName

def isCudaException(ef: ExceptionFailure): Boolean = {
  ef.description.contains(CUDA_EXCEPTION) || ef.toErrorString.contains(CUDA_EXCEPTION)
}

Collaborator

are the cudf exceptions not serializable? Otherwise it looks like we should be able to use exception() function call to look at the actual exception type. I guess we can keep the string parsing if the exception is None. @jlowe for opinion
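
i.e., something like this sketch of the same members (the hasCudaCause helper is illustrative):

def isCudaException(ef: ExceptionFailure): Boolean = ef.exception match {
  // Preferred: inspect the deserialized exception (and its causes) by type.
  case Some(e) => hasCudaCause(e)
  // Fallback: the exception was not serializable, so only the text is available.
  case None =>
    ef.description.contains(CUDA_EXCEPTION) || ef.toErrorString.contains(CUDA_EXCEPTION)
}

private def hasCudaCause(t: Throwable): Boolean = t match {
  case null => false
  case _: ai.rapids.cudf.CudaException => true
  case other => hasCudaCause(other.getCause)
}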

Member

Agree we should pursue a solution that does not rely on scraping description messages if possible.

Collaborator Author

Done.

Signed-off-by: sperlingxx <[email protected]>
@sperlingxx
Collaborator Author

build

@@ -328,6 +324,34 @@ object RapidsExecutorPlugin {
zipped.forall { case (e, a) => e <= a }
}
}

private val CUDA_EXCEPTION: String = classOf[ai.rapids.cudf.CudaException].getName

Member

Curious, why not import the classes rather than fully qualify them everywhere?
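
For illustration, the import-based alternative would look something like:

import ai.rapids.cudf.{CudaException, CudfException}

private val CUDA_EXCEPTION: String = classOf[CudaException].getName
private val CUDF_EXCEPTION: String = classOf[CudfException].getName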

Comment on lines 335 to 337
case None =>
  ef.description.contains(CUDA_EXCEPTION) ||
    ef.toErrorString.contains(CUDA_EXCEPTION)

Member

Is it necessary to scrape the description message now that we're catching it by the class type? Has the code been tested to verify we are not coming through the message-scraping code path?

Collaborator Author

Hi @jlowe, I encountered a problem: I couldn't find an easy way to trigger CUDA errors via Spark APIs.

Member

You can trigger an illegal address with some hackery via the cudf Java APIs. For example, this builds a column view that points to an invalid device address and triggers an illegal address exception:

import ai.rapids.cudf._

// Pretends to be a 256-byte device buffer sitting at an invalid device address.
class BadDeviceBuffer extends BaseDeviceMemoryBuffer(256L, 256L, null.asInstanceOf[MemoryBuffer.MemoryBufferCleaner]) {
  override def slice(offset: Long, len: Long): MemoryBuffer = null
}

[...]

// The add kernel dereferences the bogus address; the illegal-address error surfaces on the stream sync.
withResource(ColumnView.fromDeviceBuffer(new BadDeviceBuffer(), 0, DType.INT8, 256)) { v =>
  withResource(v.add(v)) { vsum =>
    Cuda.DEFAULT_STREAM.sync
  }
}

Note that it may be a bit tricky to do this as an automated test, as the GPU will become useless for any subsequent tests in the same process. Once the illegal address is triggered, most CUDA APIs will fail until the process is terminated.

Collaborator Author

@sperlingxx May 11, 2022

Hi @jlowe, I tried the above approach, and it did create a fatal CUDA error. However, I found that cuDF JNI couldn't identify (fatal) CUDA errors caused by RMM calls, because CUDA calls in rmm are wrapped with RMM_CUDA_TRY_ALLOC instead of CUDF_CUDA_TRY:

#define RMM_CUDA_TRY_ALLOC(_call)                                                                  \
  do {                                                                                             \
    cudaError_t const error = (_call);                                                             \
    if (cudaSuccess != error) {                                                                    \
      cudaGetLastError();                                                                          \
      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" + RMM_STRINGIFY(__LINE__) + \
                       ": " + cudaGetErrorName(error) + " " + cudaGetErrorString(error);           \
      if (cudaErrorMemoryAllocation == error) {                                                    \
        throw rmm::out_of_memory{msg};                                                             \
      } else {                                                                                     \
        throw rmm::bad_alloc{msg};                                                                 \
      }                                                                                            \
    }                                                                                              \
  } while (0)

Collaborator Author

Shall we catch all rmm::bad_alloc as CudaFatalException in cuDF JNI?

Member

Shall we catch all rmm::bad_alloc as CudaFatalException in cuDF JNI?

rmm::bad_alloc simply means "something else went wrong other than an allocation failure." It does not necessarily mean a CUDA fatal exception occurred.

However, this demonstrates that CUDA errors are asynchronous, and we're not guaranteed that libcudf is the one processing the error directly (and thus translating it into a CUDA fatal exception). That means we really have no choice but to fall back to scraping the messages, looking for something that looks like a fatal CUDA error (i.e., something like Tom's original change). We can still leverage the cudf exception types, but I don't see a way to avoid scraping the messages in the worst case. 😞

Collaborator Author

Yes, it seems there is no other way but to scrape the error messages in order to make sure we catch as many fatal errors as possible. I added back Tom's original approach as a double-check on CUDA errors thrown by cuDF/RMM, since libcudf may not be able to catch all fatal errors.
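
The double-check itself is just string matching; roughly (an illustrative sketch with a non-exhaustive list of "sticky" CUDA error names, not the exact list from Tom's change):

def looksLikeFatalCudaError(msg: String): Boolean = {
  // Sticky CUDA errors leave the context unusable until the process restarts.
  val stickyCudaErrors = Seq(
    "cudaErrorIllegalAddress",
    "cudaErrorLaunchFailure",
    "cudaErrorECCUncorrectable")
  stickyCudaErrors.exists(e => msg.contains(e))
}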

Member

I was reminded that we can avoid scraping the error messages by using the same type of CUDA fatal error detection algorithm as implemented by libcudf to decide when to throw a fatal CUDA error. The problem with scraping the error message is that there's no guarantee that the information needed to discern a fatal CUDA error is in the error message string (since the error could have been detected and thrown by different code than libcudf).

So we can still leverage the cudf specific fatal exceptions, but for any other type of exception we should double-check the CUDA error state, using the same algorithm libcudf uses, to detect fatal CUDA errors not caught by libcudf code.

Collaborator Author

Hi @jlowe, I filed a cuDF PR (10884) to do it.

@sperlingxx
Collaborator Author

build

1 similar comment
@sperlingxx
Collaborator Author

build

@tgravescs
Collaborator

Please retarget for 22.08

@sperlingxx sperlingxx changed the base branch from branch-22.06 to branch-22.08 June 6, 2022 05:44
@sperlingxx
Collaborator Author

Hi @tgravescs, I retargeted to branch-22.08 and removed the message-scraping code used as a double-check, since cuDF can now fully take care of fatal error detection with rapidsai/cudf#10884.
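
With that change, the plugin side only needs to look at the exception type; roughly (a sketch, assuming cuDF surfaces fatal errors as ai.rapids.cudf.CudaFatalException and that the deserialized task exception is available):

def containsFatalCudaError(ef: ExceptionFailure): Boolean =
  ef.exception.exists(hasFatalCudaCause)

// Walk the cause chain, since the fatal CUDA error is usually wrapped by higher-level exceptions.
private def hasFatalCudaCause(t: Throwable): Boolean = t match {
  case null => false
  case _: ai.rapids.cudf.CudaFatalException => true
  case other => hasFatalCudaCause(other.getCause)
}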

@tgravescs
Collaborator

Going to leave this for @jlowe for now since he is familiar with the cudf change.

@sperlingxx
Collaborator Author

build

@sameerz sameerz added this to the Jun 6 - Jun 17 milestone Jun 8, 2022
jlowe previously approved these changes Jun 8, 2022
@sperlingxx
Collaborator Author

build

@sperlingxx sperlingxx merged commit 4f95734 into NVIDIA:branch-22.08 Jun 13, 2022
@sperlingxx sperlingxx deleted the recognize_cuda_error branch June 13, 2022 03:27
Labels
feature request New feature or request

Successfully merging this pull request may close these issues.

[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors
4 participants