
Halt Spark executor when encountering unrecoverable CUDA errors #5350

Merged
sperlingxx merged 11 commits into NVIDIA:branch-22.08 from the recognize_cuda_error branch on Jun 13, 2022

Conversation

sperlingxx
Collaborator

Signed-off-by: sperlingxx [email protected]

Closes #5029

Following #5118, this PR detects unrecoverable (fatal) CUDA errors through the cuDF utility, which applies a more comprehensive check to determine whether a CUDA error is fatal or not.
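
At a high level, the executor-side behavior is: inspect each task failure and, if it carries an unrecoverable CUDA error, halt the executor so Spark stops scheduling further task attempts onto a broken GPU context. A rough sketch of that shape, as it might sit in the executor plugin (illustrative only, not the exact code in this PR; the detection helper and exit code are placeholders, and Spark's ExecutorPlugin.onTaskFailed hook is assumed):

override def onTaskFailed(failureReason: TaskFailedReason): Unit = failureReason match {
  case ef: ExceptionFailure if isFatalCudaError(ef) =>  // placeholder detection helper
    logError(s"Stopping the executor due to an unrecoverable CUDA error: ${ef.description}")
    System.exit(20)  // illustrative exit code
  case _ =>  // recoverable failures go through Spark's normal task retry handling
}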

@sperlingxx sperlingxx requested review from jlowe and tgravescs April 28, 2022 03:58
@sperlingxx
Collaborator Author

build

@sameerz sameerz added the feature request New feature or request label Apr 28, 2022
private val CUDF_EXCEPTION: String = classOf[ai.rapids.cudf.CudfException].getName

def isCudaException(ef: ExceptionFailure): Boolean = {
  ef.description.contains(CUDA_EXCEPTION) || ef.toErrorString.contains(CUDA_EXCEPTION)
}

Collaborator

are the cudf exceptions not serializable? Otherwise it looks like we should be able to use exception() function call to look at the actual exception type. I guess we can keep the string parsing if the exception is None. @jlowe for opinion
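
i.e., something like this sketch of the same members (the hasCudaCause helper is illustrative):

def isCudaException(ef: ExceptionFailure): Boolean = ef.exception match {
  // Preferred: inspect the deserialized exception (and its causes) by type.
  case Some(e) => hasCudaCause(e)
  // Fallback: the exception was not serializable, so only the text is available.
  case None =>
    ef.description.contains(CUDA_EXCEPTION) || ef.toErrorString.contains(CUDA_EXCEPTION)
}

private def hasCudaCause(t: Throwable): Boolean = t match {
  case null => false
  case _: ai.rapids.cudf.CudaException => true
  case other => hasCudaCause(other.getCause)
}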

Member

Agree we should pursue a solution that does not rely on scraping description messages if possible.

Collaborator Author

Done.

Signed-off-by: sperlingxx <[email protected]>
@sperlingxx
Collaborator Author

build

@@ -328,6 +324,34 @@ object RapidsExecutorPlugin {
zipped.forall { case (e, a) => e <= a }
}
}

private val CUDA_EXCEPTION: String = classOf[ai.rapids.cudf.CudaException].getName

Member

Curious, why not import the classes rather than fully qualify them everywhere?
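
For illustration, the import-based alternative would look something like:

import ai.rapids.cudf.{CudaException, CudfException}

private val CUDA_EXCEPTION: String = classOf[CudaException].getName
private val CUDF_EXCEPTION: String = classOf[CudfException].getName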

Comment on lines 335 to 337
case None =>
  ef.description.contains(CUDA_EXCEPTION) ||
    ef.toErrorString.contains(CUDA_EXCEPTION)

Member

Is it necessary to scrape the description message now that we're catching it by the class type? Has the code been tested to verify we are not coming through the message-scraping code path?

Collaborator Author

Hi @jlowe, I encountered a problem: I couldn't find an easy way to trigger CUDA errors via Spark APIs.

Member

You can trigger an illegal address with some hackery via the cudf Java APIs. For example, this builds a column view that points to an invalid device address and triggers an illegal address exception:

import ai.rapids.cudf._

// Pretends to be a 256-byte device buffer sitting at an invalid device address.
class BadDeviceBuffer extends BaseDeviceMemoryBuffer(256L, 256L, null.asInstanceOf[MemoryBuffer.MemoryBufferCleaner]) {
  override def slice(offset: Long, len: Long): MemoryBuffer = null
}

[...]

// The add kernel dereferences the bogus address; the illegal-address error surfaces on the stream sync.
withResource(ColumnView.fromDeviceBuffer(new BadDeviceBuffer(), 0, DType.INT8, 256)) { v =>
  withResource(v.add(v)) { vsum =>
    Cuda.DEFAULT_STREAM.sync
  }
}

Note that it may be a bit tricky to do this as an automated test, as the GPU will become useless for any subsequent tests in the same process. Once the illegal address is triggered, most CUDA APIs will fail until the process is terminated.

Collaborator Author

@sperlingxx May 11, 2022

Hi @jlowe, I tried the above approach, and it did create a fatal CUDA error. However, I found that cuDF JNI couldn't identify (fatal) CUDA errors caused by RMM calls, because CUDA calls in rmm are wrapped with RMM_CUDA_TRY_ALLOC instead of CUDF_CUDA_TRY:

#define RMM_CUDA_TRY_ALLOC(_call)                                                                  \
  do {                                                                                             \
    cudaError_t const error = (_call);                                                             \
    if (cudaSuccess != error) {                                                                    \
      cudaGetLastError();                                                                          \
      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" + RMM_STRINGIFY(__LINE__) + \
                       ": " + cudaGetErrorName(error) + " " + cudaGetErrorString(error);           \
      if (cudaErrorMemoryAllocation == error) {                                                    \
        throw rmm::out_of_memory{msg};                                                             \
      } else {                                                                                     \
        throw rmm::bad_alloc{msg};                                                                 \
      }                                                                                            \
    }                                                                                              \
  } while (0)

Collaborator Author

Shall we catch all rmm::bad_alloc as CudaFatalException in cuDF JNI?

Member

Shall we catch all rmm::bad_alloc as CudaFatalException in cuDF JNI?

rmm::bad_alloc simply means "something else went wrong other than an allocation failure." It does not necessarily mean a CUDA fatal exception occurred.

However, this demonstrates that CUDA errors are asynchronous, and we're not guaranteed that libcudf is the one processing the error directly (and thus translating it into a CUDA fatal exception). That means we really have no choice but to fall back to scraping the messages, looking for something that looks like a fatal CUDA error (i.e., something like Tom's original change). We can still leverage the cudf exception types, but I don't see a way to avoid scraping the messages in the worst case. 😞

Collaborator Author

Yes, it seems there is no other way but to scrape the error messages in order to make sure we catch as many fatal errors as possible. I added back Tom's original approach as a double-check on CUDA errors thrown by cuDF/RMM, since libcudf may not be able to catch all fatal errors.
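
The double-check itself is just string matching; roughly (an illustrative sketch with a non-exhaustive list of "sticky" CUDA error names, not the exact list from Tom's change):

def looksLikeFatalCudaError(msg: String): Boolean = {
  // Sticky CUDA errors leave the context unusable until the process restarts.
  val stickyCudaErrors = Seq(
    "cudaErrorIllegalAddress",
    "cudaErrorLaunchFailure",
    "cudaErrorECCUncorrectable")
  stickyCudaErrors.exists(e => msg.contains(e))
}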

Member

I was reminded that we can avoid scraping the error messages by using the same type of CUDA fatal error detection algorithm as implemented by libcudf to decide when to throw a fatal CUDA error. The problem with scraping the error message is that there's no guarantee that the information needed to discern a fatal CUDA error is in the error message string (since the error could have been detected and thrown by different code than libcudf).

So we can still leverage the cudf specific fatal exceptions, but for any other type of exception we should double-check the CUDA error state, using the same algorithm libcudf uses, to detect fatal CUDA errors not caught by libcudf code.

Collaborator Author

Hi @jlowe, I filed a cuDF PR (10884) to do it.

@sperlingxx
Collaborator Author

build

1 similar comment
@sperlingxx
Collaborator Author

build

@tgravescs
Collaborator

Please retarget for 22.08

@sperlingxx sperlingxx changed the base branch from branch-22.06 to branch-22.08 June 6, 2022 05:44
@sperlingxx
Collaborator Author

Hi @tgravescs, I retargeted to branch-22.08 and removed the message-scraping code used as a double-check, since cuDF can now fully take care of fatal error detection with rapidsai/cudf#10884.
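
With that change, the plugin side only needs to look at the exception type; roughly (a sketch, assuming cuDF surfaces fatal errors as ai.rapids.cudf.CudaFatalException and that the deserialized task exception is available):

def containsFatalCudaError(ef: ExceptionFailure): Boolean =
  ef.exception.exists(hasFatalCudaCause)

// Walk the cause chain, since the fatal CUDA error is usually wrapped by higher-level exceptions.
private def hasFatalCudaCause(t: Throwable): Boolean = t match {
  case null => false
  case _: ai.rapids.cudf.CudaFatalException => true
  case other => hasFatalCudaCause(other.getCause)
}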

@tgravescs
Collaborator

Going to leave this for @jlowe for now since he is familiar with the cudf change.

@sperlingxx
Collaborator Author

build

@sameerz sameerz added this to the Jun 6 - Jun 17 milestone Jun 8, 2022
jlowe previously approved these changes Jun 8, 2022
@sperlingxx
Collaborator Author

build

@sperlingxx sperlingxx merged commit 4f95734 into NVIDIA:branch-22.08 Jun 13, 2022
@sperlingxx sperlingxx deleted the recognize_cuda_error branch June 13, 2022 03:27
Labels
feature request New feature or request

Successfully merging this pull request may close these issues.

[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors
4 participants