Retry with smaller split on CudfColumnSizeOverflowException
Depends on rapidsai/cudf#13911.

When a CUDF operation causes a column's size to exceed the valid range
for CUDF columns (i.e. cudf::size_type), CUDF will throw an exception.

Prior to this commit, the `RmmRapidsRetryIterator` did not attempt retries
with smaller splits in this case. Instead, the overflow was treated as
a generic exception.

This commit allows the RmmRapidsRetryIterator to recognize the exception
specific to the overflow case (i.e. `CudfColumnSizeOverflowException`),
and attempt a split-retry.

Note: This error condition is difficult to reproduce. The catch/retry is
a "best effort" attempt not to fail the entire task.

Signed-off-by: MithunR <[email protected]>
mythrocks committed Aug 21, 2023
1 parent f723dfc commit 122370c
Showing 1 changed file with 10 additions and 3 deletions.
@@ -18,6 +18,8 @@ package com.nvidia.spark.rapids

 import scala.collection.mutable

+import ai.rapids.cudf.CudfColumnSizeOverflowException
+
 import com.nvidia.spark.rapids.Arm.{closeOnExcept, withResource}
 import com.nvidia.spark.rapids.RapidsPluginImplicits._
 import com.nvidia.spark.rapids.ScalableTaskCompletion.onTaskCompletion
@@ -580,9 +582,14 @@ object RmmRapidsRetryIterator extends Logging {
   lastException = ex

   if (!topLevelIsRetry && !causedByRetry) {
-    // we want to throw early here, since we got an exception
-    // we were not prepared to handle
-    throw lastException
+    // If the exception is the result of a CUDF column size overflow, attempt split-retry.
+    ex match {
+      case _: CudfColumnSizeOverflowException => doSplit = true
+      case _ =>
+        // we want to throw early here, since we got an exception
+        // we were not prepared to handle
+        throw lastException
+    }
   }
   // else another exception wrapped a retry. So we are going to try again
 }
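
For readers unfamiliar with the split-retry idea, the following is a minimal, self-contained sketch of the pattern the change enables: when an operation fails because its output would be too large, halve the input and retry each half. It is not the plugin's implementation. `ColumnSizeOverflow`, `SplitRetrySketch`, `MaxColumnSize`, `process`, and `withSplitRetry` are hypothetical names invented for illustration, and the toy limit of 8 elements stands in for the real `cudf::size_type` bound.

```scala
// Hypothetical stand-in for ai.rapids.cudf.CudfColumnSizeOverflowException,
// defined locally so the sketch runs without the cudf library.
class ColumnSizeOverflow(msg: String) extends RuntimeException(msg)

object SplitRetrySketch {
  // Toy limit playing the role of the maximum cudf column size (assumption).
  val MaxColumnSize: Int = 8

  // Pretend GPU operation: fails when the batch is larger than the limit.
  def process(batch: Vector[Int]): Int = {
    if (batch.length > MaxColumnSize) {
      throw new ColumnSizeOverflow(
        s"batch of ${batch.length} elements exceeds $MaxColumnSize")
    }
    batch.sum
  }

  // Process the input; on overflow, split the failing batch in half and
  // retry both halves in order. Any other exception propagates unchanged.
  def withSplitRetry(input: Vector[Int]): Vector[Int] = {
    @annotation.tailrec
    def loop(pending: List[Vector[Int]], done: Vector[Int]): Vector[Int] =
      pending match {
        case Nil => done
        case batch :: rest =>
          val attempt: Either[ColumnSizeOverflow, Int] =
            try {
              Right(process(batch))
            } catch {
              // A single-element batch cannot be split further, so the
              // guard lets the exception escape in that case.
              case e: ColumnSizeOverflow if batch.length > 1 => Left(e)
            }
          attempt match {
            case Right(result) => loop(rest, done :+ result)
            case Left(_) =>
              val (left, right) = batch.splitAt(batch.length / 2)
              loop(left :: right :: rest, done)
          }
      }
    loop(List(input), Vector.empty)
  }
}
```

With the toy limit above, a 20-element input is split into four 5-element batches, each of which succeeds. As in the commit, the retry is best effort: a batch that cannot be split further still fails.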