[BUG] User app fails with OOM - GpuOutOfCoreSortIterator #7934

Closed · Tracked by #8029 · Fixed by #8191
tgravescs opened this issue Mar 24, 2023 · 1 comment
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

@tgravescs (Collaborator) commented Mar 24, 2023

Describe the bug
A customer job is failing with the following OOM in GpuOutOfCoreSortIterator. I worked around it by increasing the number of shuffle partitions (a sketch of that workaround follows the stack trace).

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-346-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:158: Maximum pool size exceeded
	at ai.rapids.cudf.Table.contiguousSplit(Native Method)
	at ai.rapids.cudf.Table.contiguousSplit(Table.java:2171)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.splitAfterSortAndSave(GpuSortExec.scala:358)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$7(GpuSortExec.scala:469)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$7$adapted(GpuSortExec.scala:468)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.withResource(GpuSortExec.scala:246)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$2(GpuSortExec.scala:468)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$2$adapted(GpuSortExec.scala:440)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.withResource(GpuSortExec.scala:246)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.mergeSortEnoughToOutput(GpuSortExec.scala:440)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$next$3(GpuSortExec.scala:520)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.withResource(GpuSortExec.scala:246)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:519)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:246)
	at com.nvidia.spark.rapids.GpuKeyBatchingIterator.next(GpuKeyBatchingIterator.scala:157)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator$$anon$2.next(aggregate.scala:476)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator$$anon$2.next(aggregate.scala:468)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:247)
	at scala.Option.getOrElse(Option.scala:189)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:237)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:182)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.withResource(RapidsShuffleInternalManagerBase.scala:234)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.withResource(RapidsShuffleInternalManagerBase.scala:234)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:95)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
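
For reference, a minimal sketch of the workaround mentioned above: raising spark.sql.shuffle.partitions so each task sorts a smaller slice of data and the out-of-core sort is less likely to exhaust the RMM pool. The value 800 is purely illustrative, not the setting used in the customer job.

```scala
import org.apache.spark.sql.SparkSession

// Workaround sketch: more shuffle partitions means each GPU task sorts a
// smaller chunk, reducing pressure on the RMM pool. 800 is an arbitrary
// illustrative value; tune it to the data volume and GPU memory.
val spark = SparkSession.builder()
  .appName("oom-workaround-sketch")
  .config("spark.sql.shuffle.partitions", "800")
  .getOrCreate()

// Or adjust at runtime for an existing session:
spark.conf.set("spark.sql.shuffle.partitions", "800")
```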

@tgravescs added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Mar 24, 2023
@mattahrens added the reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) label and removed the ? - Needs Triage label on Mar 28, 2023
@revans2 (Collaborator) commented Mar 29, 2023

We probably need #7672 to allow this to work. We need to be able to split the data with a retry, but to retry we first need the input to be spillable. Without #7672, making the input spillable would itself require a contiguous split, so we would likely just move the failure to that earlier contiguous split instead of the one that splits up the data.
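
To make the ordering concrete, here is a minimal, self-contained Scala sketch of the retry-with-spillable-input idea. SpillableBatch, spill(), and withRetry are hypothetical placeholders for illustration only, not spark-rapids or #7672 APIs; the point is that the input must be spillable before the split so that spilling between attempts actually frees memory for the retry.

```scala
// Hypothetical stand-ins for illustration only; these are NOT spark-rapids APIs.
trait SpillableBatch {
  def spill(): Unit // move the batch off the GPU so its memory can be reclaimed
}

object RetrySplitSketch {
  // Run `op` (e.g. a contiguous split) and, if it hits an OOM, spill the input
  // and try again. This only helps if the input was made spillable beforehand,
  // which is the ordering argued for above.
  def withRetry[T](input: SpillableBatch, attemptsLeft: Int = 3)(op: SpillableBatch => T): T =
    try {
      op(input)
    } catch {
      case _: OutOfMemoryError if attemptsLeft > 1 =>
        input.spill()
        withRetry(input, attemptsLeft - 1)(op)
    }
}
```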
