[BUG] User app fails with OOM - GpuOutOfCoreSortIterator #7934

Closed · Tracked by #8029 · Fixed by #8191
tgravescs opened this issue Mar 24, 2023 · 1 comment
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

@tgravescs (Collaborator) commented Mar 24, 2023

Describe the bug
A customer job is failing with the following OOM in GpuOutOfCoreSortIterator. I worked around it by increasing the number of shuffle partitions (a sketch of that workaround follows the stack trace).

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-346-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:158: Maximum pool size exceeded
	at ai.rapids.cudf.Table.contiguousSplit(Native Method)
	at ai.rapids.cudf.Table.contiguousSplit(Table.java:2171)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.splitAfterSortAndSave(GpuSortExec.scala:358)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$7(GpuSortExec.scala:469)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$7$adapted(GpuSortExec.scala:468)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.withResource(GpuSortExec.scala:246)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$2(GpuSortExec.scala:468)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$2$adapted(GpuSortExec.scala:440)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.withResource(GpuSortExec.scala:246)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.mergeSortEnoughToOutput(GpuSortExec.scala:440)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$next$3(GpuSortExec.scala:520)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.withResource(GpuSortExec.scala:246)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:519)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:246)
	at com.nvidia.spark.rapids.GpuKeyBatchingIterator.next(GpuKeyBatchingIterator.scala:157)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator$$anon$2.next(aggregate.scala:476)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator$$anon$2.next(aggregate.scala:468)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:247)
	at scala.Option.getOrElse(Option.scala:189)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:237)
	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:182)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.withResource(RapidsShuffleInternalManagerBase.scala:234)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.withResource(RapidsShuffleInternalManagerBase.scala:234)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:95)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
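
For reference, a minimal sketch of the workaround mentioned above: raising spark.sql.shuffle.partitions so each task sorts a smaller slice of data and the out-of-core sort is less likely to exhaust the RMM pool. The value 800 is purely illustrative, not the setting used in the customer job.

```scala
import org.apache.spark.sql.SparkSession

// Workaround sketch: more shuffle partitions means each GPU task sorts a
// smaller chunk, reducing pressure on the RMM pool. 800 is an arbitrary
// illustrative value; tune it to the data volume and GPU memory.
val spark = SparkSession.builder()
  .appName("oom-workaround-sketch")
  .config("spark.sql.shuffle.partitions", "800")
  .getOrCreate()

// Or adjust at runtime for an existing session:
spark.conf.set("spark.sql.shuffle.partitions", "800")
```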

@tgravescs added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Mar 24, 2023
@mattahrens added the reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) label and removed the ? - Needs Triage label on Mar 28, 2023
@revans2 (Collaborator) commented Mar 29, 2023

We probably need #7672 to allow this to work. We need to be able to split the data with a retry, but to retry we first need the input to be spillable. Without #7672, making the input spillable would itself require a contiguous split, so we would likely just move the failure to that earlier contiguous split instead of the one that splits up the data.
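
To make the ordering concrete, here is a minimal, self-contained Scala sketch of the retry-with-spillable-input idea. SpillableBatch, spill(), and withRetry are hypothetical placeholders for illustration only, not spark-rapids or #7672 APIs; the point is that the input must be spillable before the split so that spilling between attempts actually frees memory for the retry.

```scala
// Hypothetical stand-ins for illustration only; these are NOT spark-rapids APIs.
trait SpillableBatch {
  def spill(): Unit // move the batch off the GPU so its memory can be reclaimed
}

object RetrySplitSketch {
  // Run `op` (e.g. a contiguous split) and, if it hits an OOM, spill the input
  // and try again. This only helps if the input was made spillable beforehand,
  // which is the ordering argued for above.
  def withRetry[T](input: SpillableBatch, attemptsLeft: Int = 3)(op: SpillableBatch => T): T =
    try {
      op(input)
    } catch {
      case _: OutOfMemoryError if attemptsLeft > 1 =>
        input.spill()
        withRetry(input, attemptsLeft - 1)(op)
    }
}
```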
