Describe the bug
We saw intermittent OOM failures in shuffle_test.
rapids_premerge-github, build IDs: 6762, 6759
shuffle_test failed with: Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-363-cuda11/thirdparty/cudf/java/src/main/native/src/RmmJni.cpp:445: cudaErrorMemoryAllocation out of memory
[2023-03-07T01:40:19.280Z] E : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 16.0 failed 1 times, most recent failure: Lost task 1.0 in stage 16.0 (TID 45) (10.233.91.183 executor 0): org.apache.spark.shuffle.rapids.RapidsShuffleFetchFailedException: Error getting client to fetch List((shuffle_6_42_1,490,0), (shuffle_6_43_1,490,1)) from BlockManagerId(1, 10.233.91.183, 46331, Some(rapids=7495))
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.$anonfun$start$3(RapidsShuffleIterator.scala:198)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.$anonfun$start$3$adapted(RapidsShuffleIterator.scala:163)
[2023-03-07T01:40:19.280Z] E at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
[2023-03-07T01:40:19.280Z] E at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[2023-03-07T01:40:19.280Z] E at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.start(RapidsShuffleIterator.scala:163)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.next(RapidsShuffleIterator.scala:343)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.next(RapidsShuffleIterator.scala:50)
[2023-03-07T01:40:19.280Z] E at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-03-07T01:40:19.280Z] E at scala.collection.Iterator$ConcatIterator.next(Iterator.scala:230)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
[2023-03-07T01:40:19.280Z] E at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:195)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.RebaseHelper$.withResource(RebaseHelper.scala:26)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:194)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:309)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:325)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$hasNext$2(aggregate.scala:564)
[2023-03-07T01:40:19.280Z] E at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
[2023-03-07T01:40:19.280Z] E at scala.Option.getOrElse(Option.scala:189)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.GpuHashAggregateIterator.hasNext(aggregate.scala:564)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:175)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.GpuRangePartitioner$.$anonfun$sketch$1(GpuRangePartitioner.scala:52)
[2023-03-07T01:40:19.280Z] E at com.nvidia.spark.rapids.GpuRangePartitioner$.$anonfun$sketch$1$adapted(GpuRangePartitioner.scala:49)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[2023-03-07T01:40:19.280Z] E at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[2023-03-07T01:40:19.280Z] E at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-03-07T01:40:19.280Z] E at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-03-07T01:40:19.281Z] E at java.lang.Thread.run(Thread.java:750)
[2023-03-07T01:40:19.281Z] E Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-363-cuda11/thirdparty/cudf/java/src/main/native/src/RmmJni.cpp:445: cudaErrorMemoryAllocation out of memory
[2023-03-07T01:40:19.281Z] E at ai.rapids.cudf.Rmm.allocCudaInternal(Native Method)
[2023-03-07T01:40:19.281Z] E at ai.rapids.cudf.Rmm.allocCuda(Rmm.java:527)
[2023-03-07T01:40:19.281Z] E at ai.rapids.cudf.CudaMemoryBuffer.allocate(CudaMemoryBuffer.java:107)
[2023-03-07T01:40:19.281Z] E at ai.rapids.cudf.CudaMemoryBuffer.allocate(CudaMemoryBuffer.java:97)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.$anonfun$initBounceBufferPools$1(UCXShuffleTransport.scala:136)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.$anonfun$initBounceBufferPools$1$adapted(UCXShuffleTransport.scala:133)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.BounceBufferManager.<init>(BounceBufferManager.scala:96)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.initBounceBufferPools(UCXShuffleTransport.scala:147)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.ucx$lzycompute(UCXShuffleTransport.scala:76)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.ucx(UCXShuffleTransport.scala:70)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.connect(UCXShuffleTransport.scala:218)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.makeClient(UCXShuffleTransport.scala:270)
[2023-03-07T01:40:19.281Z] E at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.$anonfun$start$3(RapidsShuffleIterator.scala:187)
[2023-03-07T01:40:19.281Z] E ... 39 more
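For context, the trace shows the allocation failing while UCXShuffleTransport lazily initializes its bounce buffer pools (initBounceBufferPools -> BounceBufferManager -> CudaMemoryBuffer.allocate). Below is a minimal sketch of that kind of eager device-buffer pooling, using only the ai.rapids.cudf API named in the trace; the buffer size and pool depth are illustrative assumptions, not the plugin's actual values:

```java
import ai.rapids.cudf.CudaMemoryBuffer;

public class BounceBufferPoolSketch {
  public static void main(String[] args) {
    // Illustrative values only; these are assumptions, not what the plugin uses.
    final long bufferSize = 4L * 1024 * 1024; // 4 MiB per bounce buffer (assumed)
    final int numBuffers = 32;                // pool depth (assumed)
    CudaMemoryBuffer[] pool = new CudaMemoryBuffer[numBuffers];
    try {
      for (int i = 0; i < numBuffers; i++) {
        // CudaMemoryBuffer.allocate goes through a direct CUDA allocation, so it
        // can fail with cudaErrorMemoryAllocation when the device is already
        // near capacity, which matches the failure in the trace above.
        pool[i] = CudaMemoryBuffer.allocate(bufferSize);
      }
    } finally {
      // Release any buffers that were successfully allocated.
      for (CudaMemoryBuffer buf : pool) {
        if (buf != null) buf.close();
      }
    }
  }
}
```

Because the pool is initialized lazily on the first connect(), whether this succeeds depends on how much device memory other tasks happen to hold at that moment, which would explain why the failure is intermittent.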
Steps/Code to reproduce bug
Run the integration tests with the shuffle marker: pytest -m shuffle_test
Expected behavior
The shuffle tests pass without intermittent OOM failures.