
[BUG] shuffle_test test_hash_grpby_sum failed OOM in premerge CI #7855

Status: Closed · Fixed by #7852
Opened by pxLi (Collaborator) on Mar 7, 2023 · 2 comments
Labels: bug (Something isn't working), test (Only impacts tests)

pxLi commented on Mar 7, 2023

Describe the bug
We have seen intermittent OOM failures in shuffle_test in premerge CI (rapids_premerge-github, build IDs 6762 and 6759).

The shuffle_test run fails with:
Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-363-cuda11/thirdparty/cudf/java/src/main/native/src/RmmJni.cpp:445: cudaErrorMemoryAllocation out of memory

[2023-03-07T01:40:19.280Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 16.0 failed 1 times, most recent failure: Lost task 1.0 in stage 16.0 (TID 45) (10.233.91.183 executor 0): org.apache.spark.shuffle.rapids.RapidsShuffleFetchFailedException: Error getting client to fetch List((shuffle_6_42_1,490,0), (shuffle_6_43_1,490,1)) from BlockManagerId(1, 10.233.91.183, 46331, Some(rapids=7495))

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.$anonfun$start$3(RapidsShuffleIterator.scala:198)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.$anonfun$start$3$adapted(RapidsShuffleIterator.scala:163)

[2023-03-07T01:40:19.280Z] E                   	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)

[2023-03-07T01:40:19.280Z] E                   	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)

[2023-03-07T01:40:19.280Z] E                   	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.start(RapidsShuffleIterator.scala:163)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.next(RapidsShuffleIterator.scala:343)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.next(RapidsShuffleIterator.scala:50)

[2023-03-07T01:40:19.280Z] E                   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)

[2023-03-07T01:40:19.280Z] E                   	at scala.collection.Iterator$ConcatIterator.next(Iterator.scala:230)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)

[2023-03-07T01:40:19.280Z] E                   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$next$1(GpuExec.scala:195)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.RebaseHelper$.withResource(RebaseHelper.scala:26)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.CollectTimeIterator.next(GpuExec.scala:194)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.getHasOnDeck(GpuCoalesceBatches.scala:309)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:325)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$hasNext$2(aggregate.scala:564)

[2023-03-07T01:40:19.280Z] E                   	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)

[2023-03-07T01:40:19.280Z] E                   	at scala.Option.getOrElse(Option.scala:189)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.GpuHashAggregateIterator.hasNext(aggregate.scala:564)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:175)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.GpuRangePartitioner$.$anonfun$sketch$1(GpuRangePartitioner.scala:52)

[2023-03-07T01:40:19.280Z] E                   	at com.nvidia.spark.rapids.GpuRangePartitioner$.$anonfun$sketch$1$adapted(GpuRangePartitioner.scala:49)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:131)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)

[2023-03-07T01:40:19.280Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)

[2023-03-07T01:40:19.280Z] E                   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

[2023-03-07T01:40:19.280Z] E                   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

[2023-03-07T01:40:19.281Z] E                   	at java.lang.Thread.run(Thread.java:750)

[2023-03-07T01:40:19.281Z] E                   Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-363-cuda11/thirdparty/cudf/java/src/main/native/src/RmmJni.cpp:445: cudaErrorMemoryAllocation out of memory

[2023-03-07T01:40:19.281Z] E                   	at ai.rapids.cudf.Rmm.allocCudaInternal(Native Method)

[2023-03-07T01:40:19.281Z] E                   	at ai.rapids.cudf.Rmm.allocCuda(Rmm.java:527)

[2023-03-07T01:40:19.281Z] E                   	at ai.rapids.cudf.CudaMemoryBuffer.allocate(CudaMemoryBuffer.java:107)

[2023-03-07T01:40:19.281Z] E                   	at ai.rapids.cudf.CudaMemoryBuffer.allocate(CudaMemoryBuffer.java:97)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.$anonfun$initBounceBufferPools$1(UCXShuffleTransport.scala:136)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.$anonfun$initBounceBufferPools$1$adapted(UCXShuffleTransport.scala:133)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.BounceBufferManager.<init>(BounceBufferManager.scala:96)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.initBounceBufferPools(UCXShuffleTransport.scala:147)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.ucx$lzycompute(UCXShuffleTransport.scala:76)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.ucx(UCXShuffleTransport.scala:70)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.connect(UCXShuffleTransport.scala:218)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.ucx.UCXShuffleTransport.makeClient(UCXShuffleTransport.scala:270)

[2023-03-07T01:40:19.281Z] E                   	at com.nvidia.spark.rapids.shuffle.RapidsShuffleIterator.$anonfun$start$3(RapidsShuffleIterator.scala:187)

[2023-03-07T01:40:19.281Z] E                   	... 39 more
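The trace shows the OOM surfacing while UCXShuffleTransport lazily initializes its bounce-buffer pools: BounceBufferManager allocates CUDA device buffers through CudaMemoryBuffer.allocate and Rmm.allocCuda, and that allocation fails once the GPU has no free memory left. The sketch below is illustrative only, not spark-rapids code; the buffer size and count are made-up values. It just demonstrates that this allocation path reports device exhaustion as a java.lang.OutOfMemoryError, matching the CI log.

```scala
// Illustrative sketch, not the plugin's code: allocate CUDA device buffers the way a
// bounce-buffer pool does (via ai.rapids.cudf.CudaMemoryBuffer) and observe that
// exhaustion is reported as java.lang.OutOfMemoryError, as seen in the CI log.
import ai.rapids.cudf.CudaMemoryBuffer

object BounceBufferAllocSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical pool shape; the real sizes come from the shuffle transport config.
    val bufferSizeBytes = 4L * 1024 * 1024
    val numBuffers = 32
    val allocated = scala.collection.mutable.ArrayBuffer.empty[CudaMemoryBuffer]
    try {
      (0 until numBuffers).foreach { _ =>
        // Direct CUDA allocation; fails if the device is already full, e.g. when
        // other executors' memory pools have claimed most of the GPU.
        allocated += CudaMemoryBuffer.allocate(bufferSizeBytes)
      }
      println(s"allocated ${allocated.size} bounce buffers")
    } catch {
      case oom: OutOfMemoryError =>
        System.err.println(s"bounce buffer allocation failed: ${oom.getMessage}")
    } finally {
      allocated.foreach(_.close())
    }
  }
}
```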

Steps/Code to reproduce bug
Run the integration tests with pytest -m shuffle_test; the OOM reproduces intermittently in premerge CI.

Expected behavior
The shuffle tests should pass without running out of GPU memory.

pxLi added the bug and test labels on Mar 7, 2023

pxLi (Collaborator, Author) commented on Mar 7, 2023

Hi @abellina, can you help check whether this RMM OOM is expected? Thanks!
If it is, we may try to limit parallelism for the shuffle_smoke test or increase its memory requirement: https://github.com/NVIDIA/spark-rapids/blob/branch-23.04/integration_tests/run_pyspark_from_build.sh#L123-L125
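As a rough illustration of the first option (limiting parallelism so each test worker keeps enough GPU memory), here is a small sketch. It is not the logic from run_pyspark_from_build.sh; all names and numbers are hypothetical, and a real harness would derive them from the CI GPU and job configuration. Raising the per-worker memory floor (the "increase memory requirement" option) would lower the cap in the same way.

```scala
// Illustrative sketch only, not run_pyspark_from_build.sh: cap test parallelism so
// that each parallel worker keeps a minimum slice of GPU memory.
object ShuffleTestParallelismSketch {
  // Hypothetical values; a real harness would query the device and CI settings.
  val totalGpuMemMiB: Long = 16L * 1024    // memory on the CI GPU
  val perWorkerFloorMiB: Long = 4L * 1024  // minimum memory each worker should keep
  val requestedParallelism: Int = 8        // parallelism requested by the CI job

  def cappedParallelism: Int = {
    val fitByMemory = (totalGpuMemMiB / perWorkerFloorMiB).toInt
    math.max(1, math.min(requestedParallelism, fitByMemory))
  }

  def main(args: Array[String]): Unit =
    println(s"cap shuffle_test parallelism at $cappedParallelism workers")
}
```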

abellina (Collaborator) commented on Mar 8, 2023

Handling it in PR #7852, @pxLi. Thanks for filing it; I thought only my CI was seeing it.
