[BUG] INC AFTER CLOSE for ColumnVector during shutdown in the join code #7581

Closed
abellina opened this issue Jan 25, 2023 · 5 comments
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

abellina (Collaborator) commented Jan 25, 2023:
In the AbstractGpuJoinIterator/JoinGatherer code, I am seeing INC AFTER CLOSE when an OOM exception occurs and we are shutting down. This was seen on our performance cluster with regular settings, except for spark.rapids.memory.gpu.allocFraction=0.34.
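For context, INC AFTER CLOSE means incRefCount was called on a ColumnVector whose reference count had already dropped to zero and whose device memory had been freed. Below is a minimal, hedged illustration against the cuDF Java API (assumes a GPU and the cudf jar on the classpath; the exact exception message may differ by version):

```scala
import ai.rapids.cudf.ColumnVector

object IncAfterCloseDemo {
  def main(args: Array[String]): Unit = {
    val cv = ColumnVector.fromInts(1, 2, 3)
    cv.close() // ref count drops to zero; device memory is freed
    // Incrementing the ref count of a closed vector is the error class
    // reported in this issue: it throws, and with ref-count debugging
    // enabled, MemoryCleaner also logs an "INC AFTER CLOSE" history
    // like the stack trace below.
    cv.incRefCount()
  }
}
```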

Here is one of the INC stack traces, showing which ColumnVector was problematic. With the memory debug settings enabled we get a lot more output showing all of the INC/DEC stacks, and we would need that turned on to debug further.

Executor task launch worker for task 76.0 in stage 44.0 (TID 12577) 23/01/25 17:38:50:950 ERROR MemoryCleaner: INC AFTER CLOSE ColumnVector{rows=17977343, type=INT32, nullCount=Optional.empty, offHeap=(ID: 8773 0)} (ID: 8773): 2023-01-25 17:38:50.0376 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:333)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:91)
ai.rapids.cudf.ColumnVector.incRefCountInternal(ColumnVector.java:251)
ai.rapids.cudf.ColumnVector.<init>(ColumnVector.java:62)
ai.rapids.cudf.Table.<init>(Table.java:89)
ai.rapids.cudf.Table.gather(Table.java:2400)
com.nvidia.spark.rapids.JoinGathererImpl.$anonfun$gatherNext$2(JoinGatherer.scala:537)
com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
com.nvidia.spark.rapids.JoinGathererImpl.withResource(JoinGatherer.scala:497)
com.nvidia.spark.rapids.JoinGathererImpl.$anonfun$gatherNext$1(JoinGatherer.scala:536)
com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
com.nvidia.spark.rapids.JoinGathererImpl.withResource(JoinGatherer.scala:497)
com.nvidia.spark.rapids.JoinGathererImpl.gatherNext(JoinGatherer.scala:534)
com.nvidia.spark.rapids.MultiJoinGather.gatherNext(JoinGatherer.scala:605)
com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$nextCbFromGatherer$2(AbstractGpuJoinIterator.scala:137)
scala.Option.map(Option.scala:230)
com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$nextCbFromGatherer$1(AbstractGpuJoinIterator.scala:135)
com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
com.nvidia.spark.rapids.AbstractGpuJoinIterator.withResource(AbstractGpuJoinIterator.scala:53)
com.nvidia.spark.rapids.AbstractGpuJoinIterator.nextCbFromGatherer(AbstractGpuJoinIterator.scala:134)
com.nvidia.spark.rapids.AbstractGpuJoinIterator.$anonfun$hasNext$5(AbstractGpuJoinIterator.scala:97)
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
com.nvidia.spark.rapids.GpuMetric.ns(GpuExec.scala:166)
com.nvidia.spark.rapids.AbstractGpuJoinIterator.hasNext(AbstractGpuJoinIterator.scala:97)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1(GpuExec.scala:193)
com.nvidia.spark.rapids.CollectTimeIterator.$anonfun$hasNext$1$adapted(GpuExec.scala:192)
com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
com.nvidia.spark.RebaseHelper$.withResource(RebaseHelper.scala:26)
com.nvidia.spark.rapids.CollectTimeIterator.hasNext(GpuExec.scala:192)
scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1089)
com.nvidia.spark.rapids.CloseableBufferedIterator.hasNext(CloseableBufferedIterator.scala:38)
com.nvidia.spark.rapids.GpuBroadcastHashJoinExec.$anonfun$getBroadcastBuiltBatchAndStreamIter$2(GpuBroadcastHashJoinExec.scala:172)
com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
com.nvidia.spark.rapids.GpuBroadcastHashJoinExec.withResource(GpuBroadcastHashJoinExec.scala:107)
com.nvidia.spark.rapids.GpuBroadcastHashJoinExec.$anonfun$getBroadcastBuiltBatchAndStreamIter$1(GpuBroadcastHashJoinExec.scala:171)
com.nvidia.spark.rapids.Arm.closeOnExcept(Arm.scala:87)
com.nvidia.spark.rapids.Arm.closeOnExcept$(Arm.scala:85)
com.nvidia.spark.rapids.GpuBroadcastHashJoinExec.closeOnExcept(GpuBroadcastHashJoinExec.scala:107)
com.nvidia.spark.rapids.GpuBroadcastHashJoinExec.getBroadcastBuiltBatchAndStreamIter(GpuBroadcastHashJoinExec.scala:170)
com.nvidia.spark.rapids.GpuBroadcastHashJoinExec.$anonfun$doExecuteColumnar$1(GpuBroadcastHashJoinExec.scala:207)
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
org.apache.spark.scheduler.Task.run(Task.scala:131)
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
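For readers unfamiliar with the plugin, every com.nvidia.spark.rapids.Arm.withResource frame in the trace above is the plugin's try/finally resource-management helper. A minimal paraphrase (a sketch, not the exact source) is below; it shows why resources get closed on the OOM path, so that any stray late incRefCount becomes an INC AFTER CLOSE:

```scala
// Sketch of the Arm.withResource pattern seen throughout the trace:
// the resource is always closed, even when the block throws (e.g. a
// GPU OOM), so a later incRefCount on it fails as INC AFTER CLOSE.
def withResource[T <: AutoCloseable, V](r: T)(block: T => V): V = {
  try {
    block(r)
  } finally {
    r.close()
  }
}
```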
abellina added the bug and "? - Needs Triage" labels on Jan 25, 2023.
sameerz added the reliability label and removed "? - Needs Triage" on Jan 31, 2023.
sameerz (Collaborator) commented Jan 31, 2023:

We need to revisit this after addressing issue #7255.

jbrennan333 (Contributor) commented:
I think it is highly likely that this was fixed by #7902, in particular the changes to the allowSpilling methods in JoinGatherer that ensure they clear the cached batch (cb) if the attempt to spill fails.
@abellina, are you OK with closing this?
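For illustration only, here is a minimal sketch of the defensive pattern described above. This is not the actual #7902 diff; the holder type and method names are hypothetical:

```scala
// Hypothetical holder for a cached columnar batch. If the spill attempt
// throws (e.g. OOM while copying to host), we drop our reference so that
// later code cannot touch columns that may already have been closed.
class CachedBatchHolder[T <: AutoCloseable](private var cached: Option[T]) {
  def allowSpilling(spill: T => Unit): Unit = {
    cached.foreach { batch =>
      try {
        spill(batch)
      } catch {
        case t: Throwable =>
          cached = None // clear the cached batch on a failed spill
          throw t
      }
    }
  }
}
```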

abellina (Collaborator, Author) commented Apr 4, 2023:

It would be worth re-running NDS at 3TB with the restricted allocFraction config and the INC/DEC debug config enabled. I didn't call out the specific test here, but given my configs on that cluster it was this benchmark.
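For reference, a hedged sketch of how that debug run could be configured. The ai.rapids.refcount.debug property name is my reading of cuDF's MemoryCleaner (RefCountDebugItem appears in the trace above) and should be verified against the cuDF version in use:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a session configured like the failing run, with cuDF
// ref-count debug tracking turned on in the executor JVMs.
val spark = SparkSession.builder()
  .config("spark.rapids.memory.gpu.allocFraction", "0.34") // restricted pool from the report
  .config("spark.executor.extraJavaOptions",
    "-Dai.rapids.refcount.debug=true") // assumed property name; verify for your cuDF version
  .getOrCreate()
```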

jbrennan333 (Contributor) commented:
I ran NDS at 3TB with spark.rapids.memory.gpu.allocFraction=0.34 and did not see a recurrence of this. I also did not see any task failures, so I tried running with GPU memory at 6GB, and still did not see any INC AFTER CLOSE errors in the join code. I think this one is fixed.

abellina (Collaborator, Author) commented Apr 5, 2023:

Thanks for running it @jbrennan333.
