[BUG] mixed blocking spill test timeout #9671

Open
jlowe opened this issue Nov 9, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@jlowe
Contributor

jlowe commented Nov 9, 2023

From a recent nightly test run:

[2023-11-09T13:19:45.093Z] - mixed blocking alloc with spill *** FAILED ***
[2023-11-09T13:19:45.095Z]   java.util.concurrent.TimeoutException:
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.HostAllocSuite$TaskThread$TaskThreadTrackingOp.get(HostAllocSuite.scala:107)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.HostAllocSuite$AllocOnAnotherThread.waitForAlloc(HostAllocSuite.scala:218)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$50(HostAllocSuite.scala:705)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$50$adapted(HostAllocSuite.scala:703)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$49(HostAllocSuite.scala:703)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$49$adapted(HostAllocSuite.scala:699)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$48(HostAllocSuite.scala:699)
[2023-11-09T13:19:45.095Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$48$adapted(HostAllocSuite.scala:690)
[2023-11-09T13:19:45.095Z]   ...
[2023-11-09T13:19:45.658Z] *** RUN ABORTED ***
[2023-11-09T13:19:45.658Z]   java.lang.AssertionError: Leaked 1 pinned allocations
[2023-11-09T13:19:45.658Z]   at ai.rapids.cudf.PinnedMemoryPool.close(PinnedMemoryPool.java:317)
[2023-11-09T13:19:45.658Z]   at ai.rapids.cudf.PinnedMemoryPool.shutdown(PinnedMemoryPool.java:217)
[2023-11-09T13:19:45.658Z]   at com.nvidia.spark.rapids.HostAllocSuite.beforeEach(HostAllocSuite.scala:382)
[2023-11-09T13:19:45.658Z]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:233)
[2023-11-09T13:19:45.658Z]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[2023-11-09T13:19:45.658Z]   at com.nvidia.spark.rapids.HostAllocSuite.runTest(HostAllocSuite.scala:34)
[2023-11-09T13:19:45.658Z]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[2023-11-09T13:19:45.658Z]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[2023-11-09T13:19:45.658Z]   at scala.collection.immutable.List.foreach(List.scala:431)
[2023-11-09T13:19:45.658Z]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[2023-11-09T13:19:45.658Z]   ...
[2023-11-09T13:19:45.658Z] 23/11/09 13:19:45.602 Thread-2 ERROR PinnedMemoryPool: A PINNED HOST BUFFER WAS LEAKED (ID: 42 7f3aef000000)
[2023-11-09T13:19:45.658Z] 23/11/09 13:19:45.610 Thread-2 ERROR MemoryCleaner: Leaked pinned host buffer (ID: 42): 2023-11-09 13:19:44.0512 GMT: INC
[2023-11-09T13:19:45.658Z] java.lang.Thread.getStackTrace(Thread.java:1564)
[2023-11-09T13:19:45.658Z] ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
[2023-11-09T13:19:45.658Z] ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
[2023-11-09T13:19:45.658Z] ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:275)
[2023-11-09T13:19:45.658Z] ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:117)
[2023-11-09T13:19:45.658Z] ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
[2023-11-09T13:19:45.658Z] ai.rapids.cudf.PinnedMemoryPool.tryAllocateInternal(PinnedMemoryPool.java:372)
[2023-11-09T13:19:45.658Z] ai.rapids.cudf.PinnedMemoryPool.tryAllocate(PinnedMemoryPool.java:233)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAlloc.tryAllocPinned(HostAlloc.scala:238)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAlloc.$anonfun$tryAlloc$1(HostAlloc.scala:308)
[2023-11-09T13:19:45.658Z] scala.Option.orElse(Option.scala:447)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAlloc.tryAlloc(HostAlloc.scala:305)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAlloc.alloc(HostAlloc.scala:317)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAlloc$.alloc(HostAlloc.scala:418)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAllocSuite$AllocOnAnotherThread.com$nvidia$spark$rapids$HostAllocSuite$AllocOnAnotherThread$$doAlloc(HostAllocSuite.scala:265)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAllocSuite$AllocOnAnotherThread$$anon$2.doIt(HostAllocSuite.scala:207)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAllocSuite$AllocOnAnotherThread$$anon$2.doIt(HostAllocSuite.scala:205)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAllocSuite$TaskThread$TaskThreadTrackingOp.doIt(HostAllocSuite.scala:77)
[2023-11-09T13:19:45.658Z] com.nvidia.spark.rapids.HostAllocSuite$TaskThread.run(HostAllocSuite.scala:184)
jlowe added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Nov 9, 2023
mattahrens removed the "? - Needs Triage" label on Nov 14, 2023
@res-life
Collaborator

I also hit this today; the failure is intermittent.
We may need to increase the timeout.
From Liangcai:

Is this related to the compression/decompression for disk spilling that was added recently?
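
For context on the timeout suggestion: the first log shows a `TimeoutException` from a timed wait, followed by a leaked pinned allocation that was created on the task thread. One possible reading, which is an assumption and not confirmed in this thread, is that the allocation completed after the test had already stopped waiting for it. The sketch below illustrates that general pattern with plain `java.util.concurrent`. `TimeoutLeakSketch`, `outstanding`, and `allocTimesOut` are hypothetical names for illustration only; none of this is code from `HostAllocSuite` or cudf.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

final class TimeoutLeakSketch {
    // Hypothetical stand-in for a pool's count of allocations not yet closed.
    static final AtomicInteger outstanding = new AtomicInteger();

    /**
     * Runs a slow "allocation" on another thread and waits up to timeoutMs for it.
     * Returns true if the wait timed out, false if the allocation arrived in time.
     */
    static boolean allocTimesOut(long allocDelayMs, long timeoutMs) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Future<Integer> pending = exec.submit(() -> {
                Thread.sleep(allocDelayMs);           // simulate a slow allocation
                return outstanding.incrementAndGet(); // the allocation now exists
            });
            try {
                pending.get(timeoutMs, TimeUnit.MILLISECONDS);
                outstanding.decrementAndGet();        // waiter received it and "closes" it
                return false;
            } catch (TimeoutException e) {
                // The waiter gives up, but the background task is not cancelled,
                // so the allocation can still complete with nobody left to close it.
                return true;
            }
        } finally {
            exec.shutdown();                          // lets the already-submitted task finish
            exec.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        boolean timedOut = allocTimesOut(300, 20);
        System.out.println("timed out: " + timedOut + ", outstanding: " + outstanding.get());
    }
}
```

If this is what happens in the test, a longer timeout would make the failure less likely but would not remove the race; cleaning up the late allocation after a timeout would address the leak separately.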

@jlowe
Contributor Author

jlowe commented Dec 1, 2023

A similar test recently failed in a nightly run; I'm not sure whether it has the same root cause. Happy to file a separate issue if it looks unrelated:

[2023-12-01T18:19:36.815Z] - simple mixed blocking alloc *** FAILED ***
[2023-12-01T18:19:36.815Z]   java.util.concurrent.TimeoutException:
[2023-12-01T18:19:36.815Z]   at com.nvidia.spark.rapids.HostAllocSuite$TaskThread$TaskThreadTrackingOp.get(HostAllocSuite.scala:111)
[2023-12-01T18:19:36.815Z]   at com.nvidia.spark.rapids.HostAllocSuite$AllocOnAnotherThread.waitForAlloc(HostAllocSuite.scala:220)
[2023-12-01T18:19:36.815Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$34(HostAllocSuite.scala:530)
[2023-12-01T18:19:36.815Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$34$adapted(HostAllocSuite.scala:524)
[2023-12-01T18:19:36.815Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-01T18:19:36.815Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$33(HostAllocSuite.scala:524)
[2023-12-01T18:19:36.816Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$33$adapted(HostAllocSuite.scala:520)
[2023-12-01T18:19:36.816Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-01T18:19:36.816Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$32(HostAllocSuite.scala:520)
[2023-12-01T18:19:36.816Z]   at com.nvidia.spark.rapids.HostAllocSuite.$anonfun$new$32$adapted(HostAllocSuite.scala:516)
[2023-12-01T18:19:36.816Z]   ...
