This issue was found internally; it looks like a race condition between adding a device buffer and spilling:
20/09/02 13:43:43 ERROR RapidsDeviceMemoryStore: Error while adding, freeing the buffer ShuffleBufferId(shuffle_0_474_785,30785):
java.lang.IllegalStateException: Buffer ID ShuffleBufferId(shuffle_0_474_785,30785) already registered host buffer size=544128
at com.nvidia.spark.rapids.RapidsBufferCatalog.registerNewBuffer(RapidsBufferCatalog.scala:69)
at com.nvidia.spark.rapids.RapidsBufferStore.addBuffer(RapidsBufferStore.scala:197)
at com.nvidia.spark.rapids.RapidsDeviceMemoryStore.addTable(RapidsDeviceMemoryStore.scala:55)
at org.apache.spark.sql.rapids.RapidsCachingWriter.$anonfun$write$1(RapidsShuffleInternalManager.scala:116)
at org.apache.spark.sql.rapids.RapidsCachingWriter.$anonfun$write$1$adapted(RapidsShuffleInternalManager.scala:96)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at org.apache.spark.sql.rapids.RapidsCachingWriter.write(RapidsShuffleInternalManager.scala:96)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
@jlowe already pointed me at the cause: we add a buffer to the store's list of buffers, but we do not register it with the catalog in the same critical section. A spill can therefore happen before the catalog knows about the buffer, and the spill path by default registers it with the catalog as a new buffer. When the original catalog registration finally runs, we hit the exception above.
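To make the race concrete, here is a minimal Scala sketch of the pattern described above and of the obvious fix (do both steps inside one critical section). The names here (SketchCatalog, SketchStore, addBufferRacy, spillAll) are illustrative stand-ins, not the actual RapidsBufferCatalog / RapidsDeviceMemoryStore code:

```scala
import scala.collection.mutable

final case class BufferId(id: Int)

// Illustrative stand-in for the catalog: tracks which buffer IDs have been
// registered and throws on a duplicate registration, mirroring the
// IllegalStateException in the stack trace above.
class SketchCatalog {
  private val registered = mutable.HashMap.empty[BufferId, String]

  def registerNewBuffer(id: BufferId, desc: String): Unit = synchronized {
    if (registered.contains(id)) {
      throw new IllegalStateException(
        s"Buffer ID $id already registered ${registered(id)}")
    }
    registered.put(id, desc)
  }

  def contains(id: BufferId): Boolean = synchronized(registered.contains(id))

  def updateBuffer(id: BufferId, desc: String): Unit =
    synchronized { registered.put(id, desc) }
}

// Illustrative stand-in for the device store.
class SketchStore(catalog: SketchCatalog) {
  private val buffers = mutable.ArrayBuffer.empty[BufferId]

  // Racy version: the buffer is visible to the spill path (via `buffers`)
  // before the catalog knows about it.
  def addBufferRacy(id: BufferId): Unit = {
    synchronized { buffers += id }
    // <-- a concurrent spill can run here and register `id` first
    catalog.registerNewBuffer(id, "device buffer")
  }

  // Fixed version: add to the buffer list and register with the catalog
  // inside the same critical section, so the spill path (which also takes
  // this lock) can never see a buffer that is not yet in the catalog.
  def addBuffer(id: BufferId): Unit = synchronized {
    buffers += id
    catalog.registerNewBuffer(id, "device buffer")
  }

  // Spill sketch: move everything to the host tier and tell the catalog.
  // If a buffer is not in the catalog yet (the racy window above), it is
  // registered as a new buffer, and the device-side registerNewBuffer that
  // follows blows up with "already registered".
  def spillAll(): Unit = synchronized {
    buffers.foreach { id =>
      if (catalog.contains(id)) catalog.updateBuffer(id, "host buffer")
      else catalog.registerNewBuffer(id, "host buffer")
    }
    buffers.clear()
  }
}
```

With addBufferRacy, a spill interleaved between the two steps registers the buffer first and the later device-side registration throws; with addBuffer, the spill path always finds the buffer already in the catalog and only updates it.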
* Revert "verify automerge fix of Token permission (NVIDIA#643)"
This reverts commit 8261117.
* Revert "try use new token to fix automerge permission"
This reverts commit 2a9acde.
Signed-off-by: Peixin Li <[email protected]>
Signed-off-by: Peixin Li <[email protected]>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue on Nov 30, 2023.