[BUG] race condition while registering a buffer and spilling at the same time #643

Closed
abellina opened this issue Sep 2, 2020 · 0 comments · Fixed by #644
Labels: bug (Something isn't working), shuffle (things that impact the shuffle plugin)

Comments

abellina commented Sep 2, 2020

This issue was found internally. It looks like a race condition between adding a device buffer and spilling at the same time:

20/09/02 13:43:43 ERROR RapidsDeviceMemoryStore: Error while adding, freeing the buffer ShuffleBufferId(shuffle_0_474_785,30785): 
java.lang.IllegalStateException: Buffer ID ShuffleBufferId(shuffle_0_474_785,30785) already registered host buffer size=544128
	at com.nvidia.spark.rapids.RapidsBufferCatalog.registerNewBuffer(RapidsBufferCatalog.scala:69)
	at com.nvidia.spark.rapids.RapidsBufferStore.addBuffer(RapidsBufferStore.scala:197)
	at com.nvidia.spark.rapids.RapidsDeviceMemoryStore.addTable(RapidsDeviceMemoryStore.scala:55)
	at org.apache.spark.sql.rapids.RapidsCachingWriter.$anonfun$write$1(RapidsShuffleInternalManager.scala:116)
	at org.apache.spark.sql.rapids.RapidsCachingWriter.$anonfun$write$1$adapted(RapidsShuffleInternalManager.scala:96)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at org.apache.spark.sql.rapids.RapidsCachingWriter.write(RapidsShuffleInternalManager.scala:96)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@jlowe already pointed me at the reason: we add a buffer to the store's list of buffers, but we do not register it with the catalog in the same critical section. A spill can therefore happen before the catalog knows about the buffer, and the spill by default registers it with the catalog as a new buffer. When the original catalog registration then runs, it finds the ID already registered and we hit the exception above.
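
To make the interleaving concrete, here is a minimal, deterministic sketch of the race as described above. `ToyCatalog`, `ToyStore`, `addBufferRacy`, `addBufferAtomic`, and `RaceSketch` are hypothetical names made up for illustration; they are not the plugin's real classes, and this is not necessarily how the actual fix is written. The `betweenSteps` callback only simulates a spill landing in the gap between the two steps.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a buffer catalog: rejects duplicate IDs.
final class ToyCatalog {
  private val registered = mutable.Set.empty[String]

  def register(id: String): Unit = synchronized {
    if (!registered.add(id)) {
      throw new IllegalStateException(s"Buffer ID $id already registered")
    }
  }

  def isRegistered(id: String): Boolean = synchronized { registered.contains(id) }
}

// Hypothetical stand-in for a buffer store that can spill its buffers.
final class ToyStore(catalog: ToyCatalog) {
  private val buffers = mutable.LinkedHashSet.empty[String]

  // Racy shape: the buffer becomes spillable before the catalog knows about it.
  // betweenSteps simulates whatever runs in the gap (for example, a spill).
  def addBufferRacy(id: String, betweenSteps: () => Unit): Unit = {
    synchronized { buffers += id }
    betweenSteps()
    catalog.register(id) // throws if a spill already registered this ID
  }

  // Safer shape: tracking and catalog registration share one critical section.
  def addBufferAtomic(id: String): Unit = synchronized {
    buffers += id
    catalog.register(id)
  }

  // Spill: remove the buffer from this store; if the catalog has not seen
  // the ID yet, register the spilled copy as a new buffer.
  def spill(id: String): Unit = synchronized {
    if (buffers.remove(id) && !catalog.isRegistered(id)) {
      catalog.register(id)
    }
  }
}

object RaceSketch {
  // Returns true when the racy ordering reproduces the duplicate registration.
  def racyFails(): Boolean = {
    val catalog = new ToyCatalog
    val store = new ToyStore(catalog)
    try {
      store.addBufferRacy("shuffle_0", () => store.spill("shuffle_0"))
      false
    } catch {
      case _: IllegalStateException => true
    }
  }

  // Returns true when add-then-spill leaves exactly one registration in place.
  def atomicSucceeds(): Boolean = {
    val catalog = new ToyCatalog
    val store = new ToyStore(catalog)
    store.addBufferAtomic("shuffle_0")
    store.spill("shuffle_0")
    catalog.isRegistered("shuffle_0")
  }
}
```

In this sketch, `RaceSketch.racyFails()` hits the same kind of `IllegalStateException` as in the log, while `RaceSketch.atomicSucceeds()` does not, because the spill cannot observe the buffer until the catalog has it too.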

abellina added the bug, "? - Needs Triage", and shuffle labels on Sep 2, 2020
abellina self-assigned this on Sep 2, 2020
sameerz added this to the Aug 31 - Sep 11 milestone on Sep 2, 2020
sameerz removed the "? - Needs Triage" label on Sep 29, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023