Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26713][CORE][followup] revert the partial fix in ShuffleBlockFetcherIterator #741

Merged
merged 2 commits into from
Mar 15, 2021

Conversation

rshkv
Copy link

@rshkv rshkv commented Mar 12, 2021

This is a clean cherry-pick of apache#25825, back-ported to our 2.x branch.

We're instances instances recently of task threads hanging while waiting for shuffle results. It seems related to SPARK-26713. It also matches the difference we have with upstream's latest 2.4.x release.

There is even a race condition with ShuffledRDD + PipedRDD: the ShuffleBlockFetchIterator is cleaned up at task completion and hangs stdin writer thread, which leaks memory.

It was easier to revert the commit that introduced the first fix, and then cherry-pick the combined back-port from here: apache#25825 (as opposed to just take the correction).

rshkv and others added 2 commits March 12, 2021 16:26
…ask is finished

### What changes were proposed in this pull request?
Manually release stdin writer and stderr reader thread when task is finished. This is the backport of apache#23638 including apache#25049.

### Why are the changes needed?
This is a bug fix. PipedRDD's IO threads may hang even the corresponding task is already finished. Without this fix,  it would leak resource(memory specially).

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Add new test

Closes apache#25825 from advancedxy/SPARK-26713_for_2.4.

Authored-by: Xianjin YE <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@rshkv rshkv requested a review from LorenzoMartini March 12, 2021 18:05
Comment on lines -413 to +406
while (!isZombie && result == null) {
while (result == null) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this thread correctly, we can get race conditions on the isZombie flag.

@@ -410,7 +403,7 @@ final class ShuffleBlockFetcherIterator(
// then fetch it one more time if it's corrupt, throw FailureFetchResult if the second fetch
// is also corrupt, so the previous stage could be retried.
// For local shuffle block, throw FailureFetchResult for the first IOException.
while (!isZombie && result == null) {
while (result == null) {
val startFetchWait = System.currentTimeMillis()
result = results.take()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we hang.

@LorenzoMartini LorenzoMartini changed the title [SPARK-26713][CORE][2.4] Interrupt pipe IO threads in PipedRDD when task is finished [SPARK-26713][CORE][followup] revert the partial fix in ShuffleBlockFetcherIterator Mar 15, 2021
@rshkv rshkv merged commit fcf43a2 into palantir-2.x Mar 15, 2021
@rshkv rshkv deleted the wr/spark-26713 branch March 15, 2021 11:29
@rshkv
Copy link
Author

rshkv commented Mar 15, 2021

Thank you, @LorenzoMartini.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants