[pipe] prevent deadlock with multiple evals sequence #1944
Merged
This PR fixes a race condition that causes deadlocks at random times during eval with the pipeline engine (eval@pipe).
Here are the diagnostics with tracebacks: https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md#2022-04-30-hanging-at-eval
The suspect code is:
DeepSpeed/deepspeed/runtime/pipe/engine.py
Lines 448 to 455 in a3b9003
I'm not 100% sure this is the best placement for the barrier, but it solves the problem.
For example, placing it 3 ops lower, after:
DeepSpeed/deepspeed/runtime/pipe/engine.py
Line 455 in a3b9003
didn't help.
So it's possible that some desyncing is happening after L455 and before L448 when execution enters a 2nd eval in a row.
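To illustrate the idea only (this is not the DeepSpeed code), here is a minimal sketch of why a barrier at the top of eval keeps back-to-back evals in lockstep. It assumes threads as stand-ins for ranks and uses `threading.Barrier` as a stand-in for the distributed barrier; `NUM_RANKS`, `NUM_EVALS`, `run_evals` and `simulate` are hypothetical names made up for this example.

```python
import threading

NUM_RANKS = 4  # hypothetical number of ranks (threads stand in for processes)
NUM_EVALS = 3  # number of evals run back to back


def run_evals(rank, barrier, log, lock):
    """Run NUM_EVALS evals in a row, synchronizing at the top of each one."""
    for step in range(NUM_EVALS):
        # Stand-in for the barrier added at the start of eval: no rank may
        # begin eval `step` until every rank has arrived here, i.e. until
        # every rank has already started the previous eval.
        barrier.wait(timeout=5)
        with lock:
            log.append((step, rank))
        # ... the real pipeline eval schedule would execute here ...


def simulate():
    """Start one thread per rank and return the ordered log of eval starts."""
    barrier = threading.Barrier(NUM_RANKS)
    lock = threading.Lock()
    log = []
    threads = [
        threading.Thread(target=run_evals, args=(rank, barrier, log, lock))
        for rank in range(NUM_RANKS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for step, rank in simulate():
        print(f"eval {step} started on rank {rank}")
```

In this toy model no rank can start eval N+1 before every rank has started eval N, which is the property the barrier is meant to restore between consecutive evals. It does not show where the desync originates in the real engine.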
@tjruwase, @jeffra