Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pipe] prevent deadlock with multiple evals sequence #1944

Merged
merged 7 commits into from
May 10, 2022

Conversation

stas00
Copy link
Collaborator

@stas00 stas00 commented May 9, 2022

This PR solves a race condition that leads to deadlocks at random times at eval@pipe.

Here is the diagnostics with tracebacks: https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md#2022-04-30-hanging-at-eval

The suspect code is:

with torch.no_grad():
self._exec_schedule(sched)
if self.is_last_stage():
eval_output = self._reduce_outputs(self.fwd_outputs, reduce=reduce_output)
if compute_loss:
eval_output = self._bcast_pipe_scalar(eval_output)

I'm not 100% sure this is the best placement of the barrier but it solves the problem.

e.g. placing it 3 ops below after:

eval_output = self._bcast_pipe_scalar(eval_output)

didn't help.

So it's possible that some desyncing happening after L455 and before L448 when it comes in for a 2nd eval in a row.

@tjruwase, @jeffra

@jeffra
Copy link
Collaborator

jeffra commented May 9, 2022

@stas00 I just pushed a commit to your branch, I noticed DeepSpeedExamples in your branch was pointing to an older commit. Should be fixed now.

@stas00
Copy link
Collaborator Author

stas00 commented May 9, 2022

oh, weird, thank you for the fix, @jeffra - I think when I git pull - the sub-module doesn't get updated.

Have to run the hard to remember:

git submodule sync
git submodule update --init --recursive

@jeffra
Copy link
Collaborator

jeffra commented May 9, 2022

oh, weird, thank you for the fix, @jeffra - I think when I git pull - the sub-module doesn't get updated.

Have to run the hard to remember:

git submodule sync
git submodule update --init --recursive

Yeah, submodules are a bit annoying :( I think it might make sense to remove the DeepSpeedExamples submodule from our repo and just link to it in our docs. Now that we have multiple example repos (Megatron-DeepSpeed, DeepSpeedExamples, and a fairseq one coming soon) I am not sure it makes sense here anymore. Will sync w. @tjruwase and others to see how they feel though.

@stas00
Copy link
Collaborator Author

stas00 commented May 9, 2022

I totally agree.

git submodules are useful if the main repo relies on them but here it's in reverse.

Perhaps it makes sense to make deepspeed a submodule of DSE if that link is really needed ;)

@tjruwase tjruwase merged commit dbeadf1 into microsoft:master May 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants