[Hotfix] Synchronization upon Checkpointing #6185
Conversation
could you describe why the barrier here fixes it?
@@ -212,6 +212,7 @@ def save_checkpoint(self, trainer, pl_module):
         ):
             return

+        trainer.accelerator.barrier()
Nice, can you just add a comment to explain why this is done?
Suggested change:
-        trainer.accelerator.barrier()
+        # Synchronize all processes if using distributed
+        trainer.accelerator.barrier()
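For anyone with the same question as the reviewer above: in a distributed run, a barrier blocks each process until every process has reached it, which keeps all ranks in lockstep before any checkpoint-related collective calls. A minimal sketch in plain `torch.distributed` (the helper name `checkpoint_with_barrier` is illustrative, not from this PR; assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist

def checkpoint_with_barrier(model: torch.nn.Module, path: str = "ckpt.pt"):
    # Block until every rank reaches this line. Without it, rank 0 could
    # start saving (or enter a collective such as a broadcast) while
    # another rank is still mid-step, and mismatched collectives can hang.
    dist.barrier()
    if dist.get_rank() == 0:
        # Only rank 0 writes the file, avoiding concurrent writes.
        torch.save(model.state_dict(), path)
    # Second barrier so no rank moves past checkpointing (e.g. to read
    # the file back) before rank 0 has finished writing it.
    dist.barrier()
```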
Just FYI @justusschock I'm not sure if you found the root cause, but I noticed that if I omit this broadcast, the reproducible code runs for me.
Codecov Report
@@           Coverage Diff            @@
##           master   #6185     +/-   ##
========================================
- Coverage      93%     91%      -2%
========================================
  Files         116     159      +43
  Lines        8833   11368    +2535
========================================
+ Hits         8235   10350    +2115
- Misses        598    1018     +420
The branch this PR is based on is very old (November). Can you rebase onto master?
I can confirm that removing this line fixes the issue in my case (with a more advanced setup than in the reproduction steps: complex model, optimizer with a custom scheduler, and DDP). Would also like to know what the root cause is.
Will investigate further |
This adds a barrier before checkpointing.
Fixes #5604
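As a usage illustration of the pattern this hotfix relies on, here is how the `checkpoint_with_barrier` sketch above could be exercised in a two-process run (hypothetical launcher script, not part of this PR; assumes it lives in the same file as the sketch):

```python
import torch
import torch.distributed as dist

def main():
    # Launch with: torchrun --nproc_per_node=2 barrier_demo.py
    dist.init_process_group("gloo")  # "nccl" for multi-GPU setups
    model = torch.nn.Linear(4, 2)
    checkpoint_with_barrier(model)  # from the sketch above
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```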