
[Hotfix] Synchronization upon Checkpointing #6185

Closed
justusschock wants to merge 2 commits into master from model_checkpoint_multi_gpu

Conversation

justusschock
Member

This adds a barrier before checkpointing.

Fixes #5604
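
As background (not part of the PR itself), here is a minimal sketch of the idea using plain torch.distributed rather than Lightning's accelerator API; the function name and setup are illustrative assumptions only:

```python
# Minimal sketch (assumption: plain torch.distributed, not Lightning's accelerator API).
# The barrier keeps all ranks in lockstep around the checkpoint write, so no rank
# races ahead into the next collective call while rank 0 is still saving.
import torch
import torch.distributed as dist

def save_checkpoint_synced(model, path, rank):
    dist.barrier()                      # wait until every rank reaches this point
    if rank == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()                      # ensure the file exists before any rank moves on
```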

@ananthsub (Contributor) left a comment

could you describe why the barrier here fixes it?

@@ -212,6 +212,7 @@ def save_checkpoint(self, trainer, pl_module):
        ):
            return

+       trainer.accelerator.barrier()

Contributor

Nice, can you just add a comment to explain why this is done?

Suggested change
-    trainer.accelerator.barrier()
+    # Synchronize all processes if using distributed
+    trainer.accelerator.barrier()

@SeanNaren
Contributor

Just FYI @justusschock I'm not sure if you found the root cause, but I noticed that if I omit this broadcast, the reproducible code runs for me:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/callbacks/model_checkpoint.py#L599
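
To illustrate why a stray broadcast can make training stop silently (a hedged sketch with hypothetical names, not the actual Lightning code path): collective ops such as broadcast must be called by every rank in the process group, so if control flow lets only some ranks reach the call, those ranks block indefinitely and no error is raised.

```python
# Hypothetical illustration of a mismatched collective (not Lightning code).
# If `should_sync` evaluates differently across ranks, the ranks that do call
# broadcast wait forever for the ranks that skipped it -- training just hangs.
import torch
import torch.distributed as dist

def maybe_sync_best_score(best_score: torch.Tensor, should_sync: bool) -> torch.Tensor:
    if should_sync:                     # must be True on every rank, or the call deadlocks
        dist.broadcast(best_score, src=0)
    return best_score
```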

codecov bot commented Feb 24, 2021

Codecov Report

Merging #6185 (bd50df7) into master (ee35907) will decrease coverage by 2%.
The diff coverage is 96%.

@@           Coverage Diff            @@
##           master   #6185     +/-   ##
========================================
- Coverage      93%     91%     -2%     
========================================
  Files         116     159     +43     
  Lines        8833   11368   +2535     
========================================
+ Hits         8235   10350   +2115     
- Misses        598    1018    +420     

@carmocca
Contributor

The branch this PR is based on is quite old (November). Can you rebase onto master?

@marrrcin

> Just FYI @justusschock I'm not sure if you found the root cause, but I noticed that if I omit this broadcast, the reproducible code runs for me:
>
> https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/callbacks/model_checkpoint.py#L599

I can confirm that removing this line fixes the issue in my case (with a more advanced setup than in the reproduction steps: a complex model, an optimizer with a custom scheduler, and DDP). I would also like to know more about the root cause.

@justusschock deleted the model_checkpoint_multi_gpu branch on February 24, 2021 at 18:39
@justusschock
Member Author

Will investigate further

Successfully merging this pull request may close these issues:

Training is interrupted without error with MultiGPU