
[Hotfix] Synchronization upon Checkpointing #6185

Closed
justusschock wants to merge 2 commits into master from model_checkpoint_multi_gpu

Conversation

justusschock
Member

This adds a barrier before checkpointing.

Fixes #5604
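
As background (not part of the PR itself), here is a minimal sketch of the idea using plain torch.distributed rather than Lightning's accelerator API; the function name and setup are illustrative assumptions only:

```python
# Minimal sketch (assumption: plain torch.distributed, not Lightning's accelerator API).
# The barrier keeps all ranks in lockstep around the checkpoint write, so no rank
# races ahead into the next collective call while rank 0 is still saving.
import torch
import torch.distributed as dist

def save_checkpoint_synced(model, path, rank):
    dist.barrier()                      # wait until every rank reaches this point
    if rank == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()                      # ensure the file exists before any rank moves on
```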

@ananthsub (Contributor) left a comment

could you describe why the barrier here fixes it?

@@ -212,6 +212,7 @@ def save_checkpoint(self, trainer, pl_module):
        ):
            return

+       trainer.accelerator.barrier()

Contributor

Nice, can you just add a comment to explain why this is done?

Suggested change
-    trainer.accelerator.barrier()
+    # Synchronize all processes if using distributed
+    trainer.accelerator.barrier()

@SeanNaren
Contributor

Just FYI @justusschock I'm not sure if you found the root cause, but I noticed that if I omit this broadcast, the reproducible code runs for me:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/callbacks/model_checkpoint.py#L599
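
To illustrate why a stray broadcast can make training stop silently (a hedged sketch with hypothetical names, not the actual Lightning code path): collective ops such as broadcast must be called by every rank in the process group, so if control flow lets only some ranks reach the call, those ranks block indefinitely and no error is raised.

```python
# Hypothetical illustration of a mismatched collective (not Lightning code).
# If `should_sync` evaluates differently across ranks, the ranks that do call
# broadcast wait forever for the ranks that skipped it -- training just hangs.
import torch
import torch.distributed as dist

def maybe_sync_best_score(best_score: torch.Tensor, should_sync: bool) -> torch.Tensor:
    if should_sync:                     # must be True on every rank, or the call deadlocks
        dist.broadcast(best_score, src=0)
    return best_score
```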

codecov bot commented Feb 24, 2021

Codecov Report

Merging #6185 (bd50df7) into master (ee35907) will decrease coverage by 2%.
The diff coverage is 96%.

@@           Coverage Diff            @@
##           master   #6185     +/-   ##
========================================
- Coverage      93%     91%     -2%     
========================================
  Files         116     159     +43     
  Lines        8833   11368   +2535     
========================================
+ Hits         8235   10350   +2115     
- Misses        598    1018    +420     

@carmocca
Contributor

The branch this PR is based on is quite old (November). Can you rebase onto master?

@marrrcin

> Just FYI @justusschock I'm not sure if you found the root cause, but I noticed that if I omit this broadcast, the reproducible code runs for me:
>
> https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/callbacks/model_checkpoint.py#L599

I can confirm that removing this line fixes the issue in my case (with a more advanced setup than in the reproduction steps: a complex model, an optimizer with a custom scheduler, and DDP). I would also like to know more about the root cause.

@justusschock deleted the model_checkpoint_multi_gpu branch on February 24, 2021 at 18:39
@justusschock
Member Author

Will investigate further

Successfully merging this pull request may close these issues:

Training is interrupted without error with MultiGPU