HPC Save Writes Multiple Checkpoints #6204
Comments
Hey @Queuecumber, we are in the process of setting up a SLURM cluster to help with SLURM-related bugs. Best,
It seems to me like all this logic should be integrated into … Like, why do we …
LMK what you guys decide the best approach is and I can make the PR. Are you sure that the model gets checkpointed using …
We have this code to handle a … and that checks … Ideally, the …

@awaelchli, you had a PR to refactor signal handling. Was it to do something like this? What happened to it?
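For readers following along, here is a rough sketch of the SLURM auto-requeue pattern being discussed: a handler for the signal SLURM sends before the time limit saves an HPC checkpoint and then requeues the job. This is an illustration only, not Lightning's actual implementation; `install_slurm_requeue_handler` is a hypothetical helper, and the `trainer.checkpoint_connector.hpc_save(...)` call assumes the 1.2.x internals linked later in this issue.

```python
import os
import signal
import subprocess

def install_slurm_requeue_handler(trainer):
    """Hypothetical helper: save once and requeue when SLURM signals preemption."""

    def _handler(signum, frame):
        # hpc_save is the internal hook discussed in this issue; today it writes a
        # new hpc_ckpt_<n>.ckpt on every preemption instead of overwriting one file.
        trainer.checkpoint_connector.hpc_save(trainer.default_root_dir, trainer.logger)
        # Ask SLURM to put the job back in the queue so training resumes later.
        subprocess.call(["scontrol", "requeue", os.environ["SLURM_JOB_ID"]])

    # SLURM is commonly configured to send SIGUSR1 shortly before the time limit.
    signal.signal(signal.SIGUSR1, _handler)
```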
We could just have …
@carmocca Regarding signal handling, I had that a long time ago (#3632). It was not well received unfortunately, and I closed it. There were also some problems I was not able to solve with the CI. I plan to revisit it at some point and do it properly. We definitely need it, especially with multiprocessing, which sometimes doesn't shut down properly and leaves ports open etc., because the signals are not correctly handled, or only handled in some of the processes while others become zombies.
Did you guys have any further thoughts on how to handle this? There's a conference deadline next week; after that I plan to start on PRs for the two issues I filed.
Resolving this will be a bit complex; additionally, we need to first set up SLURM testing so we can make sure everything is OK. For this reason I don't expect you to work on this. We could just loop through the existing callbacks and see if any has …

I'm re-labeling this as a feature, as it is working as expected given the current design.
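A hedged sketch of the "loop through the existing callbacks" idea, assuming pytorch-lightning 1.2.x attribute names (`trainer.callbacks`, `ModelCheckpoint.save_last`); `should_skip_hpc_save` is a hypothetical helper, not an existing API:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

def should_skip_hpc_save(trainer) -> bool:
    # If any configured ModelCheckpoint already keeps a rolling "last" checkpoint,
    # the extra hpc checkpoint written on preemption could be skipped.
    return any(
        isinstance(cb, ModelCheckpoint) and cb.save_last
        for cb in trainer.callbacks
    )
```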
What about something to disable the checkpointing and requeuing? I actually want that anyway because I'm using a SLURM wrapper that takes care of all that already. Also, I do have access to several SLURM clusters I can test this on, so I have the resources to develop this as long as we can agree on the solution you guys want.
If you want to disable it quick-n-dirty, you could either monkey-patch the …
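For example, a minimal sketch of the quick-n-dirty approach, assuming the pytorch-lightning 1.2.x internals where `CheckpointConnector.hpc_save` lives in `pytorch_lightning/trainer/connectors/checkpoint_connector.py` (the file linked in this issue) and takes `(folderpath, logger)`; these are private APIs and may change between releases:

```python
from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector

def _skip_hpc_save(self, folderpath, logger):
    # Do nothing on SLURM preemption; rely on ModelCheckpoint(save_last=True) instead.
    return None

# Monkey-patch the internal hook so no extra hpc_ckpt_*.ckpt files accumulate.
CheckpointConnector.hpc_save = _skip_hpc_save
```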
That works temporarily, but I was proposing a PR for a supported way of disabling it.
How would you reset the signal hooks here without editing the source?
I agree this is pretty complicated; there are some complex interactions (or maybe lack thereof) between the …
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
🐛 Bug
Currently the `hpc_save` function (https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L201) doesn't respect the behavior of `save_last`, which implies that only one checkpoint should be written. This means that every time the job is preempted (on SLURM) it writes another checkpoint, which recently caused me to run out of disk space.

To Reproduce

This can't exactly be reproduced using the requested BoringModel method; it requires a cluster (I know for sure SLURM will repro this). Set a short time limit and run: each time the limit is reached, a new checkpoint is written. If this should be controlled separately from the existing `save_last` flag, then another flag should be introduced. This should be an easy fix and I'd be happy to PR it if the owners are agreeable to the solution.

Expected behavior
Only one checkpoint is written.
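For contrast, here is a minimal sketch of the single-file behavior the regular checkpoint callback provides with `save_last`, using pytorch-lightning 1.2.x arguments (the model and data are omitted; `checkpoints/` is an arbitrary example path):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# save_last keeps one rolling last.ckpt that is overwritten on each save,
# rather than accumulating a new file per save.
checkpoint_cb = ModelCheckpoint(dirpath="checkpoints/", save_last=True)
trainer = Trainer(callbacks=[checkpoint_cb], max_epochs=10)
# trainer.fit(model)  # under SLURM preemption, hpc_save currently adds
#                     # hpc_ckpt_1.ckpt, hpc_ckpt_2.ckpt, ... alongside last.ckpt
```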
Environment
- GPU:
- available: False
- version: 10.2
- numpy: 1.20.1
- pyTorch_debug: False
- pyTorch_version: 1.7.1
- pytorch-lightning: 1.2.0
- tqdm: 4.57.0
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.1
- version: #1 SMP Thu Jan 21 16:15:07 EST 2021