HPC Save Writes Multiple Checkpoints #6204

Closed

Queuecumber opened this issue Feb 25, 2021 · 15 comments
Labels
checkpointing (Related to checkpointing) · environment: slurm · feature (Is an improvement or enhancement) · priority: 1 (Medium priority task) · won't fix (This will not be worked on)

Comments

@Queuecumber
Contributor

🐛 Bug

Currently the hpc_save function (https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L201) doesn't respect the save_last behavior, under which only one checkpoint should be written. As a result, every time the job is preempted (on SLURM) it writes another checkpoint, which recently caused me to run out of disk space.
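For concreteness, here is a minimal sketch of the difference between the two naming behaviors (illustrative only, not the library code; the function and file names are placeholders):

```python
import os

def next_hpc_checkpoint_path(folderpath: str) -> str:
    # Sketch of the current behavior: every preemption writes a new
    # hpc_ckpt_<n>.ckpt, so files accumulate across requeues.
    existing = [f for f in os.listdir(folderpath) if f.startswith("hpc_ckpt_")]
    return os.path.join(folderpath, f"hpc_ckpt_{len(existing) + 1}.ckpt")

def single_hpc_checkpoint_path(folderpath: str) -> str:
    # Sketch of the requested save_last-style behavior: always overwrite
    # a single file, so disk usage stays constant across preemptions.
    return os.path.join(folderpath, "hpc_ckpt.ckpt")
```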

To Reproduce

This can't really be reproduced with the requested BoringModel method; it requires a cluster (I know for sure SLURM reproduces it). Set a short timeout and run: each time the limit is reached, a new checkpoint is written. If this should be controlled separately from the existing save_last flag, another flag could be introduced. It should be an easy fix, and I'd be happy to PR it if the owners agree on the solution.

Expected behavior

Only one checkpoint is written.

Environment

  • CUDA:
    - GPU:
    - available: False
    - version: 10.2
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.7.1
    - pytorch-lightning: 1.2.0
    - tqdm: 4.57.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.8.1
    - version: #1 SMP Thu Jan 21 16:15:07 EST 2021
@Queuecumber Queuecumber added bug Something isn't working help wanted Open to be worked on labels Feb 25, 2021
@tchaton tchaton added the priority: 1 Medium priority task label Mar 1, 2021
@tchaton
Contributor

tchaton commented Mar 1, 2021

Hey @Queuecumber,

We are in the process of setting up a SLURM cluster to help with SLURM-related bugs.
Any chance you could contribute a fix for this issue in the meantime?

Best,
T.C

@carmocca
Contributor

carmocca commented Mar 1, 2021

It seems to me that all of this logic should be integrated into ModelCheckpoint, so that when training is over and on_train_end is called, ModelCheckpoint is the one that does all of this.

For instance, why do we call hpc_save manually here? https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L35
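For context, a rough paraphrase of the checkpoint-then-requeue flow that handler performs; this is a sketch, not the actual library source, and the attribute names are placeholders:

```python
import subprocess

def slurm_sigusr1_handler_sketch(trainer, job_id: str) -> None:
    # Rough paraphrase of the linked handler's flow (placeholder names, not
    # the real implementation): write an HPC checkpoint, then ask SLURM to
    # requeue the job so training resumes after preemption.
    trainer.checkpoint_connector.hpc_save(trainer.default_root_dir, trainer.logger)
    subprocess.call(["scontrol", "requeue", job_id])
```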

@Queuecumber
Contributor Author

Let me know what you decide the best approach is and I can make the PR. Are you sure the model gets checkpointed via on_train_end if the job is preempted?

@carmocca
Contributor

carmocca commented Mar 1, 2021

We have this code to handle a KeyboardInterrupt and call on_train_end to save the state before shutting down

https://github.com/PyTorchLightning/pytorch-lightning/blob/925f082572500a8c3b97e1e8c9d614f6de73b232/pytorch_lightning/trainer/trainer.py#L631-L641

on_train_end calls:

https://github.com/PyTorchLightning/pytorch-lightning/blob/925f082572500a8c3b97e1e8c9d614f6de73b232/pytorch_lightning/trainer/training_loop.py#L132

and that checks ModelCheckpoint:

https://github.com/PyTorchLightning/pytorch-lightning/blob/925f082572500a8c3b97e1e8c9d614f6de73b232/pytorch_lightning/trainer/training_loop.py#L151-L162

Ideally, the SLURMConnector would register handlers for the SLURM signals with the loop so that the shutdown logic is the same.
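Something like the following hypothetical sketch (names and hooks are illustrative, not a concrete API proposal):

```python
import signal

def register_preemption_handler(trainer, train_loop):
    # Hypothetical sketch: route SLURM's pre-termination signal through the
    # same shutdown path used for KeyboardInterrupt, so that ModelCheckpoint
    # decides what gets saved instead of a separate hpc_save call.
    def _handler(signum, frame):
        trainer.interrupted = True   # mirror the KeyboardInterrupt flag
        train_loop.on_train_end()    # runs the ModelCheckpoint shutdown logic once

    # Jobs submitted with --signal typically receive SIGUSR1 before the time limit.
    signal.signal(signal.SIGUSR1, _handler)
```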

@awaelchli, you had a PR to refactor signal handling. Was it to do something like this? What happened to it?

@Queuecumber
Contributor Author

We could just have SLURMConnector call train_loop.on_train_end() for a simple solution

@awaelchli
Contributor

@carmocca Regarding signal handling, I had that a long time ago (#3632). Unfortunately it was not well received and I closed it. There were also some CI problems I was not able to solve.

I plan to revisit it at some point and do it properly. We definitely need it, especially with multiprocessing, which sometimes doesn't shut down properly and leaves ports open, etc., because the signals are not handled correctly, or are handled only in some of the processes while the others become zombies.

@Queuecumber
Contributor Author

Did you have any further thoughts on how to handle this? There's a conference deadline next week, after which I plan to start on PRs for the two issues I filed.

@carmocca
Contributor

carmocca commented Mar 14, 2021

Resolving this will be a bit complex; additionally, we first need to set up SLURM testing so we can make sure everything is OK. For that reason, I don't expect you to work on this.

We could just loop through the existing callbacks and see if any has save_last=True, but this would break some people's workflows, as they expect this SLURM checkpoint to always be created.
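Something along these lines, as an illustrative sketch of that check (not a final implementation):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

def should_write_hpc_checkpoint(trainer) -> bool:
    # Illustrative sketch: skip the extra SLURM checkpoint only if some
    # ModelCheckpoint callback already keeps a "last" checkpoint up to date.
    checkpoint_callbacks = [
        cb for cb in trainer.callbacks if isinstance(cb, ModelCheckpoint)
    ]
    return not any(cb.save_last for cb in checkpoint_callbacks)
```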

I'm re-labeling this as a feature as it is working as expected given the current design.

@carmocca carmocca added feature Is an improvement or enhancement and removed bug Something isn't working help wanted Open to be worked on labels Mar 14, 2021
@Queuecumber
Contributor Author

What about adding something to disable the checkpointing and requeuing? I actually want that anyway because I'm using a SLURM wrapper that already takes care of all that. I also have access to several SLURM clusters I can test on, so I do have the resources to develop this, as long as we can agree on the solution you want.

@carmocca
Contributor

If you want to disable it quick-and-dirty, you could either monkey-patch SlurmConnector.register_slurm_signal_handlers or reset the signal hooks:

https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L26-L27
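A minimal sketch of both workarounds, assuming the module and class names from the file linked above (these are private internals and may change between releases):

```python
import signal

from pytorch_lightning import Trainer
from pytorch_lightning.trainer.connectors import slurm_connector

# Workaround 1: monkey-patch the registration into a no-op before creating the Trainer.
slurm_connector.SLURMConnector.register_slurm_signal_handlers = (
    lambda self, *args, **kwargs: None
)

trainer = Trainer()

# Workaround 2: after the handlers have been registered (e.g. from a callback's
# on_fit_start hook), restore the default signal behavior instead.
signal.signal(signal.SIGUSR1, signal.SIG_DFL)
signal.signal(signal.SIGTERM, signal.SIG_DFL)
```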

@Queuecumber
Contributor Author

That works temporarily, but I was proposing a PR for a supported way of disabling it.

@import-antigravity

If you want to disable it quick-and-dirty, you could either monkey-patch SlurmConnector.register_slurm_signal_handlers or reset the signal hooks:

https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L26-L27

How would you reset the signal hooks here without editing the source?

@Queuecumber
Contributor Author

Queuecumber commented Mar 22, 2021

Resolving this will be a bit complex; additionally, we first need to set up SLURM testing so we can make sure everything is OK. For that reason, I don't expect you to work on this.

We could just loop through the existing callbacks and see if any has save_last=True, but this would break some people's workflows, as they expect this SLURM checkpoint to always be created.

I'm re-labeling this as a feature as it is working as expected given the current design.

I agree this is pretty complicated. There are some complex interactions (or maybe a lack thereof) between the CheckpointConnector and the ModelCheckpoint callback that would need to be worked out; it probably makes sense to keep this logic in one place instead.

@stale

stale bot commented Apr 23, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 23, 2021
@awaelchli awaelchli removed the won't fix This will not be worked on label Apr 23, 2021
@awaelchli awaelchli added this to the v1.4 milestone Apr 23, 2021
@edenlightning edenlightning removed this from the v1.4 milestone Jul 1, 2021
@stale

stale bot commented Jul 31, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jul 31, 2021
@stale stale bot closed this as completed Aug 7, 2021