HPC Save Writes Multiple Checkpoints #6204

Closed

Queuecumber opened this issue Feb 25, 2021 · 15 comments
Labels
checkpointing (Related to checkpointing) · environment: slurm · feature (Is an improvement or enhancement) · priority: 1 (Medium priority task) · won't fix (This will not be worked on)

Comments

@Queuecumber
Contributor

🐛 Bug

Currently the hpc_save function (https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L201) doesn't respect the save_last behavior, under which only one checkpoint should be written. As a result, every time the job is preempted (on SLURM) it writes another checkpoint, which recently caused me to run out of disk space.
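For concreteness, here is a minimal sketch of the difference between the two naming behaviors (illustrative only, not the library code; the function and file names are placeholders):

```python
import os

def next_hpc_checkpoint_path(folderpath: str) -> str:
    # Sketch of the current behavior: every preemption writes a new
    # hpc_ckpt_<n>.ckpt, so files accumulate across requeues.
    existing = [f for f in os.listdir(folderpath) if f.startswith("hpc_ckpt_")]
    return os.path.join(folderpath, f"hpc_ckpt_{len(existing) + 1}.ckpt")

def single_hpc_checkpoint_path(folderpath: str) -> str:
    # Sketch of the requested save_last-style behavior: always overwrite
    # a single file, so disk usage stays constant across preemptions.
    return os.path.join(folderpath, "hpc_ckpt.ckpt")
```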

To Reproduce

This can't really be reproduced with the requested BoringModel method; it requires a cluster (I know for sure SLURM reproduces it). Set a short timeout and run: each time the limit is reached, a new checkpoint is written. If this should be controlled separately from the existing save_last flag, another flag could be introduced. It should be an easy fix, and I'd be happy to PR it if the owners agree on the solution.

Expected behavior

Only one checkpoint is written.

Environment

  • CUDA:
    - GPU:
    - available: False
    - version: 10.2
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.7.1
    - pytorch-lightning: 1.2.0
    - tqdm: 4.57.0
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.8.1
    - version: #1 SMP Thu Jan 21 16:15:07 EST 2021
@Queuecumber Queuecumber added bug Something isn't working help wanted Open to be worked on labels Feb 25, 2021
@tchaton tchaton added the priority: 1 Medium priority task label Mar 1, 2021
@tchaton
Contributor

tchaton commented Mar 1, 2021

Hey @Queuecumber,

We are in the process of setting up a SLURM cluster to help with SLURM-related bugs.
Any chance you could contribute a fix for this issue in the meantime?

Best,
T.C

@carmocca
Contributor

carmocca commented Mar 1, 2021

It seems to me that all of this logic should be integrated into ModelCheckpoint, so that when training is over and on_train_end is called, ModelCheckpoint is the one that does all of this.

For instance, why do we call hpc_save manually here? https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L35
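For context, a rough paraphrase of the checkpoint-then-requeue flow that handler performs; this is a sketch, not the actual library source, and the attribute names are placeholders:

```python
import subprocess

def slurm_sigusr1_handler_sketch(trainer, job_id: str) -> None:
    # Rough paraphrase of the linked handler's flow (placeholder names, not
    # the real implementation): write an HPC checkpoint, then ask SLURM to
    # requeue the job so training resumes after preemption.
    trainer.checkpoint_connector.hpc_save(trainer.default_root_dir, trainer.logger)
    subprocess.call(["scontrol", "requeue", job_id])
```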

@Queuecumber
Contributor Author

Let me know what you decide the best approach is and I can make the PR. Are you sure the model gets checkpointed via on_train_end if the job is preempted?

@carmocca
Contributor

carmocca commented Mar 1, 2021

We have this code to handle a KeyboardInterrupt and call on_train_end to save the state before shutting down

https://github.com/PyTorchLightning/pytorch-lightning/blob/925f082572500a8c3b97e1e8c9d614f6de73b232/pytorch_lightning/trainer/trainer.py#L631-L641

on_train_end calls:

https://github.com/PyTorchLightning/pytorch-lightning/blob/925f082572500a8c3b97e1e8c9d614f6de73b232/pytorch_lightning/trainer/training_loop.py#L132

and that checks ModelCheckpoint:

https://github.com/PyTorchLightning/pytorch-lightning/blob/925f082572500a8c3b97e1e8c9d614f6de73b232/pytorch_lightning/trainer/training_loop.py#L151-L162

Ideally, the SLURMConnector would register handlers for the SLURM signals with the loop so that the shutdown logic is the same.
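Something like the following hypothetical sketch (names and hooks are illustrative, not a concrete API proposal):

```python
import signal

def register_preemption_handler(trainer, train_loop):
    # Hypothetical sketch: route SLURM's pre-termination signal through the
    # same shutdown path used for KeyboardInterrupt, so that ModelCheckpoint
    # decides what gets saved instead of a separate hpc_save call.
    def _handler(signum, frame):
        trainer.interrupted = True   # mirror the KeyboardInterrupt flag
        train_loop.on_train_end()    # runs the ModelCheckpoint shutdown logic once

    # Jobs submitted with --signal typically receive SIGUSR1 before the time limit.
    signal.signal(signal.SIGUSR1, _handler)
```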

@awaelchli, you had a PR to refactor signal handling. Was it to do something like this? What happened to it?

@Queuecumber
Contributor Author

We could just have SLURMConnector call train_loop.on_train_end() for a simple solution

@awaelchli
Contributor

@carmocca Regarding signal handling, I had that a long time ago (#3632). Unfortunately it was not well received and I closed it. There were also some CI problems I was not able to solve.

I plan to revisit it at some point and do it properly. We definitely need it, especially with multiprocessing, which sometimes doesn't shut down properly and leaves ports open, etc., because the signals are not handled correctly, or are handled only in some of the processes while the others become zombies.

@Queuecumber
Contributor Author

Did you have any further thoughts on how to handle this? There's a conference deadline next week, after which I plan to start on PRs for the two issues I filed.

@carmocca
Contributor

carmocca commented Mar 14, 2021

Resolving this will be a bit complex; additionally, we first need to set up SLURM testing so we can make sure everything is OK. For that reason, I don't expect you to work on this.

We could just loop through the existing callbacks and see if any has save_last=True, but this would break some people's workflows, as they expect this SLURM checkpoint to always be created.
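Something along these lines, as an illustrative sketch of that check (not a final implementation):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

def should_write_hpc_checkpoint(trainer) -> bool:
    # Illustrative sketch: skip the extra SLURM checkpoint only if some
    # ModelCheckpoint callback already keeps a "last" checkpoint up to date.
    checkpoint_callbacks = [
        cb for cb in trainer.callbacks if isinstance(cb, ModelCheckpoint)
    ]
    return not any(cb.save_last for cb in checkpoint_callbacks)
```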

I'm re-labeling this as a feature as it is working as expected given the current design.

@carmocca carmocca added feature Is an improvement or enhancement and removed bug Something isn't working help wanted Open to be worked on labels Mar 14, 2021
@Queuecumber
Contributor Author

What about adding something to disable the checkpointing and requeuing? I actually want that anyway because I'm using a SLURM wrapper that already takes care of all that. I also have access to several SLURM clusters I can test on, so I do have the resources to develop this, as long as we can agree on the solution you want.

@carmocca
Contributor

If you want to disable it quick-and-dirty, you could either monkey-patch SlurmConnector.register_slurm_signal_handlers or reset the signal hooks:

https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L26-L27
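A minimal sketch of both workarounds, assuming the module and class names from the file linked above (these are private internals and may change between releases):

```python
import signal

from pytorch_lightning import Trainer
from pytorch_lightning.trainer.connectors import slurm_connector

# Workaround 1: monkey-patch the registration into a no-op before creating the Trainer.
slurm_connector.SLURMConnector.register_slurm_signal_handlers = (
    lambda self, *args, **kwargs: None
)

trainer = Trainer()

# Workaround 2: after the handlers have been registered (e.g. from a callback's
# on_fit_start hook), restore the default signal behavior instead.
signal.signal(signal.SIGUSR1, signal.SIG_DFL)
signal.signal(signal.SIGTERM, signal.SIG_DFL)
```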

@Queuecumber
Contributor Author

That works temporarily, but I was proposing a PR for a supported way of disabling it.

@import-antigravity

If you want to disable it quick-and-dirty, you could either monkey-patch SlurmConnector.register_slurm_signal_handlers or reset the signal hooks:

https://github.com/PyTorchLightning/pytorch-lightning/blob/6e8721e7ae881cc54ec1f6580d85eb95507861e5/pytorch_lightning/trainer/connectors/slurm_connector.py#L26-L27

How would you reset the signal hooks here without editing the source?

@Queuecumber
Contributor Author

Queuecumber commented Mar 22, 2021

Resolving this will be a bit complex; additionally, we first need to set up SLURM testing so we can make sure everything is OK. For that reason, I don't expect you to work on this.

We could just loop through the existing callbacks and see if any has save_last=True, but this would break some people's workflows, as they expect this SLURM checkpoint to always be created.

I'm re-labeling this as a feature as it is working as expected given the current design.

I agree this is pretty complicated. There are some complex interactions (or maybe a lack thereof) between the CheckpointConnector and the ModelCheckpoint callback that would need to be worked out; it probably makes sense to keep this logic in one place instead.

@stale

stale bot commented Apr 23, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 23, 2021
@awaelchli awaelchli removed the won't fix This will not be worked on label Apr 23, 2021
@awaelchli awaelchli added this to the v1.4 milestone Apr 23, 2021
@edenlightning edenlightning removed this from the v1.4 milestone Jul 1, 2021
@stale

stale bot commented Jul 31, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jul 31, 2021
@stale stale bot closed this as completed Aug 7, 2021