
Disable automatic SLURM Detection #6389

Closed
amogkam opened this issue Mar 7, 2021 · 36 comments · Fixed by #10601
Labels: design (Includes a design discussion), environment: slurm, feature (Is an improvement or enhancement), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@amogkam
Contributor

amogkam commented Mar 7, 2021

🚀 Feature

I want to run my PTL script on a SLURM cluster, but I don't want to use the built-in SlurmConnector since I am using Ray to handle distributed training. Using Ray together with the SlurmConnector results in this error: #3651. Is there a way I can disable automatic SLURM detection? Having an environment variable to disable this would really help. Thanks!

@amogkam amogkam added the "feature" and "help wanted" labels Mar 7, 2021
@import-antigravity

I think this is more of a bug fix than an enhancement.

@carmocca
Contributor

Sort of a duplicate of #6204

@import-antigravity

import-antigravity commented Mar 15, 2021

> Sort of a duplicate of #6204

Does writing multiple checkpoints cause the "ValueError: signal only works in main thread" error? Sorry, I have only a passing familiarity with SLURM, so I can't really speak intelligently about this. I saw the checkpoints mentioned in the stack trace in #3651.

@carmocca
Contributor

carmocca commented Mar 15, 2021

I said "sort of" because the linked issue's comments also ask for a mechanism to disable SLURM detection, even though the original post is about the connection between HPC checkpoints and ModelCheckpoint.

@Queuecumber
Contributor

I'm happy to implement that, although I've since come around on the automatic SLURM detection (I think the right thing for me is to disable my other library's automatic detection). The simplest way forward is probably to add a disable_hpc_detection flag to the trainer which is False by default.

@import-antigravity

> The simplest way forward is probably to add a disable_hpc_detection flag to the trainer which is False by default.

That seems like the best course of action.

@carmocca carmocca added the "design" and "environment: slurm" labels Mar 24, 2021
@stale

stale bot commented Apr 24, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Apr 24, 2021
@stale stale bot closed this as completed May 1, 2021
@import-antigravity

Could someone re-open this? This is still a problem

@carmocca carmocca reopened this Aug 2, 2021
@stale stale bot removed the "won't fix" label Aug 2, 2021
@import-antigravity

What was the original intent of the SLURM detection? Would it make sense to maybe add a Trainer flag to disable it?

@stale

stale bot commented Sep 3, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Sep 3, 2021
@import-antigravity

Does someone want to make a PR for this? I suppose I could

@stale stale bot removed the "won't fix" label Sep 3, 2021
@stale

stale bot commented Oct 4, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Oct 4, 2021
@tchaton
Contributor

tchaton commented Oct 6, 2021

Hey @amogkam,

In PyTorch Lightning master, the SlurmConnector was refactored into a SignalConnector component.

As a temporary solution, you could patch its register_signal_handlers to prevent the signals from being registered.

Best,
T.C
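For reference, a minimal sketch of that temporary patch, assuming the internal path pytorch_lightning/trainer/connectors/signal_connector.py linked later in this thread (internal APIs like this can move between releases):

from pytorch_lightning import Trainer
from pytorch_lightning.trainer.connectors.signal_connector import SignalConnector

# No-op the registration so no SLURM signal handlers get installed;
# Ray (or another launcher) keeps full control of signal handling.
SignalConnector.register_signal_handlers = lambda self: None

trainer = Trainer()  # usual Trainer arguments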

@stale stale bot removed the "won't fix" label Oct 6, 2021
@tchaton tchaton added the "priority: 1" label Oct 6, 2021
@stale

stale bot commented Nov 6, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Nov 6, 2021
@amogkam
Contributor Author

amogkam commented Nov 11, 2021

@tchaton instead of having to patch it, is it possible to add a flag to disable it?

@tchaton tchaton removed the "priority: 1" label Nov 15, 2021
@ananthsub
Contributor

ananthsub commented Nov 17, 2021

We have users facing this issue as well. Namely, these users use https://github.com/facebookincubator/submitit. There is contention between SubmitIt and Lightning over resubmitting jobs to the scheduler.

My opinion is the Lightning library should not be in the business of requeuing jobs like this.

  1. The support for doing this with SLURM is ad hoc. It likely predates tools like submitit, going back to when Will was running jobs on the FAIR cluster. My guess is Lightning wouldn't have built in resubmission if SubmitIt had already existed.
  2. Lightning doesn't do this for any other clusters. Why is SLURM special-cased? Is Lightning saying that it will natively support resubmission for every other scheduler users want to integrate with? That won't scale to the number of schedulers and resubmission scenarios that can exist.
  3. Lightning runs in the context of a larger user application. The resubmission should happen at the application-level, not within the library like this.

@carmocca
Contributor

carmocca commented Nov 17, 2021

I think we should keep the feature, as it does seem to be used by the community (impression I've got from Slack questions and other issues).
We could recommend other tools, but it would look cheap for us to remove everything and tell people to install another package when the functionality is already available.

However, I think there's value in revamping the SLURM mechanism so that it does not leak into other components and can be easily switched on/off.

This would be similar in spirit to the addition of the LightningCLI (componentized, separated into a utility), even though we already had argparse-enabled components (from_argparse_args) which do leak into your Trainer/LightningModule/DataModule code.

@marksibrahim

marksibrahim commented Nov 17, 2021

+1 for adding an option to disable SLURM signal handling. I would actually vote for this feature to be opt-in rather than on by default, since this conflicts with how other packages such as Submitit handle pre-emption and requeuing. This has made it challenging to debug the source of issues around pre-emption.

@awaelchli
Contributor

awaelchli commented Nov 17, 2021

What if we let the cluster plugins configure signals? Would that help? This is a quick thought without much investigation.

@ananthsub
Contributor

ananthsub commented Nov 17, 2021

@awaelchli @carmocca since this support for resubmission is only for SLURM, we could either:

  1. Add a flag to the SLURMEnvironment class like auto_requeue. To preserve backward compatibility, this is by default True. Users would need to specify Trainer(..., plugins=[SLURMEnvironment(auto_requeue=False)], ...) in their code.

  2. Or support an environment variable like PL_SLURM_ENABLE_AUTO_REQUEUE

Whichever approach above is taken, the flag would need to be checked here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ff3443fe42e9ad03e0604e50b0ae53f27ac2faac/pytorch_lightning/trainer/connectors/signal_connector.py#L52-L81

for whether we register the slurm_sigusr1_handler_fn or only disable job requeueing here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ff3443fe42e9ad03e0604e50b0ae53f27ac2faac/pytorch_lightning/trainer/connectors/signal_connector.py#L58-L81

Following this addition, we could determine whether to make this opt-in or opt-out.
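(To make the two options concrete, here is a rough sketch of where such a gate could live. This is illustrative only, not Lightning's actual implementation; the helper name and the exact env-var semantics are assumptions.)

import os
import signal

def maybe_register_slurm_sigusr1(auto_requeue: bool, handler) -> None:
    # Option 1: a flag on SLURMEnvironment (auto_requeue).
    # Option 2: an environment variable such as PL_SLURM_ENABLE_AUTO_REQUEUE.
    env_enabled = os.environ.get("PL_SLURM_ENABLE_AUTO_REQUEUE", "1") != "0"
    if not (auto_requeue and env_enabled):
        return  # skip registration; requeuing is left to the outer application
    signal.signal(signal.SIGUSR1, handler)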

@awaelchli
Contributor

Could the SLURM users comment on whether they prefer opt-in or opt-out, i.e., default auto_requeue=False vs True? @amogkam @import-antigravity @Queuecumber @marksibrahim

@Queuecumber
Contributor

Opt-out, but that's mostly because I changed my submitit workflow to not requeue, since Lightning was doing it.

@marksibrahim

Opt-out (auto_requeue=False) for me as well. I'd prefer to handle launching / requeuing outside of Lightning.

@Queuecumber
Contributor

OK, so now I'm a little confused: I was reading opt-out as auto_requeue=True by default (i.e., you need to opt out of requeuing).

@awaelchli
Contributor

Yes @Queuecumber that's right.

In summary:

  1. opt-out would be the explicit action of the user to not choose auto-requeue, i.e., they would have to set auto_requeue=False if they don't want it. Otherwise no action required (default is True).
  2. opt-in would mean the user would explicitly have to set auto_requeue=True if they want it. Otherwise no action is required (default is False).

@marksibrahim

That was my confusion, my apologies. I prefer auto_requeue=False to be the default. Both options are fine though, as long as there's a clear flag for the behavior.

@awaelchli
Contributor

Opened PR #10601. Feel free to test or review it :)

@awaelchli
Contributor

For now we went with the non-breaking choice of keeping auto_requeue=True as the default. To turn it off, simply do

Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])

(for use with submitit for example).

If we get more feedback from the PL SLURM community about switching to auto_requeue=False by default, we can make that decision then. We will keep an eye out for SLURM users asking questions, e.g. on Slack, and gather their feedback on this topic.

If we were to flip the auto_requeue default, we would most likely add a warning in PL v1.X that the default is about to change in v1.(X+2), with v1.(X+2) being the breaking release.
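(A self-contained version of the snippet above, assuming a Lightning release that includes #10601:)

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# SLURM is still detected and used for process setup, but Lightning will not
# requeue the job itself on SIGUSR1; submitit (or your own scheduler workflow)
# stays in charge of resubmission.
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])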

@DM-Berger

I also had this issue on a Compute Canada cluster, where Lightning (version 1.8.1) by default tries to use invalid SLURM arguments (e.g. --ntasks instead of --ntasks-per-node) and thus throws an error in interactive or non-interactive jobs. I was able to resolve the issue by subclassing the default SLURMEnvironment and overriding all checks to make it look like no SLURM is available:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

class DisabledSLURMEnvironment(SLURMEnvironment):
    @staticmethod
    def detect() -> bool:
        # Pretend SLURM is not available so Lightning skips its SLURM logic.
        return False

    @staticmethod
    def _validate_srun_used() -> None:
        return

    @staticmethod
    def _validate_srun_variables() -> None:
        return

# ...

trainer = Trainer(
    accelerator="cpu",
    plugins=[DisabledSLURMEnvironment(auto_requeue=False)],
    **kwargs,
)

This seemed to result in Lightning running as it does on my local testing machine.

@MewmewWho

Thanks @DM-Berger! I tried to run my training script within a Jupyter server terminal on the University of Arizona's HPC and experienced exactly the same issue. With your solution, I just successfully ran my training. It seems good for now, and I'll comment further if any errors come up.

@awaelchli
Contributor

Hey @DM-Berger and @MewmewWho
I recently documented how to use Lightning with "interactive" mode here: https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#interactive-mode

When you request the SLURM machine, just set the job name to "bash" and then Lightning will bypass the SLURM detection. Then you don't need to make code changes.

@fraolBatole

Thanks, @DM-Berger. I also faced a similar error (RuntimeError: You set --ntasks=8 in your SLURM bash script, but this variable is not supported. HINT: Use --ntasks-per-node=8 instead.), but it took me a while to find this page. I'm glad I finally found it, as it was very helpful in resolving the issue.

@SavvaI

SavvaI commented May 21, 2023

> Hey @DM-Berger and @MewmewWho I recently documented how to use Lightning with "interactive" mode here: https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#interactive-mode
>
> When you request the SLURM machine, just set the job name to "bash" and then Lightning will bypass the SLURM detection. Then you don't need to make code changes.

Since which PL version is this feature available? I tried it on 1.6.0 and it does not change anything.

@nikvaessen
Contributor

> I also had this issue on a Compute Canada cluster where Lightning (version 1.8.1) by default tries to use invalid SLURM arguments [...] I was able to resolve the issue by subclassing the default SLURMEnvironment and overriding all checks to make it look like no SLURM is available: [quoting the DisabledSLURMEnvironment snippet above]

This doesn't work (anymore), as the original SLURMEnvironment class is used directly in

https://github.com/Lightning-AI/pytorch-lightning/blob/1439da41b25a1efac581db35433e19e9d2b736d2/src/lightning/pytorch/trainer/connectors/accelerator_connector.py#L395C1

The solution is to monkey-patch the detect method of SLURMEnvironment:

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

# Make detection always report that no SLURM cluster is present.
SLURMEnvironment.detect = lambda: False
trainer = Trainer(...)  # your usual Trainer arguments

@shahin-trunk

You can also let SLURM schedule a machine for you and then log in to the machine to run scripts manually. This is useful for development and debugging. If you set the job name to bash or interactive, and then log in and run your scripts, Lightning's SLURM auto-detection will be bypassed and Lightning can launch processes normally.
