
Disable automatic SLURM Detection #6389

Closed
amogkam opened this issue Mar 7, 2021 · 36 comments · Fixed by #10601
Labels: design (Includes a design discussion), environment: slurm, feature (Is an improvement or enhancement), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

@amogkam
Contributor

amogkam commented Mar 7, 2021

🚀 Feature

I want to run my PTL script on a SLURM cluster, but I don't want to use the built-in SlurmConnector since I am using Ray to handle distributed training. Using Ray together with the SlurmConnector results in this error: #3651. Is there a way I can disable automatic SLURM detection? Having an environment variable to disable this would really help. Thanks!

@amogkam amogkam added the "feature" and "help wanted" labels Mar 7, 2021
@import-antigravity

I think this is more of a bug fix than an enhancement.

@carmocca
Contributor

Sort of a duplicate of #6204

@import-antigravity

import-antigravity commented Mar 15, 2021

> Sort of a duplicate of #6204

Does writing multiple checkpoints cause the "ValueError: signal only works in main thread" error? Sorry, I have only a passing familiarity with SLURM, so I can't really speak intelligently about this. I saw the checkpoints mentioned in the stack trace in #3651.

@carmocca
Contributor

carmocca commented Mar 15, 2021

I said "sort of" because the linked issue's comments also ask for a mechanism to disable SLURM detection, even though the original post is about the connection between HPC checkpoints and ModelCheckpoint.

@Queuecumber
Contributor

I'm happy to implement that, although I've since come around on the automatic SLURM detection (I think the right thing for me is to disable my other library's automatic detection). The simplest way forward is probably to add a disable_hpc_detection flag to the trainer which is False by default.

@import-antigravity

> The simplest way forward is probably to add a disable_hpc_detection flag to the trainer which is False by default.

That seems like the best course of action.

@carmocca carmocca added the "design" and "environment: slurm" labels Mar 24, 2021
@stale

stale bot commented Apr 24, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Apr 24, 2021
@stale stale bot closed this as completed May 1, 2021
@import-antigravity

Could someone re-open this? This is still a problem

@carmocca carmocca reopened this Aug 2, 2021
@stale stale bot removed the "won't fix" label Aug 2, 2021
@import-antigravity

What was the original intent of the SLURM detection? Would it make sense to maybe add a Trainer flag to disable it?

@stale

stale bot commented Sep 3, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Sep 3, 2021
@import-antigravity

Does someone want to make a PR for this? I suppose I could

@stale stale bot removed the "won't fix" label Sep 3, 2021
@stale

stale bot commented Oct 4, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Oct 4, 2021
@tchaton
Contributor

tchaton commented Oct 6, 2021

Hey @amogkam,

In PyTorch Lightning master, the SlurmConnector was refactored into a SignalConnector component.

As a temporary solution, you could patch its register_signal_handlers to prevent the signals from being registered.

Best,
T.C
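For reference, a minimal sketch of that temporary patch, assuming the internal path pytorch_lightning/trainer/connectors/signal_connector.py linked later in this thread (internal APIs like this can move between releases):

from pytorch_lightning import Trainer
from pytorch_lightning.trainer.connectors.signal_connector import SignalConnector

# No-op the registration so no SLURM signal handlers get installed;
# Ray (or another launcher) keeps full control of signal handling.
SignalConnector.register_signal_handlers = lambda self: None

trainer = Trainer()  # usual Trainer arguments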

@stale stale bot removed the "won't fix" label Oct 6, 2021
@tchaton tchaton added the "priority: 1" label Oct 6, 2021
@stale

stale bot commented Nov 6, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Nov 6, 2021
@amogkam
Contributor Author

amogkam commented Nov 11, 2021

@tchaton instead of having to patch it, is it possible to add a flag to disable it?

@tchaton tchaton removed the "priority: 1" label Nov 15, 2021
@ananthsub
Contributor

ananthsub commented Nov 17, 2021

We have users facing this issue as well. Namely, these users use https://github.com/facebookincubator/submitit. There is contention between SubmitIt and Lightning over resubmitting jobs to the scheduler.

My opinion is the Lightning library should not be in the business of requeuing jobs like this.

  1. The support for doing this with SLURM is ad hoc. It likely predates tools like submitit, going back to when Will was running jobs on the FAIR cluster. My guess is Lightning wouldn't have built in resubmission if SubmitIt had already existed.
  2. Lightning doesn't do this for any other clusters. Why is SLURM special-cased? Is Lightning saying that it will natively support resubmission for every other scheduler users want to integrate with? That won't scale to the number of schedulers and resubmission scenarios that can exist.
  3. Lightning runs in the context of a larger user application. The resubmission should happen at the application-level, not within the library like this.

@carmocca
Contributor

carmocca commented Nov 17, 2021

I think we should keep the feature, as it does seem to be used by the community (impression I've got from Slack questions and other issues).
We could recommend other tools, but it would look cheap for us to remove everything and tell people to install another package when the functionality is already available.

However, I think there's value in revamping the SLURM mechanism so that it does not leak into other components and can be easily switched on/off.

This would be similar in spirit to the addition of the LightningCLI (componentized, separated into a utility), even though we already had argparse-enabled components (from_argparse_args) which do leak into your Trainer/LightningModule/DataModule code.

@marksibrahim

marksibrahim commented Nov 17, 2021

+1 for adding an option to disable SLURM signal handling. I would actually vote for this feature to be opt-in rather than on by default, since this conflicts with how other packages such as Submitit handle pre-emption and requeuing. This has made it challenging to debug the source of issues around pre-emption.

@awaelchli
Contributor

awaelchli commented Nov 17, 2021

What if we let the cluster plugins configure signals? Would that help? This is a quick thought without much investigation.

@ananthsub
Contributor

ananthsub commented Nov 17, 2021

@awaelchli @carmocca since this support for resubmission is only for SLURM, we could either:

  1. Add a flag to the SLURMEnvironment class like auto_requeue. To preserve backward compatibility, this is by default True. Users would need to specify Trainer(..., plugins=[SLURMEnvironment(auto_requeue=False)], ...) in their code.

  2. Or support an environment variable like PL_SLURM_ENABLE_AUTO_REQUEUE

Whichever approach above is taken, the flag would need to be checked here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ff3443fe42e9ad03e0604e50b0ae53f27ac2faac/pytorch_lightning/trainer/connectors/signal_connector.py#L52-L81

for whether we register the slurm_sigusr1_handler_fn or only disable job requeueing here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ff3443fe42e9ad03e0604e50b0ae53f27ac2faac/pytorch_lightning/trainer/connectors/signal_connector.py#L58-L81

Following this addition, we could determine whether to make this opt-in or opt-out.
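(To make the two options concrete, here is a rough sketch of where such a gate could live. This is illustrative only, not Lightning's actual implementation; the helper name and the exact env-var semantics are assumptions.)

import os
import signal

def maybe_register_slurm_sigusr1(auto_requeue: bool, handler) -> None:
    # Option 1: a flag on SLURMEnvironment (auto_requeue).
    # Option 2: an environment variable such as PL_SLURM_ENABLE_AUTO_REQUEUE.
    env_enabled = os.environ.get("PL_SLURM_ENABLE_AUTO_REQUEUE", "1") != "0"
    if not (auto_requeue and env_enabled):
        return  # skip registration; requeuing is left to the outer application
    signal.signal(signal.SIGUSR1, handler)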

@awaelchli
Contributor

Could the SLURM users comment on whether they prefer opt-in or opt-out, i.e., default auto_requeue=False vs True? @amogkam @import-antigravity @Queuecumber @marksibrahim

@Queuecumber
Contributor

Opt-out, but that's mostly because I changed my submitit workflow to not requeue, since Lightning was doing it.

@marksibrahim

Opt-out (auto_requeue=False) for me as well. I'd prefer to handle launching / requeuing outside of Lightning.

@Queuecumber
Contributor

OK, so now I'm a little confused: I was reading opt-out as auto_requeue=True by default (i.e., you need to opt out of requeuing).

@awaelchli
Contributor

Yes @Queuecumber that's right.

In summary:

  1. opt-out would be the explicit action of the user to not choose auto-requeue, i.e., they would have to set auto_requeue=False if they don't want it. Otherwise no action required (default is True).
  2. opt-in would mean the user would explicitly have to set auto_requeue=True if they want it. Otherwise no action is required (default is False).

@marksibrahim

That was my confusion, my apologies. I prefer auto_requeue=False to be the default. Both options are fine though, as long as there's a clear flag for the behavior.

@awaelchli
Contributor

Opened PR #10601. Feel free to test or review it :)

@awaelchli
Contributor

For now we went with the non-breaking choice of keeping auto_requeue=True as the default. To turn it off, simply do

Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])

(for use with submitit for example).

If we get more feedback from the PL SLURM community about switching to auto_requeue=False by default, we can make that decision then. We will keep an eye out for SLURM users asking questions, e.g. on Slack, and gather their feedback on this topic.

If we were to flip the auto_requeue default, we would most likely add a warning in PL v1.X that the default is about to change in v1.(X+2), with v1.(X+2) being the breaking release.
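(A self-contained version of the snippet above, assuming a Lightning release that includes #10601:)

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# SLURM is still detected and used for process setup, but Lightning will not
# requeue the job itself on SIGUSR1; submitit (or your own scheduler workflow)
# stays in charge of resubmission.
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])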

@DM-Berger

I also had this issue on a Compute Canada cluster, where Lightning (version 1.8.1) by default tries to use invalid SLURM arguments (e.g. --ntasks instead of --ntasks-per-node) and thus throws an error in interactive or non-interactive jobs. I was able to resolve the issue by subclassing the default SLURMEnvironment and overriding all checks to make it look like no SLURM is available:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

class DisabledSLURMEnvironment(SLURMEnvironment):
    @staticmethod
    def detect() -> bool:
        # Pretend SLURM is not available so Lightning skips its SLURM logic.
        return False

    @staticmethod
    def _validate_srun_used() -> None:
        return

    @staticmethod
    def _validate_srun_variables() -> None:
        return

# ...

trainer = Trainer(
    accelerator="cpu",
    plugins=[DisabledSLURMEnvironment(auto_requeue=False)],
    **kwargs,
)

This seemed to result in Lightning running as it does on my local testing machine.

@MewmewWho

Thanks @DM-Berger! I tried to run my training script within a Jupyter server terminal on the University of Arizona's HPC and experienced exactly the same issue. With your solution, I just successfully ran my training. It seems good for now, and I'll comment further if any errors come up.

@awaelchli
Contributor

Hey @DM-Berger and @MewmewWho
I recently documented how to use Lightning with "interactive" mode here: https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#interactive-mode

When you request the SLURM machine, just set the job name to "bash" and then Lightning will bypass the SLURM detection. Then you don't need to make code changes.

@fraolBatole

Thanks, @DM-Berger. I also faced a similar error (RuntimeError: You set --ntasks=8 in your SLURM bash script, but this variable is not supported. HINT: Use --ntasks-per-node=8 instead.), but it took me a while to find this page. I'm glad I finally found it, as it was very helpful in resolving the issue.

@SavvaI

SavvaI commented May 21, 2023

> Hey @DM-Berger and @MewmewWho I recently documented how to use Lightning with "interactive" mode here: https://lightning.ai/docs/pytorch/latest/clouds/cluster_advanced.html#interactive-mode
>
> When you request the SLURM machine, just set the job name to "bash" and then Lightning will bypass the SLURM detection. Then you don't need to make code changes.

Since which PL version is this feature available? I tried it on 1.6.0 and it does not change anything.

@nikvaessen
Contributor

> I also had this issue on a Compute Canada cluster where Lightning (version 1.8.1) by default tries to use invalid SLURM arguments [...] I was able to resolve the issue by subclassing the default SLURMEnvironment and overriding all checks to make it look like no SLURM is available: [quoting the DisabledSLURMEnvironment snippet above]

This doesn't work (anymore), as the original SLURMEnvironment class is used directly in

https://github.com/Lightning-AI/pytorch-lightning/blob/1439da41b25a1efac581db35433e19e9d2b736d2/src/lightning/pytorch/trainer/connectors/accelerator_connector.py#L395C1

The solution is to monkey-patch the detect method of SLURMEnvironment:

from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

# Make detection always report that no SLURM cluster is present.
SLURMEnvironment.detect = lambda: False
trainer = Trainer(...)  # your usual Trainer arguments

@shahin-trunk

You can also let SLURM schedule a machine for you and then log in to the machine to run scripts manually. This is useful for development and debugging. If you set the job name to bash or interactive, and then log in and run your scripts, Lightning's SLURM auto-detection will be bypassed and Lightning can launch processes normally.
