Disable automatic SLURM Detection #6389
Comments
I think this is more of a bug fix than an enhancement.
Sort of a duplicate of #6204.
I said sort of because the linked issue comments also ask for a mechanism to disable SLURM detection, even though the original post is about the connection between HPC checkpoints and …
I'm happy to implement that, although I've since come around on the automatic SLURM detection (I think the right thing for me is to disable my other library's automatic detection). The simplest way forward is probably to add a …
That seems like the best course of action.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Could someone re-open this? This is still a problem.
What was the original intent of the SLURM detection? Would it make sense to maybe add a …
Does someone want to make a PR for this? I suppose I could.
Hey @amogkam, in PyTorch Lightning master, the SlurmConnector was refactored into a SignalConnector component. As a temporary solution, you could patch its … Best,
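For reference, a minimal sketch of what such a patch could look like; the handler-registration method name is an assumption and may differ between Lightning versions:

```python
# Hedged sketch: neutralize the SignalConnector so Lightning never installs its
# SLURM requeue signal handlers. Apply the patch before creating the Trainer.
from pytorch_lightning.trainer.connectors.signal_connector import SignalConnector

SignalConnector.register_signal_handlers = lambda self: None  # assumed method name
```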
@tchaton instead of having to patch it, is it possible to add a flag to disable it?
We have users facing this issue as well. Namely, these users use https://github.com/facebookincubator/submitit. There is contention between SubmitIt and Lightning over resubmitting jobs to the scheduler. My opinion is that the Lightning library should not be in the business of requeuing jobs like this.
I think we should keep the feature, as it does seem to be used by the community (an impression I've got from Slack questions and other issues). However, I think there's value in revamping the SLURM mechanism so that it does not leak into other components and can be easily switched on/off. This would be a similar idea to the addition of the …
+1 for adding an option to disable SLURM signal handling. I would actually vote for this feature to be opt-in rather than on by default, since it conflicts with how other packages such as Submitit handle pre-emption and requeuing. This has made it challenging to debug the source of issues around pre-emption.
What if we let the cluster plugins configure signals? Would that help? This is a quick thought without much investigation.
@awaelchli @carmocca since this support for resubmission is only for SLURM, we could either: …
Whichever approach above is taken, the flag would need to be checked here: https://github.com/PyTorchLightning/pytorch-lightning/blob/ff3443fe42e9ad03e0604e50b0ae53f27ac2faac/pytorch_lightning/trainer/connectors/signal_connector.py#L52-L81 to decide whether we register the signal handlers. Following this addition, we could determine whether to make it opt-in or opt-out.
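As a rough, standalone illustration of where such a guard could sit (not the library's actual code; the handler body and the auto_requeue check are placeholders based on the flag discussed later in this thread):

```python
import signal

from pytorch_lightning.plugins.environments import SLURMEnvironment


def _requeue_handler(signum, frame):
    # Placeholder for the requeue logic that lives in the real signal connector.
    print(f"Received signal {signum}; the SLURM job would be requeued here.")


def register_signal_handlers(environment) -> None:
    # Hypothetical guard: skip installing the SLURM requeue handler entirely
    # when the cluster environment has opted out of automatic requeuing.
    if isinstance(environment, SLURMEnvironment) and not getattr(environment, "auto_requeue", True):
        return
    signal.signal(signal.SIGUSR1, _requeue_handler)
```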
Could the SLURM users comment on whether they prefer opt-in or opt-out, i.e., default …
Opt-out, but that's mostly because I changed my submitit workflow to not requeue, since Lightning was doing it.
Opt-out (…)
OK, so now I'm a little confused; I was reading opt-out as …
Yes @Queuecumber, that's right. In summary: …
That was my confusion, my apologies. I prefer …
Opened PR #10601. Feel free to test or review it :)
For now we went with the non-breaking change of setting Trainer(plugins=[SLURMEnvironment(auto_requeue=False)]) (for use with submitit, for example). If we get more feedback from the PL SLURM community about switching to … If we were to flip the …
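For context, a minimal example of that opt-out, assuming the pytorch_lightning 1.x import paths:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Keep running under SLURM, but let an external tool (e.g. submitit) own job
# requeuing instead of Lightning's signal handler.
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])
```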
I also had this issue on a Compute Canada cluster, where Lightning (version 1.8.1) by default tries to use invalid SLURM arguments (e.g. …). I worked around it by subclassing SLURMEnvironment to disable both the detection and the srun validation:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment


class DisabledSLURMEnvironment(SLURMEnvironment):
    @staticmethod
    def detect() -> bool:
        return False

    @staticmethod
    def _validate_srun_used() -> None:
        return

    @staticmethod
    def _validate_srun_variables() -> None:
        return


# ...
trainer = Trainer(
    accelerator="cpu",
    plugins=[DisabledSLURMEnvironment(auto_requeue=False)],
    **kwargs,  # remaining Trainer arguments
)
```

This seemed to result in Lightning running as it does on my local testing machine.
Thanks @DM-Berger! I tried to run my training script from a Jupyter server terminal on the University of Arizona's HPC and hit exactly the same issue. With your solution, I just successfully ran my training. It seems good for now, and I'll comment again if any errors come up.
Hey @DM-Berger and @MewmewWho, when you request the SLURM machine, just set the job name to "bash" and Lightning will bypass the SLURM detection. Then you don't need to make any code changes.
Thanks, @DM-Berger. I also faced a similar error (RuntimeError: You set --ntasks=8 in your SLURM bash script, but this variable is not supported. HINT: Use --ntasks-per-node=8 instead.), but it took me a while to find this page. I'm glad I finally found it, as it was very helpful in resolving the issue.
From which PL version is this feature available? I tried it on 1.6.0 and it does not change anything.
This doesn't work (anymore), as the original … The solution is to monkey-patch the detect method of SLURMEnvironment:

```python
from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

SLURMEnvironment.detect = lambda: False
trainer = Trainer(...)
```
You can also let SLURM schedule a machine for you and then log in to the machine to run scripts manually. This is useful for development and debugging. If you set the job name to bash or interactive, and then log in and run scripts, Lightning's SLURM auto-detection will be bypassed and it can launch processes normally.
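A simplified sketch of why the job-name trick works; this illustrates the kind of check involved rather than Lightning's exact detection code:

```python
import os


def looks_like_managed_slurm_job() -> bool:
    # Simplified sketch (not Lightning's exact code): treat the process as a
    # real SLURM batch job only when SLURM variables are present and the job
    # name is not an interactive one such as "bash".
    if "SLURM_NTASKS" not in os.environ:
        return False
    return os.environ.get("SLURM_JOB_NAME") not in ("bash", "interactive")
```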
🚀 Feature
I want to run my PTL script on a SLURM cluster, but I don't want to use the built-in SlurmConnector, since I am using Ray to handle distributed training. Using Ray together with the SlurmConnector results in this error: #3651. Is there a way I can disable automatic SLURM detection? Having an environment variable to disable this would really help. Thanks!