Automated Slurm setup for development, testing and training.
- What is Slurm?
- Deployment options
- Prerequisites
- Install synpse agent
- Starting Slurm
- Connecting to Slurm
- Running a Slurm job
- Hacking
- Troubleshooting
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Slurm architecture:
So, it's not that easy to get it up and running, hence this guide!
This repository contains instructions and tooling to get you a Slurm environment where you can experiment with working sbatch, srun, sinfo and sacct commands.
- mini - JupyterLab, master, compute nodes and storage. Can run srun, sbatch, squeue, sacct.
- multi - (coming soon) All + multi-node.
- auto - (coming soon) automated Slurm operator that can automatically add more nodes into the cluster, manages configuration, shared storage and DNS.
- Linux VM, can be your own PC, university server or a cloud machine from your favorite provider such as Google Cloud Platform, AWS, Azure, Hetzner, etc.
- Synpse account - free
While the installation is straightforward, I will try to keep it verbose and simple. Our three main steps are:
- Set up the Synpse agent on your machine (it will deploy and run Slurm)
- Start Slurm
- Use the synpse CLI to securely connect to the Jupyter head node so we can run Slurm jobs
For cloud, first go to the "Provisioning" page:
And then click on the "Provision cloud VM":
Insert this into the cloud-init steps. Otherwise, if you already have the machine, use the steps from "Option 2".
With on-prem machines you can just go to the "Provisioning", then click on the "Provision device" and copy the command.
Now, SSH into your machine and run the command.
Once the device is online:
Go to the labels section and add a label: "type": "server". Synpse starts applications based on labels and this is what we have in our template.
Then, change the device name to mini-slurm. This will be helpful later, as the synpse CLI refers to the device by this name.
Click on the "default" namespace on the left and then go to the "secrets":
Here, create two secrets:
- slurmConf with contents from https://github.com/synpse-hq/slurm-cluster/blob/main/mini/slurm.conf
- slurmdbdConf with contents from https://github.com/synpse-hq/slurm-cluster/blob/main/mini/slurmdbd.conf
You can change the configuration here; it will be available to all Slurm components. For more information about slurm.conf, see https://slurm.schedmd.com/slurm.conf.html.
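When editing the secret it can help to see the handful of settings that tie the components together. The sketch below assumes you saved a local copy of the file as slurm.conf (a hypothetical path) and prints only a few well-known slurm.conf parameters. The helper name slurm_conf_summary is made up for illustration, and the actual file in the repository may set more or fewer of these.

```shell
#!/bin/bash
# Hypothetical helper: print the slurm.conf lines that name the cluster,
# the controller, the accounting backend, the nodes and the partitions.
slurm_conf_summary() {
  # slurm.conf parameter names are case-insensitive, hence -i.
  grep -iE '^(ClusterName|SlurmctldHost|SlurmctldPort|AccountingStorageType|AccountingStorageHost|NodeName|PartitionName)=' "$1"
}

# Assumption: a copy of the secret's contents was saved as ./slurm.conf.
if [ -f slurm.conf ]; then
  slurm_conf_summary slurm.conf
fi
```

The ClusterName value is the one that matters most later: it is the name under which the cluster is registered in the accounting database.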
Once you have the configuration secrets added, click on the "default" namespace again and then click "new application". Delete the existing example configuration and copy the yaml from this file https://github.com/synpse-hq/slurm-cluster/blob/main/mini/mini-slurm.yaml and click "Deploy".
This will create:
- A Slurm head node with Jupyter from which you can launch jobs
- Slurm controller node
- Slurmdbd node which acts as a database backend
- MariaDB database
- 3 compute nodes which will run the jobs
Data is mounted from the /data/slurm host directory and is used to imitate shared storage between the components. In a multi-node deployment this would be backed by NFS or a similar shared filesystem.
With Slurm you will normally interact using the sbatch and sacct commands. For that we have provided a container that has Jupyter, but to access it you first need to access the machine.
To install CLI, run:
curl https://downloads.synpse.net/install-cli.sh | bash
Then, go to your profile page https://cloud.synpse.net/profile and click on "New API Key". This will show you the command to authenticate.
With CLI configured, run:
synpse device proxy mini-slurm 8888:8888
This creates a secure tunnel to our Slurm cluster machine. Open it at http://localhost:8888.
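If the page does not load, it is worth checking from a second terminal whether anything is answering on the forwarded port. This is a minimal sketch, assuming curl is installed locally and that the proxy command keeps running in its own terminal. wait_for_http is a hypothetical helper name, not part of the synpse CLI.

```shell
#!/bin/bash
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_http URL [ATTEMPTS]
wait_for_http() {
  local url="$1" attempts="${2:-10}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail turns HTTP errors (>= 400) into a non-zero exit;
    # a redirect (for example to a login page) still counts as success.
    if curl --silent --fail --output /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example: wait_for_http http://localhost:8888 && echo "Jupyter is reachable"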
Start a terminal in Jupyter and run some jobs!
You can find some examples here: https://slurm.schedmd.com/sbatch.html.
For example, let's create a file job.sh with contents:
#!/bin/bash
#
#SBATCH --job-name=test-job
#SBATCH --output=result.out
#
#SBATCH --ntasks=6
#
echo hello slurm
Now, to run it:
sbatch job.sh
You should see something like:
Submitted batch job 11
and to see the results:
cat result.out
hello slurm
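To follow a job after submitting it, the job ID from the sbatch message can be passed to squeue and sacct. The following is a sketch rather than part of the repository: job_id_from_output is a hypothetical helper, and the snippet only contacts Slurm when sbatch and job.sh are available, for example in the Jupyter terminal.

```shell
#!/bin/bash
# Hypothetical helper: pull the job ID out of sbatch's confirmation line,
# e.g. "Submitted batch job 11".
job_id_from_output() {
  printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $NF }'
}

# Only talk to Slurm when the client tools and job.sh are present.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
  job_id="$(job_id_from_output "$(sbatch job.sh)")"
  # Queue entry while the job is pending or running.
  squeue --jobs="$job_id"
  # Accounting record for the job.
  sacct --jobs="$job_id" --format=JobID,JobName,State,ExitCode
fi
```

sacct reads from the accounting database, so it depends on the slurmdbd and MariaDB components listed earlier being up.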
You can check the Makefile for image build targets. Set your REGISTRY=my-own-registry and then run:
make all
This will rebuild all Docker images.
You can edit the docker/node/Dockerfile and add more packages into the image. Then, rebuild the images, restart your application and you have it :)
Slurm master somehow stopped?
Either restart the whole Slurm application, or SSH into the machine, docker exec into the slurmmaster container and start the service within it:
root@slurmmaster:~# service slurmctld start
* Starting slurm central management daemon slurmctld
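To tell whether the controller is the problem before restarting anything, scontrol ping can be run from the Jupyter terminal. This is a sketch, assuming its output contains "is UP" for a healthy controller (the exact wording may vary between Slurm versions) and that the container is named slurmmaster, as the prompt above suggests. controller_is_up is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: succeed when the given `scontrol ping` output
# reports a controller as UP.
controller_is_up() {
  printf '%s\n' "$1" | grep -q ' is UP'
}

# Only query Slurm when the client tools are installed.
if command -v scontrol >/dev/null 2>&1; then
  status="$(scontrol ping 2>&1)"
  if controller_is_up "$status"; then
    echo "slurmctld is reachable"
  else
    echo "slurmctld looks down; on the host, try:"
    # Assumption: the controller container is named "slurmmaster".
    echo "  docker exec slurmmaster service slurmctld start"
  fi
fi
```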
Port is set but you still see the error "slurmdbd: error: _add_registered_cluster: trying to register a cluster (cluster) with no remote port" in the slurmdbd logs?
Check whether the cluster exists and create it if it doesn't:
root@slurmjupyter:/home/admin# sacctmgr show cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
root@slurmjupyter:/home/admin# sacctmgr create cluster cluster
Adding Cluster(s)
Name = cluster
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
root@slurmjupyter:/home/admin#
root@slurmjupyter:/home/admin# sacctmgr show cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
cluster 0 0 1 normal
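The same check-then-create step can be scripted without the interactive prompt. This is a sketch under two assumptions: the cluster name is "cluster" as in the transcript above, and name_in_list is a hypothetical helper introduced only for this example.

```shell
#!/bin/bash
# Assumption: "cluster" is the cluster name used in the transcript above.
cluster_name="cluster"

# Hypothetical helper: succeed if NAME appears as a whole line in LIST.
# Usage: name_in_list NAME LIST
name_in_list() {
  printf '%s\n' "$2" | grep -qxF -- "$1"
}

# Only query Slurm when sacctmgr is installed.
if command -v sacctmgr >/dev/null 2>&1; then
  # --noheader and --parsable2 print one bare cluster name per line.
  known="$(sacctmgr --noheader --parsable2 show cluster format=Cluster)"
  if ! name_in_list "$cluster_name" "$known"; then
    # --immediate commits the change without the confirmation prompt.
    sacctmgr --immediate create cluster "$cluster_name"
  fi
fi
```

Running it a second time is harmless, since the cluster is only created when it is missing from the list.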