
FLAMESlurmBackend

The FLAME Slurm backend lets you use FLAME on Slurm HPC clusters.

Installation

def deps do
  [
    {:flame_slurm_backend, github: "marcnnn/flame_slurm_backend"}
  ]
end

Usage

Configure the FLAME backend in your configuration or application setup:

  # application.ex
  children = [
    {FLAME.Pool,
     name: MyApp.SamplePool,
     code_sync: [
       start_apps: true,
       sync_beams: Kino.beam_paths(),
       compress: false,
       extract_dir:
         {FLAMESlurmBackend.SlurmClient, :path_job_id, [Path.absname("extract_dir") <> "/"]}
     ],
     min: 0,
     max: 1,
     max_concurrency: 1,
     idle_shutdown_after: :timer.minutes(10),
     timeout: :infinity,
     boot_timeout: 360_000,
     track_resources: true,
     backend: {FLAMESlurmBackend,
       slurm_job: """
       #!/bin/bash
       #SBATCH -o flame.%j.out
       #SBATCH --nodes=1
       #SBATCH --ntasks-per-node=1
       #SBATCH --time=01:00:00
       #SBATCH --mem=20G

       export SLURM_FLAME_HOST=$(ip -f inet addr show ib0 | awk '/inet/ {print $2}' | cut -d/ -f1)
       """
     }}
  ]

The slurm_job script defines the Slurm batch job that is submitted for each spawned runner.

The SLURM_FLAME_HOST environment variable is explicitly set to the IP address of the InfiniBand interface (ib0), which the Erlang VM distribution layer will then use for low-latency, high-bandwidth communication:

export SLURM_FLAME_HOST=$(ip -f inet addr show ib0 | awk '/inet/ {print $2}' | cut -d/ -f1)

You will need to start the parent Erlang VM (the one that configures FLAME) with the same configuration. If you are using Livebook, here is a script that starts it with CUDA 12.5 and cuDNN installed in $HOME:

#!/bin/bash
export CUDA=/usr/local/cuda-12.5/
export CUDNN=$HOME/cudnn-linux-x86_64-9.5.0.50_cuda12-archive/
export PATH=$PATH:$CUDA/bin
export CPATH=$CPATH:$CUDNN/include:$CUDA/include
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDNN/lib
export MIX_INSTALL_DIR=$WORK/mix-cache

export SLURM_FLAME_HOST=$(ip -f inet addr show ib0 | awk '/inet/ {print $2}' | cut -d/ -f1)

epmd -daemon
LIVEBOOK_IP=0.0.0.0 LIVEBOOK_PASSWORD=***** MIX_ENV=prod livebook server --name livebook@$SLURM_FLAME_HOST
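
Once the pool above is running and the parent VM is started as described, work can be placed on a Slurm runner through the regular FLAME API. A minimal example, using the pool name from the configuration above:

# Runs the anonymous function on a runner booted by the Slurm backend;
# the first call may take up to boot_timeout while the job is scheduled.
FLAME.call(MyApp.SamplePool, fn ->
  node()
end)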

Prerequisites

The FLAME parent and the Slurm cluster nodes need to be able to connect to each other via Erlang RPC.
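
A quick way to check this from the parent node, assuming the Erlang cookie matches on both sides; the cookie and node name below are hypothetical placeholders:

# In IEx/Livebook on the parent. Replace the cookie and node name with
# your cluster's actual values.
Node.set_cookie(:my_cluster_cookie)
Node.ping(:"flame_runner@10.0.0.12")  # :pong means Erlang RPC works, :pang means it does not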

Env Variables

In order for the runners to be able to join the cluster, the environment variables described above (in particular SLURM_FLAME_HOST) need to be set both for the parent node and inside the Slurm job script.
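
A simple sanity check on the parent node that the variable is set before the pool boots any runners:

# Raises if SLURM_FLAME_HOST is missing from the parent's environment.
System.fetch_env!("SLURM_FLAME_HOST")

# The parent's node name should be reachable under that host, e.g.
# :"livebook@10.0.0.11" (hypothetical address).
node()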

How it works

The FLAME Slurm backend needs to run inside the cluster. The backend submits a command that queues a job on the cluster. The job is then scheduled once resources are available; if it is not scheduled within the boot timeout, it is canceled so that it does not block resources that are no longer needed. To run the runner with the correct environment, a cluster-specific bash script (the slurm_job above) needs to be created.
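
As an illustration only, here is a highly simplified sketch of that flow; it is not the backend's actual implementation, and the module name and message shape are invented:

# Illustration of the queue/boot/cancel flow described above; NOT the
# real FLAMESlurmBackend code.
defmodule SlurmFlowSketch do
  @boot_timeout :timer.minutes(6)

  def boot_runner(job_script_path) do
    # 1. Queue the job; sbatch prints "Submitted batch job <id>".
    {out, 0} = System.cmd("sbatch", [job_script_path])
    job_id = out |> String.split() |> List.last()

    # 2. Wait for the runner node to connect back. If that does not happen
    #    within the boot timeout, cancel the job so it does not keep
    #    blocking resources that are no longer needed.
    receive do
      {:runner_up, node} -> {:ok, node, job_id}
    after
      @boot_timeout ->
        System.cmd("scancel", [job_id])
        {:error, :boot_timeout}
    end
  end
end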

Cleanup

Slurm is configured to send a SIGTERM signal to FLAME 30 seconds before it terminates the job, so FLAME can start cleaning up its temporary files.

If your Slurm cluster is not configured to delete the tmp folder, you can use OTP supervisors to delete the artifacts you create on termination.

This implementation in FLAME is a good reference for how to do that: https://github.com/phoenixframework/flame/commit/e64ad84b695a7569a351b7e5717c27db97f2451c
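
As a minimal sketch of that idea (module name and directory are hypothetical), a process that traps exits can remove its own artifacts in terminate/2:

# Supervised process that deletes its working directory when it is
# shut down, e.g. when the runner terminates.
defmodule MyApp.TmpCleaner do
  use GenServer

  def start_link(dir), do: GenServer.start_link(__MODULE__, dir)

  @impl true
  def init(dir) do
    Process.flag(:trap_exit, true)  # ensure terminate/2 runs on shutdown
    {:ok, dir}
  end

  @impl true
  def terminate(_reason, dir) do
    # Delete artifacts created during the run before Slurm kills the job.
    File.rm_rf(dir)
    :ok
  end
end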

Long running Jobs

If your job's wall-clock time is limited, Slurm will kill the job while it is still running. There is currently no mechanism to stop using a runner whose time is about to run out; this would be a much-appreciated contribution. Be aware that you might lose data because of this.
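
One possible starting point, assuming your Slurm version exports SLURM_JOB_END_TIME (a Unix timestamp) into the job environment, which is not the case on every cluster:

# Hedged sketch: lets a runner check how much wall-clock time is left
# before accepting more work. Returns :unknown if the variable is absent.
defmodule MyApp.TimeLeft do
  def seconds_remaining do
    case System.get_env("SLURM_JOB_END_TIME") do
      nil -> :unknown
      end_time -> String.to_integer(end_time) - System.os_time(:second)
    end
  end
end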

Troubleshooting
