
How To Use Compute Canada Clusters


What are clusters anyway?

Simply put, a Computer Cluster is an agglomeration of computing resources that are easily accessible and shared fairly among its users.

Cluster Overview

A cluster is composed of 4 main parts:

  • The Login Nodes allow you to connect to the cluster, set up the software you'll need and launch experiments (jobs) on the Compute Nodes.
  • The Compute Nodes do the heavy lifting, as they contain the actual computing resources. They replicate your environment from the login node and run your jobs.
  • The Scheduler is an intermediary between you and the Compute Nodes and is in charge of enforcing the fair sharing of resources. It does so by first determining which compute node has the resources you requested, then determining your priority using various factors, and then putting you at the right spot in the waiting queue.
  • The Distributed File System is shared by all the nodes, so your files are available everywhere on the cluster.

What is Compute Canada?

Compute Canada is a government-funded organization that builds, manages and maintains clusters for all Canadian scientists in academia. It is composed of 4 child organisations (regional partners) that manage the clusters locally in each region.

Creating a Compute Canada account

To access the clusters you first have to create an account at https://ccdb.computecanada.ca. Use a password with at least 8 characters, including mixed-case letters, digits and special characters. Later you will be asked to create another password with the same rules, and it is much more convenient if the two passwords are the same.

After creating your account, you have to apply for a “role” at https://ccdb.computecanada.ca/me/add_role. This means telling Compute Canada which professor/supervisor (called a "sponsor" here) you are affiliated with. This allows them to know which clusters you can have access to and to track your usage.

You will need to wait for your sponsor to accept your request before going to the next step.

Choose your clusters

After receiving confirmation that your sponsor accepted your request, you’ll need to apply for a consortium account at https://ccdb.computecanada.ca/me/facilities. This implies creating a second account with the regional partner (e.g., Calcul Québec) that manages the cluster you want access to.

The GPU clusters for Calcul Québec are Guillimin, Hadès and Helios. If you need CPU computing power, I recommend applying for Mammouth. Ask your sponsor if they have a special allocation on any of those. You can always edit those choices later here. For more details on the hardware each one offers, see these pages: GPUs & CPUs

Note: The username and password you choose here will be the ones used to log in to those clusters.

Accessing the clusters

To log into a cluster, you simply ssh to the right entry point of the cluster you want. Those entry points are called interactive or login nodes; they are meant for setting up the dependencies you need and launching experiments on the compute nodes.

Do not use the interactive nodes to run full experiments, as this will get you banned. You can use them for quick tests only.

You will usually get the URL for a given cluster through an email from the regional partner, but here are some of them:

guillimin.hpc.mcgill.ca
helios.calculquebec.ca
hades.calculquebec.ca
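
For example, a login might look like this, where <username> is the consortium username you chose earlier:

$ ssh <username>@guillimin.hpc.mcgill.ca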

Setting up your environment

Once you are logged in, it's time to set up your environment. The first thing to do is to ask your supervisor or the person in charge of your group whether there is a prepared software stack to load. This will save you the trouble of installing everything yourself and save a lot of group disk space.

Using Module

All clusters give you access to a library of software that you can load on demand. The standard way to access that software is with module. Be careful if you have a group stack: what you load here might conflict with it.

$ module list

Gives you the list of modules loaded for your user (i.e., the software you can use at the moment).

$ module avail 

Gives a full list of what is available.

$ module avail <module_name>

Lists all modules containing the given name.

$ module spider <module_name>

Performs a detailed search that supports regular expressions.

$ module load <module_name>

Loads a given module for the current session. It won’t be reloaded at your next login.

To load this module at login time, add that command to your ~/.bashrc.
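
For example, with a hypothetical module name (check module avail for the exact name on your cluster), the line in your ~/.bashrc would look like:

# in ~/.bashrc -- hypothetical module name, check `module avail` for the real one
module load cuda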

$ module unload <module_name> 

Unloads a given module for the current session.

Finally, always read module --help before using it; each cluster has its own small differences.

Local installs

If you need extra software that is not available through module, the standard way to install it locally on Linux is under ~/.local.

You will probably need to add the following exports to your ~/.bashrc.

export PATH=~/.local/bin/:$PATH
export CPATH=~/.local/include/:$CPATH
export C_INCLUDE_PATH=~/.local/include/:$C_INCLUDE_PATH
export CPLUS_INCLUDE_PATH=~/.local/include/:$CPLUS_INCLUDE_PATH
export LD_LIBRARY_PATH=~/.local/lib64/:~/.local/lib/:$LD_LIBRARY_PATH
export LIBRARY_PATH=~/.local/lib64/:~/.local/lib/:$LIBRARY_PATH
export PYTHONPATH=~/.local/lib/:~/.local/lib/python2.7/site-packages/:$PYTHONPATH

If you have a group stack, this might already be there. Always talk to the person managing your group before installing software.
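
As a rough sketch (the archive and package names below are only placeholders), installing something from source into ~/.local usually comes down to pointing its build system at that prefix:

$ tar xzf some-package.tar.gz          # placeholder archive name
$ cd some-package
$ ./configure --prefix=$HOME/.local    # build systems vary; autotools shown here
$ make
$ make install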

Using pip

If you are using Python, as I do, pip is very useful for installing packages locally. If you are using Python 3, use pip3 instead.

$ pip list

List all installed packages.

$ pip search <package_name>

List all packages available in the PyPI repository whose names contain the text you specified.

$ pip install --user <package_name>

Will install the python package locally in ~/.local.

$ pip install --user --upgrade <package_name>

Will upgrade the Python package locally in ~/.local, even if the package was pre-installed on the system.

$ pip install --user git+<git_repo_url>

Will install the development version of that package locally in ~/.local. This works only if the repo is a proper python package with a setup.py.

$ pip install --user git+<git_repo_url>@<branch_name>

Same as above but for a given branch.

$ git clone <git_repo_url>
$ pip install --user -e <cloned_folder>

This will install your cloned repo locally in editable mode, which lets you use your development version without playing with PYTHONPATH. This might not work with old versions of Canopy.

Launching jobs

Once you’re all set it’s time to learn how to launch jobs on the compute nodes.

Here you have 2 choices:

  • Either you learn how to create PBS (Portable Batch System) files from various sources 1 2 3 4 and launch them with qsub or msub, depending on the cluster;
  • Or you learn to use tools like Jobman/Jobdispatch or Smart Dispatch that handle most/all of the boilerplate for you and more.

I strongly recommend using the latter.

The following concepts will be useful in both cases.

Queue:

A queue is a waiting list to use a specific set of resources. For example, on Guillimin, the k20 queue will run your job on nodes that have 64 GB of RAM, 16 cores and two K20 GPUs.

To list all existing queues on a given cluster, you can use qstat -Q or take a look at the full (not quite up-to-date) list on Calcul Québec’s wiki.
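
For example, on a login node:

$ qstat -Q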

The test queues are meant to help you test your code rapidly before launching huge experiments. They have a small max_walltime, which ensures that your test will start quickly with high priority.

Walltime:

Walltime is the maximum time you think your code will run; the job will be killed once the walltime is reached. It’s important to know your running time, because the smaller the walltime you ask for, the faster your job will start (i.e., higher priority).

PBS files + msub/qsub

PBS file example submit.sh

#!/bin/bash
#PBS -V                    # export your current environment variables to the job
#PBS -q test               # submit to the test queue
#PBS -A <YOUR_RAPID>       # your allocation identifier (RAP ID)
#PBS -l walltime=15:00     # maximum run time (here 15 minutes)
#PBS -l nodes=1:gpus=1     # request 1 node with 1 GPU

# Commands #
nvidia-smi

wait                       # wait for any background commands to finish

Launch with $ msub submit.sh

Interactive jobs or starting a terminal on a compute node.

$ msub -l walltime=00:00:15:00,nodes=1:gpus=1 -A <YOUR_RAPID> -I -qtest

Remember that clusters are not meant to be used this way. This should be used for debugging only.

Using a dispatcher

To do the same thing as above with Smart Dispatch:

$ smart_dispatch.py -qtest launch nvidia-smi

Managing jobs

Job status codes:

  • R = Running
  • Q = Queued
  • C = Completed
  • E = Error

To monitor the status of your jobs:

$ qstat -u $USER

To see the status of your job on a given queue:

$ qstat -u $USER gpu_1

To see a guesstimate of when your job will start (note: don't obsess over it; launch your job and go get a coffee):

$ showstart <job_id>

To kill a job:

$ qdel <job_id>

Cluster Specific Commands

Helios

This command will give you information about the current load of the cluster, allocation and disk quota usage:

$ helios-info