A Minimal Know How - HPC Computing Facility

A small tutorial on how to run HPC jobs, meant for the IISER Mohali community.

How do I connect to HPC?

To connect from a Mac or Linux machine, open the Terminal application and use your SSH command. Details can be found here
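A minimal sketch of the SSH command (the address below is a placeholder; use the address given in the linked instructions):

---------------------------------------------
$ ssh [email protected]
---------------------------------------------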

Can I compile code?

Yes.

We have the full GNU toolchain available on both login nodes, so the usual compilation tools such as autoconf, automake, libtool, make, gcc, g++, gfortran, gdb, ddd, java, python, perl, etc. are available to you. Please let us know if there are other tools or libraries you need that aren’t available.

Compiling your own code

Use the GNU toolchain.
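For example, a minimal sketch of compiling your own C or Fortran code on the login node (the file names below are placeholders):

---------------------------------------------
$ gcc -O2 -o mycode mycode.c        # compile a C program
$ gfortran -O2 -o mysim mysim.f90   # compile a Fortran program
---------------------------------------------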

Adding a Package

Adding a package is similar to keeping an executable of your own program. If you keep any executable in your area, it will be accessible on all compute nodes.

Details are given here.
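As a minimal sketch, assuming your home directory is shared across the nodes (as noted in the MPI section below), you could keep executables in a personal bin directory and add it to your PATH ('my_executable' is a placeholder name):

---------------------------------------------
$ mkdir -p ${HOME}/bin
$ cp my_executable ${HOME}/bin/
$ export PATH=${HOME}/bin:${PATH}    # add this line to ~/.bashrc to make it permanent
---------------------------------------------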

Batch Job & Queues

Job Scheduler

Our cluster is configured with the SGE batch job submission tool. The Sun Grid Engine (SGE) queuing system is useful when you have a lot of tasks to execute and want to distribute them over a cluster of machines. It has three basic features:

  • Scheduling - allows you to schedule a virtually unlimited amount of work to be performed when resources become available. This means you can simply submit as many tasks (or jobs) as you like and let the queuing system handle executing them all.

  • Load Balancing - automatically distributes tasks across the cluster such that any one node doesn’t get overloaded compared to the rest.

  • Monitoring/Accounting - ability to monitor all submitted jobs and query which cluster nodes they are running on, whether they’re finished, encountered an error, etc. Also allows querying job history to see which tasks were executed on a given date, by a given user, etc.

If you would like to know more, plenty of documentation is available on the web. Let me start with a minimal set of information (mostly collected from several web sources).

Submitting Jobs

A job in SGE represents a task to be performed on a node in the cluster and contains the command line used to start the task. A job may have specific resource requirements but in general should be agnostic to which node in the cluster it runs on as long as its resource requirements are met.

All jobs require at least one available slot on a node in the cluster to run.

Submitting jobs is done using the 'qsub' command. Let’s try submitting a simple job that runs the 'hostname' command on a given cluster node:

sjena@usernode~$ qsub -V -b y -cwd hostname
Your job 1 ("hostname") has been submitted
  • The -V option to qsub states that the job should have the same environment variables as the shell executing qsub (recommended).

  • The -b option to qsub states whether the command being executed is a single binary executable or a bash script. It takes a y or n argument: y means the command is a binary, n means it is not. In this case the command hostname is a single binary.

  • The -cwd option to qsub tells Sun Grid Engine that the job should be executed in the same directory from which qsub was called.

  • The last argument to qsub is the command to be executed (hostname in this case).

Notice that the qsub command, when successful, will print the job number to stdout. You can use the job number to monitor the job’s status and progress within the queue as we’ll see in the next section.

Job submissions are done on a dedicated node called 'usernode' on our HPCIISERM; execution happens on the compute nodes. You are requested not to execute any job on 'usernode', the node you log in to. If you run manual jobs on usernode, they will be killed automatically. Therefore, you [red]#MUST# use SGE commands for job submission, for instance 'qsub'. If you use 'qsub', the scheduling will be taken care of automatically.

Never log in to 'hpciiserm', and never submit any job on 'hpciiserm': jobs will be terminated without notice and the automated script might block your account.

Monitoring Jobs in the Queue

Now that our job has been submitted, let’s take a look at the job’s status in the queue using the command 'qstat':

[sjena@usernode ~]$ qstat
job-ID  prior name  user state submit/start at queue      slots      ja-task-ID
...............................................................................
4001 0.55500 hostname   sjena r  03/03/2017 10:16:32 [email protected]  1
[sjena@usernode ~]$

From this output, we can see that the job is in the 'r' state, which means it is running on node compute-0-0. Once the job has finished, it will be removed from the queue and will no longer appear in the output of 'qstat'. You should then see the job's outputs.

Outputs

SGE creates stdout and stderr files in the job’s working directory for each job executed. If any additional files are created during a job’s execution, they will also be located in the job’s working directory unless explicitly saved elsewhere (discussed later). The job’s stdout and stderr files are named after the job, with an extension ending in the job’s number. For the simple job submitted above, we have:

[sjena@usernode ~]$ ls
hostname.e4001  hostname.o4001  mpi
[sjena@usernode ~]$ cat hostname.e4001
[sjena@usernode ~]$ cat hostname.o4001
compute-0-0.local
[sjena@usernode ~]$

Notice that SGE automatically named the job 'hostname' and created two output files: 'hostname.e4001' and 'hostname.o4001'. The 'e' stands for stderr and the 'o' for stdout. The '4001' at the end of the files’ extension is the job number. So if the job had been named 'extraordinary_job' and was job '#47' submitted, the output files would look like: 'extraordinary_job.e47' and 'extraordinary_job.o47'
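If you would rather pick the job name yourself, qsub's -N option sets it explicitly. A sketch reusing the example above:

---------------------------------------------
$ qsub -V -b y -cwd -N extraordinary_job hostname
---------------------------------------------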

Deleting a Job

What if a job is stuck in the queue, is taking too long to run, or was simply started with incorrect parameters? You can delete a job from the queue using the 'qdel' command in SGE. Below we launch a simple job 'mpi-ring.qsub', and we can kill it using 'qdel':

[sjena@usernode mpi]$ qsub -pe orte 24 mpi-ring.qsub
Your job 4009 ("mpi-ring.qsub") has been submitted

Check the Job status

[sjena@usernode mpi]$ qstat
job-ID  prior name  user state submit/start at queue      slots      ja-task-ID
...............................................................................
   4009 0.00000 mpi-ring.q sjena        qw    03/03/2017 11:24:09            24

'qw' means the job is in the waiting stage. You could kill it here, but we want the job to run first and then kill it. Checking again after a few seconds:

[sjena@usernode mpi]$ qstat
job-ID  prior name  user state submit/start at queue      slots      ja-task-ID
...............................................................................
   4009 0.55500 mpi-ring.q sjena r 03/03/2017 11:24:17 [email protected] 24

'r' means the job has started. Send a kill signal with 'qdel jobid' and check the status.

[sjena@usernode mpi]$ qdel 4009
sjena has registered the job 4009 for deletion

[sjena@usernode mpi]$ qstat

After running qdel you’ll notice the job is gone from the queue: 'qstat' returns nothing, i.e. 'qdel' has killed the job.

Monitoring Cluster Usage

SGE uses 'qstat' to check job status. I have submitted 10 jobs, which need 10 cores each. Typing 'qstat' shows them all:

[sjena@usernode mpi]$ qstat
job-ID  prior name  user state submit/start at queue      slots      ja-task-ID
...................................................................................
   4010 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4011 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4012 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4013 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4014 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4015 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4016 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4017 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4018 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
   4019 0.55500 mpi-ring.q sjena  r 03/03/2017 11:35:02 [email protected] 10
[sjena@usernode mpi]$
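For the record, a batch like this can be submitted with a simple shell loop (a sketch reusing 'mpi-ring.qsub' from the previous section):

---------------------------------------------
$ for i in $(seq 1 10); do qsub -pe orte 10 mpi-ring.qsub; done
---------------------------------------------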

You can also view the average load (load_avg) per node using the ‘-f’ option to qstat:

[sjena@usernode mpi]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch      states
..............................................................................
[email protected]        BIP   0/0/12         0.00     linux-x64
..............................................................................
[email protected]        BIP   0/0/12         0.00     linux-x64
..............................................................................
[email protected]       BIP   0/0/12         0.00     linux-x64
..............................................................................
[email protected]       BIP   0/12/12        0.00     linux-x64
   4010 0.55500 mpi-ring.q sjena        r     03/03/2017 11:35:02    10
   4011 0.55500 mpi-ring.q sjena        r     03/03/2017 11:35:02     2
..............................................................................
[email protected]       BIP   0/12/12        0.00     linux-x64
   4011 0.55500 mpi-ring.q sjena        r     03/03/2017 11:35:02     8
   4012 0.55500 mpi-ring.q sjena        r     03/03/2017 11:35:02     4
....

....
..............................................................................
[email protected]        BIP   0/0/12         0.00     linux-x64
..............................................................................
[email protected]        BIP   0/0/12         0.00     linux-x64
..............................................................................
[sjena@usernode mpi]$

qsub scripts

In the ‘Submitting Jobs’ section we submitted a single command, 'hostname'. This is useful for simple jobs, but for more complex jobs where we need to incorporate some logic we can use a so-called 'job script'. A job script is essentially a bash script that contains some logic and executes any number of external programs/scripts. The shell script that you submit (for example 'job_name.sh') should be written in bash and should completely describe the job, including where the inputs and outputs are to be written (if not specified, the default is your home directory). The following is a simple shell script that defines bash as the job environment, calls 'date', waits for 20 seconds and then calls it again.

#!/bin/bash

# request Bourne shell as shell for job
#$ -S /bin/bash

# print date and time
date
# Sleep for 20 seconds
sleep 20
# print date and time again
date

Note that your script has to include (usually at the end) at least one line that executes something: generally a compiled program, but it could also be a Perl or Python script (which could itself invoke a number of other programs). Otherwise your SGE job won’t do anything.

And to submit the above script 'job_name.sh', you would do:

[sjena@usernode mpi]$ qsub -V job_name.sh
Your job 4048 ("job_name.sh") has been submitted

Using qsub scripts to keep data local
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

HPC depends on a network-shared '/data' filesystem. The actual disks are on a network file server
node, so users are local to the data when they log in. However, when you submit an SGE job, unless
otherwise specified, the compute nodes have to read the data over the network and write it back
across the network. This is fine when the total data involved is a few MB, as is often the case with
molecular dynamics runs: small data in, lots of computation, small data out. However, if your
jobs involve 100s or 1000s of MB, the network traffic can grind the entire cluster to a halt.

To prevent this network jam, there is a large '/tmp' directory on each node (writable by all
users, but 'sticky': files written there can only be deleted by the user who wrote them or by an
admin). However, if you use hundreds of GB, the onus is on you to clean up your files and decrease
that usage as soon as you’re done with them. (An automatic cleanup script will be added if needed.)

The following is an example script (self-explanatory):

-----------------------------------------------
#!/bin/bash
################################
# Example to use /tmp space - qsub script
# Written for IISER HPC community
# Author: S. Jena
# Sat Mar  4 14:23:52 IST 2017
################################
#
###### BEGIN SGE PARAMETERS - note the '#$' prefix ######
###### DO NOT SET THE -cwd flag for a /tmp job
#
#$ -S /bin/bash
# specify the name of the job displayed in 'qstat' output
#$ -N sjena-job
#
######## Where to keep Log Output #########
# Make sure you have a directory named 'log' in your HOME
#$ -o log/
#$ -e log/

###### BEGIN  /tmp DIR CODE  ######
# set STDATA to point to the node-local /tmp dir and make sure you
# place the files in your own subdir. '${USER}' is a global environment
# variable inherited by all your processes, so you shouldn't have to
# define it explicitly

# JOB_ID holds the job number; we keep the output in a folder named with it

COPUT="ANYINDEX${JOB_ID}"  # change ANYINDEX to anything of your choice

STDATA="/tmp/${USER}/${COPUT}"  # STDATA - standard output folder in /tmp

####$HOME is another automatic global variable

MYAPP="${HOME}/test/my_executable_bin"  # Path to executable
FOUTPUTD="${HOME}/test/output"          # final output folder (global)

# 'mkdir -p' creates all the necessary directories down to the final one
# specified and does not complain if it already exists

mkdir -p ${STDATA}  # creates dir on the local compute node /tmp/user
mkdir -p ${FOUTPUTD} # creates the dir in your $HOME - final output

cd ${STDATA}

# since this job may run on any node, print which node it runs on
hostname
pwd

${MYAPP} > output.${JOB_ID}.txt   # keep the output in a file named with the JOB_ID

# Once done, move the outputs out of the /tmp directory

cd ../
cp -r ${COPUT} ${FOUTPUTD}/.

# and clean up your mess on /tmp
rm -rf ${COPUT}  # clean up /tmp
-------------------------------------------------------

In this example all output will be stored in '$HOME/test/output', and this is
expected to be small. If the output is large, you must redirect it to the
NFS data-space instead.
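As a minimal sketch, assuming you have a personal directory under the network-shared '/data' filesystem mentioned above (the exact path is site-specific), the final-copy step of the script would become:

---------------------------------------------
FOUTPUTD="/data/${USER}/test/output"  # hypothetical path on the NFS data-space
mkdir -p ${FOUTPUTD}
cp -r ${COPUT} ${FOUTPUTD}/.
---------------------------------------------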

Similarly, you can write scripts for parallel jobs.

== Parallel Jobs

We have already set up OpenMPI and integrated it with SGE. This integration allows Sun Grid Engine
to handle assigning hosts to parallel jobs and to properly account for parallel jobs.

OpenMPI Parallel Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
StarCluster by default sets up a parallel environment, called 'orte', that has been
configured for OpenMPI integration within SGE and has a number of slots equal to the
total number of processors in the cluster. You can inspect the SGE parallel environment
by running:

-------------------------------------------
[sjena@usernode ~]$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
[sjena@usernode ~]$
-------------------------------------------
NOTE: at this stage we don't support the 'round_robin' allocation_rule. This is important for those who are using it.
The configuration shown above is the default. With the '$fill_up' allocation, if a user requests 8 slots and a single
machine has 8 slots available, that job will run entirely on one machine. If 5 slots are available on one host and 3
on another, the job will take all 5 on the first host and all 3 on the other.

Submitting OpenMPI Jobs
~~~~~~~~~~~~~~~~~~~~~~~

The general workflow for running MPI code is: compile the code using MPI compiler wrappers such as 'mpicc'. The resulting executable can be used in the SGE parallel environment.
It is important that the path to the executable is identical on all nodes for mpirun to correctly launch your parallel code. The easiest approach is to copy the executable somewhere under '/home/user' on the usernode, since '/home/user' is NFS-shared across all nodes in the cluster.
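For example, a sketch of compiling an MPI source file into the shared home area ('mpi-ring.c' is a placeholder name):

---------------------------------------------
$ mpicc -O2 -o ${HOME}/mpi/mpi-ring mpi-ring.c
---------------------------------------------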

Run the code on X machines using:
----------------------------------------------------------
    $ mpirun -np X -hostfile myhostfile ./mpi-executable arg1 arg2 [...]
----------------------------------------------------------
where the hostfile looks something like:
------------------------
$ cat /path/to/hostfile
compute-0-0    slots=4
compute-0-1    slots=4
compute-0-11   slots=4
compute-0-12   slots=4
compute-0-13   slots=4
------------------------

However, when using an SGE parallel environment with OpenMPI you no longer have to specify the -np, -hostfile, -host, etc. options to mpirun.
This is because SGE will automatically assign hosts and processors to be used by OpenMPI for your job. You also do not need to pass the -byslot and -bynode options to mpirun given that these mechanisms are now handled by the fill_up and round_robin modes specified in the SGE parallel environment.

Instead of the above formulation, create a simple job script that contains a very simplified mpirun call:
----------------------
$ cat myjobscript.sh
mpirun /path/to/mpi-executable arg1 arg2 [...]
----------------------
Then submit the job using the qsub command and the orte parallel environment automatically configured for you by StarCluster:

----------------------
$ qsub -pe orte 24 ./myjobscript.sh
----------------------
The -pe option specifies which parallel environment to use and how many slots to request. The above example requests 24 slots (or processors) using the orte parallel environment. The parallel environment automatically takes care of distributing the MPI job amongst the SGE nodes using the allocation_rule defined in the environment’s settings.

NOTE: If you believe I have made some mistake, or the new implementation needs more
clarification, let me know.

== Installation of New Software/Packages

Several open-source software packages are installed centrally, and the list is
[available](https://dsjena.github.io/HPCSchool/packages.html).

Alternatively, a 'user' can install applications/software into their own area. Once a user
compiles and installs, the binary will be automatically available on every compute node.
For software of large size (>2 GB), the 'user' may request 'sysad' to install the
application/software centrally (see note 2).

NOTE: (1): There might be a situation where a 'user' wants to install a package purchased
by him/her: install it in your own area. Please note that it will be the responsibility of
the respective 'user' to take care of the licensing and authorizations.

NOTE: (2): Central installation (open-source package): a 'user' may request that it be
installed centrally. In this case, the 'user' should provide us the details of the software
(sources, web links, etc.), its dependencies, pre-requisites and compilation procedure.
On some occasions, 'sysad' may ask the concerned user to assist with the installation.

NOTE: (3): Central installation (licenced package): The licence should be of the
'cluster' or 'group-licence' type. The 'user' should provide the info as above and
describe the licencing process (we may help to secure the licence).


== Author's Note

This document is very basic and under continuous change depending on the requests we receive.
It is being written by collecting information available with me and from several sources on
the internet. So, do not hesitate to contact me.

Document version: V1.1.1 12/06/2017 +
Document version: V1.1.0 23/03/2017 +
Document version: V1.0.2 16/03/2017 +
Document version: V1.0.1 04/03/2017 +
