qsub
Directives are job-specific requirements given to the job scheduler.
The most important directives are those that request resources. The most common are the wallclock time limit (the maximum time the job is allowed to run) and the number of processors required to run the job. For example, to run an MPI job with 16 processes for up to 100 hours on a cluster with 8 cores per compute node, the PBS directives are:
#PBS -l walltime=100:00:00
#PBS -l select=2:mpiprocs=8
A job submitted with these requests runs for 100 hours at most; once this limit expires, the job is terminated whether or not the processing has finished. The wallclock time should therefore be set conservatively, so that the job can finish (and terminate) normally before the limit is reached.
Also, the job is allocated two compute nodes (select=2) and each node is scheduled to run 8 MPI processes (mpiprocs=8). It is the task of the user to instruct mpirun to use this allocation appropriately, i.e. to start 16 processes which are mapped to the 16 cores available for the job. More information on how to run MPI applications can be found in this guide.
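As a rough illustration of how the directives and the mpirun call fit together, the sketch below writes a minimal submit.sh. The executable ./my_mpi_app is hypothetical, the cd to $PBS_O_WORKDIR assumes the job should run in the directory it was submitted from, and the exact mpirun options depend on the MPI library installed on the cluster.

```shell
#!/bin/sh
# Resource request from the text: 2 nodes, 8 MPI processes on each.
nodes=2
procs_per_node=8
total_procs=$((nodes * procs_per_node))   # 16 processes in total

# Write a minimal submission script. ./my_mpi_app is a hypothetical
# executable; adjust the mpirun options to the MPI library in use.
cat > submit.sh <<EOF
#!/bin/bash
#PBS -l walltime=100:00:00
#PBS -l select=${nodes}:mpiprocs=${procs_per_node}

# Assumption: run in the directory the job was submitted from.
cd "\$PBS_O_WORKDIR"

mpirun -np ${total_procs} ./my_mpi_app
EOF

echo "submit.sh requests ${total_procs} MPI processes"
```

Keeping the node and process counts in variables makes it harder for the select directive and the mpirun process count to drift apart.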
Supposing you already have a PBS submission script ready (call it submit.sh), the job is submitted to the execution queue with the command qsub submit.sh. The queueing system prints the job id almost immediately and returns control to the Linux prompt. At this point the job is already in the submission queue.
Once you have submitted the job it will sit in a pending queue for some time (how long depends on the demands of your job and the demand on the service). You can monitor the progress of the job using the command qstat.
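The submit-then-monitor sequence can be sketched as below. It assumes the job id printed by qsub has the form number.server (as in 1234.hal), and it only contacts the scheduler when qsub is installed and a submit.sh exists; job_number is a helper name made up for this example.

```shell
#!/bin/sh
# Strip the server suffix from a PBS job id, e.g. "1234.hal" -> "1234".
job_number() {
    echo "${1%%.*}"
}

# Only talk to the scheduler when the PBS commands are actually installed.
if command -v qsub >/dev/null 2>&1 && [ -f submit.sh ]; then
    job_id=$(qsub submit.sh)      # qsub prints the job id on standard output
    echo "submitted job $(job_number "$job_id")"
    qstat "$job_id"               # one-off status query for this job
fi
```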
Once the job has run you will see files with names like "job.e1234" and "job.o1234", either in your home directory or in the directory you submitted the job from (depending on how your job submission script is written). The ".o" files contain "standard output", which is essentially what the application you ran would normally have printed onto the screen. The ".e" files contain any error messages issued by the application; on a correct execution without errors, this file can be empty.
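A small sketch of checking the result files follows. It assumes the default naming shown above (job name, then ".o" or ".e", then the numeric part of the job id); stdout_file and stderr_file are helper names invented for this example.

```shell
#!/bin/sh
# Build the default output and error file names for a job.
# $1 = job name, $2 = job id as printed by qsub (e.g. "1234.hal")
stdout_file() { echo "$1.o${2%%.*}"; }
stderr_file() { echo "$1.e${2%%.*}"; }

out=$(stdout_file job 1234.hal)
err=$(stderr_file job 1234.hal)
echo "$out $err"

# -s is true only for a file that exists and is not empty.
if [ -s "$err" ]; then
    echo "errors were reported, see $err"
fi
```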
Read about all the options for qsub in the Linux manual using the command man qsub.
qstat is the main command for monitoring the state of systems, groups of jobs or individual jobs. The simple qstat command gives a list of jobs which looks something like this:
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1121.hal         jobName1         bob               0:45:05 R devel
1152.hal         jobName2         mary             12:40:56 R long
1226.hal         jobName3         steve                   0 Q devel
The first column gives the job ID, the second the name of the job (specified by the user in the submission script) and the third the owner of the job. The fourth column gives the elapsed time for each particular job. The fifth column is the status of the job (R=running, Q=queued, E=exiting, H=held, S=suspended). The last column is the queue for the job (a job scheduler can manage different queues serving different purposes).
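Because the status is always the fifth whitespace-separated field, the listing is easy to summarise with awk. The sketch below counts jobs per state using the sample listing as input; on a real system the output of qstat would be piped in instead. count_state is a name chosen for this example.

```shell
#!/bin/sh
# Sample listing from the text; in practice this would come from qstat.
sample='Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1121.hal jobName1 bob 0:45:05 R devel
1152.hal jobName2 mary 12:40:56 R long
1226.hal jobName3 steve 0 Q devel'

# Count the jobs in a given state (field 5), skipping the two header lines.
count_state() {
    awk -v state="$1" 'NR > 2 && $5 == state { n++ } END { print n + 0 }'
}

running=$(printf '%s\n' "$sample" | count_state R)
queued=$(printf '%s\n' "$sample" | count_state Q)
echo "running=$running queued=$queued"
```

With a live scheduler the same function would be used as qstat | count_state R.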
Some other useful qstat features include:
- -u for showing the status of all the jobs of a particular user, e.g. qstat -u bob for user bob;
- -p for showing time as percentage of the wallclock requested in the submission script;
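- -i for showing, in the alternative format, the jobs that are not running (queued, held or waiting); the status of a particular job is obtained by passing its id, e.g. qstat 1121 for the job with the id 1121.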
Read about all the options for qstat in the Linux manual using the command man qstat.
You can submit jobs without a submit script, but it can be more cumbersome.
Submit a simple job to the general routing queue (route):
qsub -q route -- /bin/sleep 10
Submit a job with the following parameters:
- Name: script-name
- Queue: route
- Walltime (runtime): 5 minutes
- cpus: 1
- error directory: /home/user/err
- output directory: /home/user/out
qsub -N script-name -V -q route -l walltime=00:05:00 -l select=1:ncpus=1 -e /home/user/err/ -o /home/user/out/ -- /bin/sleep 10
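The walltime value is written as HH:MM:SS, which is easy to get wrong when typing it by hand. As a small sketch, the hypothetical helper below converts a whole number of minutes into that format so it can be dropped into the -l walltime= option.

```shell
#!/bin/sh
# Convert a whole number of minutes to the HH:MM:SS walltime format.
walltime_from_minutes() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

walltime=$(walltime_from_minutes 5)
echo "-l walltime=$walltime"
```

For the 100-hour limit used earlier, walltime_from_minutes 6000 gives 100:00:00.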