Depending on your computing environment, the analysis steps can be run for many samples or loci in parallel using either GNU parallel (on personal computers or computing servers) or BSUB (on large high-performance computing clusters). The first solution uses the GNU `parallel` command to process multiple samples/loci in parallel; the second uses the `bsub` command to submit multiple jobs at once.
The GNU parallel scripts take their arguments via command-line options, which are explained in the first section of each script (e.g. execute `somescript.sh -s samples.txt -t 5` to apply an analysis step to all samples specified in `samples.txt` using 5 threads). For reproducibility, automatically created log files document which arguments were passed to each script.
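The option handling and argument logging described above can be sketched as follows. This is a minimal illustration only, not code from the CaptureAl repository: the function name `parse_and_log`, the log file name `somescript.log` and the `LOGFILE` variable are hypothetical, and only the `-s` and `-t` options from the example call are assumed.

```shell
#!/bin/bash
# Hypothetical sketch, not an actual CaptureAl script:
# parse -s (sample file) and -t (threads), then record both in a log file.

parse_and_log() {
    local OPTIND=1 opt samples="" threads=1
    local logfile="${LOGFILE:-somescript.log}"   # hypothetical log file name

    while getopts "s:t:" opt; do
        case "$opt" in
            s) samples="$OPTARG" ;;
            t) threads="$OPTARG" ;;
            *) echo "Usage: somescript.sh -s <samples.txt> -t <threads>" >&2
               return 1 ;;
        esac
    done

    # The sample file is mandatory in this sketch.
    if [ -z "$samples" ]; then
        echo "ERROR: option -s is required" >&2
        return 1
    fi

    # Document the arguments that were passed, for reproducibility.
    {
        echo "sample file: $samples"
        echo "threads: $threads"
    } > "$logfile"

    # A real script would now dispatch the per-sample work, for example:
    #   parallel -j "$threads" some_step {} :::: "$samples"
    # (some_step is a placeholder for the actual analysis command).
    echo "$samples $threads"
}
```

Calling `parse_and_log -s samples.txt -t 5` writes the two arguments to the log file and prints `samples.txt 5`.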
The BSUB scripts have all their arguments set inside the script (section `# arguments`) and therefore serve as scripts and log files at the same time. Where appropriate, log files are still created automatically, which ensures reproducibility. The section `## Resource usage` sets the number of computing nodes (threads), the memory (in MB) and the time (in hours) requested for each submitted job. These values need to be set according to the amount of data analyzed and should be as close as possible to the resources actually used. It is good practice to test the resources needed per job on a subset of e.g. 5 jobs before starting a big sequence of jobs. This can be done by inspecting the `lsf.o${job_ID}` file, e.g. with `get.lsf.summary.sh lsf.o${job_ID}`, which displays a summary of average and maximum memory usage and computation time. This helps to set memory and time requirements for a big sequence of jobs and prevents submitted jobs from being allocated too many resources, which wastes shared computing power at the expense of other users, or too few resources, which leads to premature job termination (automatic killing) with no results.
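As a rough illustration of such a resource section, the sketch below shows what a small test submission of 5 jobs might look like. It is not taken from the CaptureAl repository: the job name, the resource values, the use of an LSF job array and the helper `pick_sample` are all assumptions, and the unit of the memory request depends on how the cluster is configured.

```shell
#!/bin/bash
# Hypothetical sketch of a BSUB script; all values are placeholders,
# not CaptureAl defaults.

## Resource usage
# Job array with 5 test jobs (assumes jobs are submitted as an LSF job array).
#BSUB -J "test[1-5]"
# Number of threads per job.
#BSUB -n 1
# Memory per job (MB assumed here; the unit is cluster-dependent).
#BSUB -R "rusage[mem=1000]"
# Run time limit in hours:minutes.
#BSUB -W 4:00

# arguments
samples="${samples:-samples.txt}"

# Return the sample name on line $1 of the sample file $2.
pick_sample() {
    sed -n "${1}p" "$2"
}

# LSF sets LSB_JOBINDEX for each element of a job array; outside of LSF
# the variable is unset and this block is skipped.
if [ -n "${LSB_JOBINDEX:-}" ] && [ -f "$samples" ]; then
    sample="$(pick_sample "$LSB_JOBINDEX" "$samples")"
    echo "job ${LSB_JOBINDEX}: processing ${sample}"
fi
```

Such a script would typically be submitted with `bsub < script.sh`. Once the test jobs have finished, the observed maximum memory and run time can be used to adjust the `#BSUB` values before the full set of jobs is submitted.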
This tutorial is based on the GNU parallel submission scripts, but analogous scripts for both solutions are available in the CaptureAl repository. Extending this tutorial to a full example using both solutions is work in progress.