---
title: "Practical guide to running R on HPC"
author: "Elzbieta Lauzikaite"
date: "5/3/2018"
output:
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    number_sections: yes
    toc: yes
    toc_depth: 2
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
eval = TRUE,
message = FALSE,
warning = FALSE,
fig.retina = TRUE,
fig.pos = "H",
tidy.opts=list(width.cutoff=60))
# tidy=TRUE
# tm <- knitr::knit_theme$get("anotherdark")
# knitr::knit_theme$set(tm)
library(kableExtra)
```
# Imperial HPC systems
Imperial has three cluster systems, designed for different kinds of computations. A computer cluster is a system of loosely or tightly connected independent computers (**nodes**). These nodes are managed by a **batch system** (i.e. a non-interactive processing mode), which monitors the available resources (**NCPUS** and **memory**) and allocates users' tasks (**jobs**) to available nodes. If none are available, jobs are put on hold in a **queue**, together with the jobs of other users, until sufficient resources are free.
Users should select the cluster system appropriate for their planned work.
```{r, echo=FALSE, out.width = "0.8\\textwidth"}
# fig.cap="\\label{fig:figs}Imperial HPC systems."
knitr::include_graphics(path = "HPC-systems.png")
```
```{r hpc_systems, echo=FALSE, results = 'asis'}
df <- read.csv("hpc-systems.csv")
colnames(df) <- gsub("\\.", " ", colnames(df))
# cap <- paste("Imperial HPC systems.")
knitr::kable(x = df, align = rep('l', 6), format = "latex", longtable = T, booktabs = T) %>%
kable_styling(latex_options = c("repeat_header", "striped", "hover"), font_size = 8, position = "left") %>%
column_spec(c(1,2,3), width = "18em")
```
## CX1
```{r cx1_limits, echo=FALSE, results = 'asis'}
df <- read.csv("cx1-joblim.csv")
colnames(df) <- gsub("\\.", " ", colnames(df))
cap <- paste("Jobs allowed on CX1 system.")
knitr::kable(x = df, align = rep('l', 6), format = "latex", caption = cap, longtable = T, booktabs = T) %>%
kable_styling(latex_options = c("repeat_header", "striped", "hover"), font_size = 8, position = "left")
```
## AX4
```{r ax4_limits, echo=FALSE, results = 'asis'}
df <- read.csv("ax4-joblim.csv")
colnames(df) <- gsub("\\.", " ", colnames(df))
cap <- paste("Jobs allowed on AX4 system.")
knitr::kable(x = df, align = rep('l', 6), format = "latex", caption = cap, longtable = T, booktabs = T) %>%
kable_styling(latex_options = c("repeat_header", "striped", "hover"), font_size = 8, position = "left")
```
\newpage
***
# Set up
## User account
To access any of the HPC systems, an account must first be created. A request must be made by your group leader through the **[\textcolor{blue}{self-service portal}](https://api.rcs.imperial.ac.uk/registration/hpc)**. This gives you access to CX1 free of charge; for access to AX4/CX2, contact the HPC/RCS support team.
## Log in
All HPC systems can be accessed with the ssh command from a terminal (Linux/Mac) while connected to the Imperial VPN, for example (the hostname below is a placeholder for the login address of your chosen system):
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
ssh username@<hpc-login-address>
```
Windows users can use the [\textcolor{blue}{PuTTY}](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html) software.
After login, every user is connected to the **login node**, which can accommodate a large number of users but does not support heavy calculations. Running applications directly on this node can therefore crash the whole system, or at least terminate your connection and your work. Use this node only for light work, such as data management, script preparation and job submission.
## File management
The decision on data storage should be made as early as possible, since setting it up can significantly delay the start of your work.
### Local HPC storage
For smaller data sets to be processed on CX1, the general advice is to use local HPC storage. The allocated free-of-charge space is usually sufficient and can be extended further through the self-service portal. However, note that these \textcolor{red}{file systems are not appropriate for the storage of sensitive data}.
```{r local_storage, echo=FALSE, results = 'asis', message=FALSE}
df <- read.csv("local_storage.csv", header = T)
colnames(df) <- gsub("\\.", " ", colnames(df))
cap <- paste("Storage available to every user.")
knitr::kable(x = df, align = rep('l', 6), format = "latex", caption = cap, longtable = F, booktabs = T) %>%
kable_styling(latex_options = c("repeat_header", "striped", "hover"), font_size = 8, position = "left") %>%
row_spec(c(3,6,9), hline_after = T)
```
Files can be copied over to the HPC storage by:
1. The secure copy command *scp* (the destination address below is a placeholder for the HPC login node):
```{bash file_management_script, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
scp /files.tgz username@<hpc-login-address>:/home/username/file_location
```
2. FileZilla (Mac and Windows). FileZilla is also handy for managing your scripts and job output on the HPC.
### Remote storage
There are a number of options for remote data storage, though the default BOX and H: drive are not appropriate for sensitive data either.
```{r remote_storage, echo=FALSE, results = 'asis', message=FALSE}
df <- read.csv("remote_storage.csv", header = T)
colnames(df) <- gsub("\\.", " ", colnames(df))
cap <- paste("External storage options.")
knitr::kable(x = df, align = rep('l', 6), format = "latex", caption = cap, longtable = F, booktabs = T) %>%
kable_styling(latex_options = c("repeat_header", "striped", "hover"), font_size = 8, position = "left") %>%
row_spec(c(4,8), hline_after = T)
```
\newpage
***
# Preparing working environment
A large number of applications are already available on the systems. They are centrally installed and accessible to every user through the *module* command. Check whether your application is available with the *module avail* command on the login node.
For some applications and libraries, additional modules need to be loaded. Read more about modules on [\textcolor{blue}{Imperial website}](http://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/support/applications/).
To run R with XCMS, the following modules need to be loaded in your session:
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
module load intel-suite libxml2 hdf5/1.8.14-serial netcdf/4.4.1 R/3.4.0
module load boost
```
To load a centrally installed R library, prepend the path to its location to the library search path:
```{r,results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
.libPaths(c("/apps/R/3.4.0/lib64/R/library", .libPaths()))
library(xcms)
```
While most popular R packages are already available on CX1, AX4 has fewer of them. You may need to install the missing ones yourself, locally in your HOME directory, while running R interactively on the login node. Make sure that **all supporting modules are loaded into your environment** before you start R. If a package installs successfully, you load it by adding the path to your HOME library, as shown below. Some packages are hard to install locally because they require additional Linux libraries; in that case, submit a request to the [\textcolor{blue}{HPC help desk}](http://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/support/help/).
```{r,results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
.libPaths(c("/home/username/R/x86_64-pc-linux-gnu-library/3.4", .libPaths()))
library(xcms)
```
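As a rough sketch of the local installation step mentioned above (the library path and package name are illustrative only; adjust the R version and username to your own):
```{r, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
## Create a personal library under HOME and install a missing package into it.
## The path and "data.table" are placeholders.
local_lib <- "/home/username/R/x86_64-pc-linux-gnu-library/3.4"
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(local_lib, .libPaths()))   # so that dependencies are installed there too
install.packages("data.table", lib = local_lib)
```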
\newpage
***
# Submitting jobs
Management of users' jobs is done by the PBS Pro (Portable Batch System Professional) queue system. The **PBS scheduler** allocates every job to nodes/cores according to the amount of resources requested by the user. The scheduler starts the job when sufficient resources are available, runs it and returns the output to the user.
Jobs are submitted via a **PBS script**, which is a bash script. The script must specify the processing time (*walltime*), the required memory (*mem*) and the number of nodes and cores (*select*, *ncpus*). If the requested resources are exceeded while the job runs, the job is terminated. Currently, only the walltime of a running job can be extended, via the self-service portal.
A PBS script *run.pbs*:
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
#!/bin/sh
#PBS -N job_name
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=1:mem=50gb
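# module loads and the commands to run follow the PBS directives above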
```
Submitting *run.pbs* via terminal:
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
qsub run.pbs
```
If the job was submitted successfully, a job ID is returned. Monitor submitted jobs with:
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
qstat
qstat -u username # all user's jobs
qstat -f job_id # returns more details on a single job
```
Once a job has completed, it disappears from the *qstat* list. You can also delete jobs that have not yet completed, at any stage:
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
qdel job_id
```
## Resource estimation
Before performing computations at full scale, it is advisable to monitor resource usage on a small scale, e.g. apply your code to a single datafile at a time, or run the code in serial mode to obtain an estimate for a single **worker** before initiating a job with multiple parallel workers.
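For a serial test run inside R, a rough sketch of measuring elapsed time and peak memory (the file name and the xcmsSet() call are placeholders for your own workload; gc() reports R's own memory high-water mark, not the full process footprint):
```{r, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
## Run the workload on a single file and inspect time and memory usage.
library(xcms)
gc(reset = TRUE)                                   # reset R's memory counters
timing <- system.time(xset <- xcmsSet(files = "one_file.mzML"))
print(timing)                                      # elapsed time informs the walltime request
print(gc())                                        # "max used" columns inform the mem request
```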
To observe memory usage, you can set a parameter in the PBS script to send an e-mail if the job is aborted by the batch system (a), when execution begins (b) or when it ends (e):
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
#!/bin/sh
#PBS -m abe
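# optional: also set the delivery address (standard PBS -M flag; address below is a placeholder)
#PBS -M username@imperial.ac.uk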
```
For parallel code on CX1, more detailed memory-usage monitoring can be performed with the *memusage* module. The application saves its output in the current directory. The generated output files can be plotted using the *gnuplot* module; more details can be found on the [\textcolor{blue}{Imperial HPC Wiki page}](https://wiki.imperial.ac.uk/display/HPC/Memory+usage+monitor+for+cx1):
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
#!/bin/sh
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=1:mem=1gb
cd $PBS_O_WORKDIR
/apps/memusage/memusage R
```
## Submitting R jobs
The best approach for running R scripts in batch mode is **Rscript**. Rscript logs all output and accepts command-line arguments, both of which are essential when building an automated PBS-based pipeline. Variables, such as datafiles or function parameters, can be supplied as arguments while re-using the same R code. This makes it easy to perform optimisation tasks or to run the same workflow on different datafiles.
A PBS script that starts a serial R job with one argument, while capturing errors and details of the process in an output Rout file:
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
#!/bin/sh
#PBS -N job_name
#PBS -l walltime=01:00:00
#PBS -l select=1:ncpus=1:mem=50gb
module load intel-suite libxml2 hdf5/1.8.14-serial netcdf/4.4.1 R/3.4.0
module load boost
Rscript --no-save --no-restore --verbose your_code.R "args" > "outfile.Rout" 2>&1
```
The corresponding *your_code.R* script:
```{r, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
args = commandArgs(trailingOnly=TRUE)
argument <- paste0(args[1])
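cat("received argument:", argument, "\n")   # anything printed here ends up in the .Rout log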
```
\newpage
***
# Running R for XCMS on CX1
The most computationally demanding step of LC-MS data pre-processing with XCMS is the initial peak-picking, which requires loading the raw datafiles into memory. Even though the peak-picking algorithm is normally applied to all files at once, it is an "embarrassingly parallel" task, since each datafile is processed separately before the final, single R object is assembled.
Large LC-MS datasets can be processed on CX1 using **array jobs**. An array job comprises multiple independently running R sessions which, for the sake of simplicity, are initiated using a single PBS script. In each R session, the same R code is used with a different input datafile.
## PBS-based pipeline for array XCMS job
1. Peak-picking of individual data files in independent R sessions (1st PBS script *run-xcms-1.pbs* initiates an array job)
2. Merging generated XCMS objects into a single one and running the remaining XCMS steps in a single R session (2nd PBS script *run-xcms-2.pbs* initiates a serial job)
The *run-xcms-1.pbs* script initiates 10 copies of the same job. These subjobs run independently and are identical except for the value of the environment variable *PBS_ARRAY_INDEX*, which ranges from 1 to 10. Each copy is allocated 1 core and 16 GB of memory for 5 hours. *PBS_ARRAY_INDEX* is passed as an Rscript argument and selects a single file from the list of 10. Another environment variable, *PBS_JOBNAME*, holds the job name you supplied and is used to create a directory for the job's output. The variable *PBS_JOBID* holds the unique job ID assigned by the PBS scheduler once the job is submitted to the queue. The job ID is used here to write a uniquely named Rout file, for easier comparison with other runs.
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
#!/bin/sh
#PBS -N xcms_v1
#PBS -l walltime=05:00:00
#PBS -l select=1:ncpus=1:mem=16gb
#PBS -J 1-10
dir="/work/username/scripts_dir"
file_list="/work/username/filelist.txt"
out_dir="/work/username/output_dir/$PBS_JOBNAME"
if [ ! -d "$out_dir" ]; then
mkdir "$out_dir"
fi
cd "$dir"
module load intel-suite libxml2 hdf5/1.8.14-serial netcdf/4.4.1 R/3.4.0
module load boost
Rscript --no-save --no-restore --verbose xcms-1.R "$file_list" "$PBS_ARRAY_INDEX" \
  "$out_dir" > "$out_dir/$PBS_JOBID.Rout" 2>&1
```
The corresponding *xcms-1.R* script peak-picks a single datafile and saves the resulting xcmsSet object to an RDS file with a unique name in the output directory:
```{r, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
args = commandArgs(trailingOnly=TRUE)
.libPaths("/apps/R/3.4.0/lib64/R/library")
library(xcms)
###--- Filelist
file_list <- paste0(args[1])
files <- read.table(file = file_list, stringsAsFactors = FALSE, header = F, sep = "\t")
###--- Datafile to process
f <- as.numeric(paste0(args[2]))
file <- files[f,1]
###--- Output dir
output_dir<- paste0(args[3])
###--- Peak picking ----
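## "..." below is a placeholder for your chosen peak-picking parameters
## (e.g. method = "centWave", ppm, peakwidth); set them explicitly in a real run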
xset <- xcmsSet(files = file, ...)
saveRDS(xset, file = paste0(output_dir,"/xcms-1-", f, ".rds"))
```
The *run-xcms-2.pbs* script initiates a serial job for the remaining XCMS steps. It specifies the input directory where the first job's objects were generated, an output directory, and the number of BPPARAM workers for the fillPeaks() function (as many workers as requested ncpus):
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
#!/bin/sh
#PBS -N xcms_v1
#PBS -l walltime=10:00:00
#PBS -l select=1:ncpus=24:mem=50gb
input_dir="/work/username/output_dir/$PBS_JOBNAME"
out_dir="$input_dir"
if [ ! -d "$input_dir" ]; then
echo "no input_dir, exit the job"
exit
fi
module load intel-suite libxml2 hdf5/1.8.14-serial netcdf/4.4.1 R/3.4.0
module load boost
Rscript --no-save --no-restore --verbose gset.R "$input_dir" "$out_dir" "$NCPUS" \
  > "$out_dir/$PBS_JOBID.Rout" 2>&1
```
The *gset.R* script (note that graphical display is not available in non-interactive jobs, so plotting must be disabled):
```{r, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
args = commandArgs(trailingOnly=TRUE)
.libPaths("/apps/R/3.4.0/lib64/R/library")
library(xcms)
###--- Input dir
input_dir <- paste0(args[1])
###--- Output dir
output_dir<- paste0(args[2])
###--- BPPARAM workers
bw <- as.numeric(paste0(args[3]))
###--- Load xcmsSet objects into one
input <- list.files(input_dir, pattern = ".rds", full.names = T)
input_l <- lapply(input, readRDS)
xset <- input_l[[1]]
for(i in 2:length(input_l)) {
set <- input_l[[i]]
xset <- c(xset, set)
}
###--- Grouping ----
gset <- group(xset, ...)
save(list = c("gset"), file = paste0(output_dir,"/xcms-2.RData"))
###--- RT correction ----
rset <- retcor.peakgroups(gset, plottype = "none", ...)
save(list = c("rset"), file = paste0(output_dir,"/xcms-3.RData"))
###--- Grouping no2 ----
grset <- group(rset, ...)
save(list = c("grset"), file = paste0(output_dir,"/xcms-4.RData"))
###--- Filling peaks ----
fset <- fillPeaks(grset,
method="chrom",
BPPARAM = MulticoreParam(workers = bw))
save(list = c("fset"), file = paste0(output_dir,"/xcms-5.RData"))
```
It is possible to start a job on the condition that another one completes successfully beforehand, e.g. where the input of one job is generated by the previous job in a pipeline. The job dependency is declared at submission time with the -W flag of qsub:
```{bash, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
XSET_JOB_ID=`qsub run-xcms-1.pbs`
qsub -W depend=afterok:$XSET_JOB_ID run-xcms-2.pbs
```
\newpage
***
# Running R for XCMS on AX4
If you have been given access to AX4, it may offer an easier approach to running XCMS than CX1, since a large amount of memory can be requested for a single R session. XCMS functions are then applied to all datafiles at the same time, using multicore parallelisation supported by the BiocParallel library.
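A minimal sketch of such a single-session run (paths and the file pattern are placeholders; this assumes xcmsSet() accepts a BPPARAM argument, as fillPeaks() does in the CX1 example above):
```{r, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
## Process all datafiles in one R session with multicore workers.
## The worker count is taken from the PBS-provided NCPUS variable (falls back to 4).
.libPaths("/apps/R/3.4.0/lib64/R/library")
library(xcms)
library(BiocParallel)
files   <- list.files("/work/username/data", pattern = "\\.mzML$", full.names = TRUE)
workers <- as.numeric(Sys.getenv("NCPUS", unset = "4"))
xset    <- xcmsSet(files = files, BPPARAM = MulticoreParam(workers = workers))
```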
## PBS script with check-points
The PBS scheduler on AX4 is prone to crashes that interrupt the run, after which the job is re-initiated. There is no way to save the output of a session that was terminated, but it is possible to re-initiate the code from the last check-point rather than from the very start.
The check-points can be the R objects generated by individual XCMS functions throughout the workflow. To find the last check-point, the PBS bash script looks for specifically named R objects in the output directory and initiates only one of the provided R scripts, depending on which R objects have already been generated (a sketch of one such check-point script is given after the PBS script below). The different R scripts **start** the XCMS workflow at different stages, and each script saves one RData object:
1. *xcms-1.R* starts with xcmsSet()
2. *xcms-2.R* starts with group()
3. *xcms-3.R* starts with retcor()
4. *xcms-4.R* starts with second group()
5. *xcms-5.R* starts with fillPeaks()
```{r, engine = "bash", results = 'asis', message = TRUE, eval=FALSE, echo=TRUE, tidy=TRUE}
#!/bin/sh
#PBS -N xcms_v1
#PBS -l walltime=10:00:00
#PBS -l select=1:ncpus=40:mem=200gb
file_list="/work/username/filelist.txt"
out_dir="/work/username/out_dir/$PBS_JOBNAME"
check_1="$out_dir/xcms-1.RData"
check_2="$out_dir/xcms-2.RData"
check_3="$out_dir/xcms-3.RData"
check_4="$out_dir/xcms-4.RData"
check_5="$out_dir/xcms-5.RData"
if [ ! -d "$out_dir" ]; then
mkdir "$out_dir"
fi
module load intel-suite libxml2 hdf5/1.8.14-serial netcdf/4.4.1 R/3.4.0
module load boost
if [ -e "$check_5" ]; then
echo "run is over"
else
if [ -e "$check_4" ]; then
Rscript --no-save --no-restore --verbose xcms-5.R "$NCPUS" "$file_list" "$out_dir"
"$check_4" > "$out_dir/$PBS_JOBNAME.Rout" 2>&1
else
if [ -e "$check_3" ]; then
Rscript --no-save --no-restore --verbose xcms-4.R "$NCPUS"
"$file_list" "$out_dir" "$check_3" > "$out_dir/$PBS_JOBNAME.Rout" 2>&1
else
if [ -e "$check_2" ]; then
Rscript --no-save --no-restore --verbose xcms-3.R "$NCPUS" "$file_list"
"$out_dir" "$check_2" > "$out_dir/$PBS_JOBNAME.Rout" 2>&1
else
if [ -e "$check_1" ]; then
Rscript --no-save --no-restore --verbose xcms-2.R "$NCPUS"
"$file_list" "$out_dir" "$check_1" > "$out_dir/$PBS_JOBNAME.Rout" 2>&1
else
Rscript --no-save --no-restore --verbose xcms-1.R "$NCPUS"
"$file_list" "$out_dir" > "$out_dir/$PBS_JOBNAME.Rout" 2>&1
fi
fi
fi
fi
fi
```
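For illustration, one of the intermediate check-point scripts (*xcms-2.R*) might look roughly as follows. This is a sketch only: the argument order mirrors the PBS script above, while the object name ("xset"), the grouping parameters and the paths are assumptions, following the *gset.R* example from the CX1 section:
```{r, results = 'asis', message = TRUE, eval=FALSE, echo=TRUE}
## Sketch of xcms-2.R: resume from the xcmsSet saved by xcms-1.R and run grouping.
args <- commandArgs(trailingOnly = TRUE)
.libPaths("/apps/R/3.4.0/lib64/R/library")
library(xcms)
bw         <- as.numeric(args[1])   # NCPUS passed by the PBS script
file_list  <- args[2]               # kept for a uniform interface across the scripts
output_dir <- args[3]
checkpoint <- args[4]               # path to xcms-1.RData
load(checkpoint)                    # restores the peak-picked xcmsSet object (assumed "xset")
gset <- group(xset)                 # add your grouping parameters here
save(list = "gset", file = paste0(output_dir, "/xcms-2.RData"))
```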
\newpage
***
# Useful links
A lot of information provided here was sourced from:
1. [\textcolor{blue}{Imperial HPC Wiki page}](https://wiki.imperial.ac.uk/display/HPC/Getting+started)
2. [\textcolor{blue}{Imperial HPC course slides}](https://wiki.imperial.ac.uk/display/HPC/Courses)
3. [\textcolor{blue}{Imperial website}](http://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/computing/high-throughput-computing)
4. [\textcolor{blue}{Introduction to using R on HPC}](http://www.glennklockwood.com/data-intensive/r/on-hpc.html)
5. [\textcolor{blue}{PBS scheduler environment variables}](http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/commands/qsub.htm)