-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy path04-doing-work-with-dx-run.qmd
497 lines (342 loc) · 19.9 KB
/
04-doing-work-with-dx-run.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
# Working with files using `dx run` {#sec-dxrun}
:::{.callout-note}
## Prep for Exercises
Make sure you are logged into the platform using `dx login` and that your course project is selected with `dx select`.
In your shell, make sure you're in the `bash_bioinfo_scripts/dx-run/` folder:
```
cd dx-run/
```
:::
## Learning Objectives
1. **Utilize** an all purpose app (`swiss-army-knife`) to run bioinformatics jobs using `dx run` on files in a DNAnexus project
1. **Utilize** multiple inputs in a Swiss Army Knife job.
1. **Explain** and **utilize** multiple options in a Swiss Army Knife job.
1. **Wrap** R and Python scripts using a bash script to run it on the cloud using `dx run`
1. **Tag** output files on the platform for downstream use.
## Doing Bioinformatics with the Swiss Army Knife (SAK) App
Let's focus on key bioinformatics tasks we need to accomplish. To make things easier, we'll be using [Swiss Army Knife](https://platform.dnanexus.com/app/swiss-army-knife), which is an app on the platform that contains a number of commonly used bioinformatics utilities, including:
:::{.column-body layout-ncol="2"}
- `bcftools`
- `bedtools`
- `BGEN`
- `bgzip`
- `BOLT-LMM`
- `Picard`
- `Plato`
- `plink`
- `plink2`
- `QCTool`
- `REGENIE`
- `sambamba`
- `samtools`
- `seqtk`
- `tabix`
- `vcflib`
- `vcftools`
:::
Swiss Army Knife also contains R and Python 3.
Everything you'll learn in this section will also be applicable to building apps on the platform as well. You'll learn the foundations of running a script on the platform, which is halfway to building your own apps on the platform.
## Running Jobs on the DNAnexus platform using `dx run`
Our main command for running jobs is `dx run`. `dx run` lets us submit jobs to be run on the platform. These jobs can be to process files (such as aligning FASTQ files to a genome), or they can be for web apps, such as LocusZoom (for visualizing).
If you have used SLURM on an on-premise HPC system, the equivalent command would be `srun`, and if you have used PBS, the equivalent command would be `sbatch`.
## Try out your first job
In the sample project, you will see a `.bam` file in `data/` called `NA12878.bam` with no index. Let's create an index for this file with `sambamba` by running it with `dx run app-swiss-army-knife`.
```{bash}
#| eval: false
#| filename: dx-run/run-sambamba.sh
dx run app-swiss-army-knife \
-iin="/data/NA12878.bam" \
-icmd="sambamba index *"
```
Let's take this code apart. We start our command with `dx run` and the name of the app on the platform we want to use: `app-swiss-army-knife`.
The second line contains the file input we want to process, which is a file in our current project. Note that the input is specified as `-iin`, not `--in` or `--iin`. Using a single hyphen `-` here instead of a double hyphen `--` for our inputs is different than we might expect for other linux/UNIX parameters.
The third line is the actual command we want to run in Swiss Army Knife. We want to run `sambamba index` on our input file.
When you run the above code, either by pasting it into your shell, or using `bash run-sambamba.sh` you'll see the following response.
```
Using input JSON:
{
"cmd": "sambamba index *",
"in": [
{
"$dnanexus_link": {
"project": "project-GGyyqvj0yp6B82ZZ9y23Zf6q",
"id": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"
}
}
]
}
Confirm running the executable with this input [Y/n]: Y
Calling app-GFxJgVj9Q0qQFykQ8X27768Y with output destination
project-GGyyqvj0yp6B82ZZ9y23Zf6q:/
Job ID: job-GJ0G58Q0yp65gzJzKjYv04bZ
Watch launched job now? [Y/n] Y
```
Respond with `Y` when it asks
`Confirm executable with this input [Y/n]:`
And when it asks you
`Watch launched job now? [Y/n]`
:::{.callout-note}
## Try it out!
Run the above code (also in `dx-run/run-sambamba.sh`) and look at the log that is generated. Take a look at it carefully, especially the output log.
What file does it create? Where does it create that file?
:::
:::{.callout-note collapse="true"}
## Answer
If you use `dx tree` to take a look at the project, you'll see that there was a file `NA12878.bam.bai` that was created in the root of our project.
```
dx tree
── data
│ ├── NA12878.bai
│ ├── NA12878.bam
│ ├── NC_000868.fasta
│ ├── NC_001422.fasta
│ ├── small-celegans-sample.fastq
│ ├── SRR100022_chrom20_mapped_to_b37.bam
│ ├── SRR100022_chrom21_mapped_to_b37.bam
│ └── SRR100022_chrom22_mapped_to_b37.bam
└── NA12878.bam.bai
```
If we wanted to control where our index file ended up, we can use the `--destination` option to specify our destination, such as
`--destination data/`
:::
### What about the outputs?
One of the nice things about Swiss Army Knife is that it automatically transfers files that we generate on the worker. In our above job, we generated the corresponding index file `NA12878.bam.bai` from our `-icmd` input. It was automatically transferred over from the worker to the root of our project.
If we wanted to change the output folder, we can use the `--destination` flag, like below:
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-iin="/data/NA12878.bam" \
-icmd="sambamba index *" \
--destination "data/"
```
Another nice thing about `dx run` is that it will automatically create the folder we specify with `--destination` if it doesn't exist in project storage.
:::{.callout-note}
## What if we run our statement multiple times?
We can actually run our `dx run` statement multiple times. The output will be multiple files with the same name in the `destination` directory.
What's going on here? Well, files are considered as *objects* on the platform. Each copy generated by our `dx run` are considered unique objects on the platform, with separate file identifiers.
:::
## Building on our first `dx run` job
We've seen how to specify file inputs to Swiss Army Knife. Let's dive into some of the intricacies of using the `-icmd` input.
### Submitting a script to the worker {#sec-worker-script}
The `-icmd` input is helpful for small scripting jobs, but you often want to do a bunch of things in a Swiss Army Knife job.
Bash scripting to the rescue! Say we have a script called `run-on-worker.sh` that takes 1 positional input, which is a file path. We'll run this script on the worker:
```{bash}
#| eval: false
#| filename: worker-scripts/run-on-worker.sh
sambamba $1 > $1.bai
samtools stats $1 > $1.stats.txt
```
When we set up a `dx run app-swiss-army-knife`, we'll use this script as one of the inputs to `dx run`:
```{bash}
#| eval: false
#| filename: dx-run/dx-run-script.sh
dx run app-swiss-army-knife \
-iin="worker_scripts/run-on-worker.sh" \
-iin="data/NA12878.bam" \
-icmd="bash run-on-worker.sh $in_path"
```
The magic here is that we're using `bash` in our `cmd` input to run the script `run-on-worker.sh`. Notice this script is already in our project in the `worker_scripts/` folder, so we can submit it as a file input:
```
-iin="worker_scripts/run-on-worker.sh"
```
:::{.callout-note}
# Why this is important on the platform
This is a common pattern that we'll use when we need to do more complicated things with Swiss Army Knife. It doesn't seem that helpful right now, but it will once we get to batch scripting.
:::
You'll notice that we are using a special helper variable here called `${in_path}` - this gives the current path of the file on the *worker*. It is tremendously helpful in scripting. Let's learn about these built-in helper variables next.
### Bash Helper Variables
There are some helper variables based on the file inputs specified in `-iin`.
These variables are really helpful in scripting in our `-icmd` input. Say we ran our command:
```
dx run app-swiss-army-knife \
-iin="/data/NA12878.bam" \
-icmd="sambamba index ${in_path}"
```
You'll see that we are using a special variable called `$in_path` here to specify the file name. This is called a *helper variable* - they are available based on the different file inputs we submit to Swiss Army Knife.
For our one input `-iin="data/NA12878.bam"`, this is what these helper variables contain:
| Bash Variable | Value | Notes |
|---------------|-------|-------|
| `$in` | `data/NA12878.bam` | Location in project storage |
| `$in_path` | `~/NA12878.bam` | Location in worker storage |
| `$in_name` | `NA12878.bam` | File name (without folder path) |
| `$in_prefix` | `NA12878` | File name without the suffix |
`$in_prefix` is really helpful when you want to create files that have the same prefix as our file. Say we wanted to run `samtools view -c` on our file. We want a file called `NA12878.count.txt` that contains the read counts.
```
dx run app-swiss-army-knife \
-iin="data/NA12878.bam" \
-icmd="samtools view -c ${in_path} > ${in_prefix}.counts.txt"
```
What happens when the command is run on the worker is that the variables are expanded like this:
```
samtools view -c ~/NA12878.bam > NA12878.counts.txt
```
:::{.callout-note}
## What is the difference between `$in` and `$in_path`?
You may have looked at the above table and been slightly confused about these two variables.
`$in` is the file location in **project storage**, while `$in_path` is the file location in the temporary **worker storage**.
When writing scripts that run on the worker with the `-icmd` input, `$in_path` is much more useful.
:::
:::{.callout-note}
## Why are we learning about helper variables?
Writing scripts that can be run on the worker is part of the app building process, which lets us specify our own executables.
If you can get a handle on working with helper variables in `-icmd`, then you are most of the way to building an app.
:::
## More Swiss Army Knife
There is a table in the [Swiss Army Knife documentation](https://platform.dnanexus.com/app/swiss-army-knife) that specifies the inputs and how to configure the `-icmd` parameter for each operation.
One thing you'll notice is that some of them require multiple file inputs. How do you specify them?
### Multiple file inputs to Swiss Army Knife
One thing that may not be apparent to you is that you can specify multiple file inputs with multiple `-iin` options.
The Swiss Army Knife Documentation notes that `-iin` is actually an *array* file input. That means that we can submit multiple files like this:
```{bash}
#| eval: false
#| filename: dx-run/dx-run-multiple-inputs.sh
dx run app-swiss-army-knife \
-iin="data/SRR100022_chrom20_mapped_to_b37.bam" \
-iin="data/SRR100022_chrom21_mapped_to_b37.bam" \
-iin="data/SRR100022_chrom22_mapped_to_b37.bam" \
-icmd="ls *.bam | xargs -I% sh -c 'samtools view -c % > %.count.txt'"
```
If we want to process each of these files, we'll have to use `xargs` in our `icmd` statement:
```
-icmd="ls *.bam | xargs -I% sh -c 'samtools view -c % > %.count.txt'"
```
We'll cover more of this in @sec-batch.
### Using PLINK triplets {#sec-plink}
If we're working with a PLINK trio of files (`.bed`, `.bim`, and `.fam` files), we can use the file inputs as below. We'll submit the trio of files separately, and use the `--bfile` option in `plink` to specify all three files.
```{bash}
#| eval: false
#| filename: dx-run/run-plink.sh
dx run app-swiss-army-knife \
-iin="data/plink/chr1.bed" \
-iin="data/plink/chr1.bim" \
-iin="data/plink/chr1.fam" \
-icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc"
```
We will output the QC'ed and filtered files as `chr1_qc.bed`, `chr1_qc.bim`, and `chr1_qc.fam`. We do that by specifiying the `--out` option for `plink`.
## `dx run` runs deep
If you've looked at the `dx run` help page, you'll notice it is quite long. There are a lot of options. Let's look at a few of them that are really helpful when you're getting started.
- `--tag` - please use these. Your jobs will be findable with a tag.
- `--destination` - destination folder in project. If the path doesn't exist, it will be created.
- `--instance-type` - Instance types used in the job.
- `--watch` - run `dx watch` for the job id that is generated
- `-y` - non-interactive mode. Replies yes to the questions and starts up `dx watch` for that job id.
- `--allow-ssh` - very helpful in debugging jobs. Allows you to `dx ssh` into that worker and check out the status.
- `--batch-tsv` - works with the tab-delimited files that you generate using `dx generate_batch_inputs`.
- `--detach` - if a subjob of a job, it detaches the job. This is really important in batch operations where a top-level job is generating a subjob. Running the subjobs as detached means that the entire top-level job will not fail if a subjob fails.
I'll outline some examples where we leverage the options below.
### Save our output files into a different folder
As we mentioned above, by default our output files are generated in the root folder of the project. If we want to control this, we can use the `--destination` option.
Remember, if the folder our `--destination` option doesn't exist, `dx run` will create that folder and send the outputs into the newly created folder.
```{bash}
#| eval: false
#| filename: dx-run/run-destination.sh
dx run app-swiss-army-knife \
-iin="data/chr1.bed" \
-iin="data/chr1.bim" \
-iin="data/chr1.fam" \
-icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
--destination "results/"
```
### Run our job on a different instance type
We can specify a different instance type (@sec-instance) with the `--instance-type` parameter.
```{bash}
#| eval: false
#| filename: dx-run/run-instance.sh
dx run app-swiss-army-knife \
-iin="data/chr1.bed" \
-iin="data/chr1.bim" \
-iin="data/chr1.fam" \
-icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
--instance-type "mem1_ssd1_v2_x8"
```
### Tag and rename our job
We can change the name of our job using `--name` - this can be very helpful when running a bunch of jobs. Here we're adding `$in_prefix` so we can see what Chromosome we're running on. Very helpful in batch submissions.
Adding a tag to our job, such as the run number, can be very helpful in terminating a large set of jobs (@sec-terminate).
```{bash}
#| eval: false
#| filename: dx-run/run-job-tag-name.sh
dx run app-swiss-army-knife \
-iin="data/chr1.bed" \
-iin="data/chr1.bim" \
-iin="data/chr1.fam" \
-icmd="plink --bfile chr1 --geno 0.01 --make-bed --out chr1_qc" \
--tag "plink_job" --name "Run PLINK on fileset: $in_prefix"
```
### Clone an failed job
If you have run a job on a lower priority, your job has a chance of getting bumped (stopped). If that's the case, your job will fail.
That's when using the `--clone` parameter can come in handy.
```{bash}
#| eval: false
dx run app-swiss-army-knife --clone <job-id>
```
When we learn about JSON (#sec-json), we'll learn a recipe for restarting a list of these failed jobs.
:::{.callout-note}
## Keep in Mind: `dx run --tag`
Keep in mind when you are using the `--tag` option with `dx run`, it does not tag the *outputs* of your jobs with that tag. It only tags the jobs.
:::
## dxFUSE: Simplify your scripts with multiple inputs {#sec-dxfuse}
Ok, we learned about submitting files as inputs. Is there a way to run scripts with less typing?
There is a general method for working in Swiss Army Knife and other apps that lets you bypass specifying file inputs explicitly in your `dx run` statement: using the [dxFUSE file system](https://github.com/dnanexus/dxfuse).
The main thing you need to know as a developer is that you can prepend a `/mnt/project/` to your file path to use in your `-icmd` input directly. For the `dx run` statement in @sec-plink, we can rewrite it as the following:
```{bash}
#| eval: false
#| filename: dx-run/run-job_dxfuse.sh
dx run app-swiss-army-knife \
-icmd="plink --bfile /mnt/project/data/plink/chr1 --geno 0.01 --make-bed --out chr1_qc"
```
In our `-icmd` input, we are referring to the file location of the triplet by pre-pending a `/mnt/project/` to the project path of the triplet: `data/chr1` to make a new path:
```
/mnt/project/data/chr1
```
This new path lets us specify files from project storage without directly specifying them as an input.
We could rewrite this command
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-iin="data/NA12878.bam" \
-icmd="samtools view -c NA12878.bam > NA12878.counts.txt"
```
As this:
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-icmd="samtools view -c /mnt/project/data/NA12878.bam > NA12878.counts.txt"
```
We can also use an `ls` on a dxFUSE path in our `-icmd`. This is a pattern that will become especially helpful when we do process multiple files on a single worker (@sec-mult-worker)
```{bash}
#| eval: false
dx run app-swiss-army-knife \
-icmd="ls /mnt/project/data/*.bam | xargs -I% sh -c 'samtools stats % > \$(basename %).stats.txt'" \
--destination "results/"
```
Our `cmd` input is kind of complicated here. Within the subshell, we're taking advantage of the `basename` command, which returns the bare filename of our object. We need this because we can't currently write using dxfuse.
```
sh -c 'samtools stats % > \$(basename %).stats.txt'
```
What's going on with the `\$(basename %)`? We need to use the parentheses (`()`) because it's in a subshell command, and we escape the `$` in front of it, so that our base shell doesn't expand it, since we want the shell on the worker to expand it instead.
So if our first file is `/mnt/project/data/NA12878.bam`, then the substitution on the worker goes like this:
```
samtools stats /mnt/project/data/NA12878.bam > NA12878.stats.txt
```
Most of these issues are because we need to specify subshells in our `cmd` and use `dxfuse`. Honestly, I would probably put the `xargs` code into a script, and submit that script into the worker, rather than specify it as a single line in our `cmd` input. (@sec-worker-script)
### Advantages of dxFUSE
Most of my scripts leverage dxFUSE. Let's talk about the advantages of using `dxFUSE`.
- **You don't have to specify files as inputs**. This avoids a lot of typing in your scripts. This can be very helpful if there are a number of files you need as inputs per job.
- **Allows you to stream files**. dxFUSE is a file streaming method. That is, it will stream files bit by bit to the worker. This can make the overall job run somewhat faster, as it can stream files as needed, and not have to explicitly transfer each file to the worker.
### Disadvantages of dxFUSE
There are some disadvantages to using dxFUSE you should be aware of:
- **It obscures your audit trail.** Because we are not specifying the files as inputs, it is less obvious as to which files were processed with each job.
- **It works best as read-only.** dxFUSE has limited write capabilities at this point, so it is best to treat it as a read-only file system.
- **You have to be aware of how it resolves duplicate file names in a folder**. dxFUSE
works differently than what you might expect when it encounters multiple file names. It separates out duplicate files by prefacing them with a number. For example, if you generated three files called `chr1.results.txt` in a folder called `data`, this is how you would refer to each of them:
```
data/chr1_results.txt
data/1/chr1_results.txt
data/2/chr1_results.txt
```
This name resolution can be tricky to work with.
- **It works best with good file hygiene practices.** One of the confusing things is that you can create files with the same filename in the same folder (see previous point). This can easily happen if you `clone` a job. Because these two files have unique identifiers, they are allowed to have identical file names on the platform. This is one reason to tag each set of jobs with a run number (such as `run001`), so you can make sure that all of the updated files are the correct version.
## Using a Docker Image with `dx run`/Swiss Army Knife
See the containers chapter (@sec-containers) for more information.
## What you learned in this chapter
Kudos and congrats for reaching this far. We built upon our shell scripting skills (@sec-script) and our knowledge of cloud-computing (@sec-cloud) to start running jobs effectively in the cloud.