# Using (Bioinformatics) Software
Now that we’ve got the software installed for our example analysis, we’ll need to get some data. Supposing we don’t have any novel data to work with, we’ll ask the following question: can we identify, using HMMER and `muscle`, homologues of P450-1A1 genes in the *Drosophila melanogaster* protein data set? (This is a trivial example, as the *D. melanogaster* genome is already well annotated.)
### Getting Data {-}
The first step will be to download the *D. melanogaster* data set, which we can find on http://flybase.org, the genome repository for *Drosophila* species. Generally, genome repositories like FlyBase provide full data sets for download, but they can be difficult to find. To start, we’ll navigate to “Files,” then “Releases (FTP).”
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_22_flybase_home.png" />
</div>
From there, we’ll navigate to a recent release, like FB2014_05. Because genomic information is frequently updated as better information becomes available, each update is published as a new release.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_23_flybase_recent.png" />
</div>
Next we’ll see links for specific species; we’ll work with `dmel_r6.02`. It is often a good idea to note specific release numbers or release dates for data sets you download, for eventual description in the methods sections of any papers you write that are based on those data.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_24_flybase_dmel.png" />
</div>
The most common format for sequence information is known as FASTA and is the format we want, so the next step is to navigate to the `fasta` directory. Other potentially interesting options include the `gff` and `gtf` directories, which hold text-based annotation files describing the location and function of genes in the genome. (These formats are often useful for RNA-seq analysis.)
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_25_flybase_fasta.png" />
</div>
Finally, we see a variety of gzipped files we can `wget` on the command line. Because we are interested in the full protein data set for this species, we will use the URL for `dmel-all-translation-r6.02.fasta.gz`.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_26_flybase_all_translation.png" />
</div>
Before running `wget` on the file, we’ll create a `projects` directory in our home directory, and a `p450s` directory inside there to work in.
<pre id=part1-07-mkdir-proj
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/downloads]$ <b>cd $HOME</b>
[oneils@mbp ~]$ <b>mkdir projects</b>
[oneils@mbp ~]$ <b>cd projects</b>
[oneils@mbp ~/projects]$ <b>mkdir p450s</b>
[oneils@mbp ~/projects]$ <b>cd p450s</b>
[oneils@mbp ~/projects/p450s]$ <b>wget 'ftp://ftp.flybase.net/releases/FB2014_05/dmel_r6.02/fasta/dmel-all-translation-r6.02.fasta.gz'</b>
</code></pre>
Because the file is gzipped, we can use `gunzip` to decompress it, and then use `less -S` to view the results without the long lines wrapped in the terminal window.
<pre id=part1-07-gunzip
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>gunzip dmel-all-translation-r6.02.fasta.gz</b>
[oneils@mbp ~/projects/p450s]$ <b>less -S dmel-all-translation-r6.02.fasta</b>
</code></pre>
The result illustrates the standard format for a FASTA file. Each sequence record begins with a line starting with a `>` character, and the first non-whitespace-containing word following that is considered the sequence ID.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_29_fasta_format.png" />
</div>
###### {- #whitespace}
This line might then contain *whitespace* characters and metadata. Whitespace comprises a sequence of one or more spaces, tabs (represented in Unix/Linux as a special character sometimes written as `\t`), or newlines (represented in Unix/Linux as a special character sometimes written as `\n`) in a row.
Lines following the header line contain the sequence information, and there is no specific format for the number of lines over which the sequence may be broken, or how long those lines should be. After the last sequence line for a record, a new sequence record may start.
Depending on the source of the FASTA file, the IDs or metadata may represent multiple pieces of data; in this example, the metadata are separated by spaces and have a `<label>=<value>` format that is specific to protein sequences from FlyBase.
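
As a rough illustration of the format just described, the sketch below builds a tiny FASTA file and inspects it with standard tools. The file name `tiny.fasta`, the IDs `seqA` and `seqB`, and their metadata are all invented for this example (they are not FlyBase records), and the ID extraction assumes the metadata are separated from the ID by a space.

```shell
# Work in a scratch directory so nothing in the project is touched.
demo=$(mktemp -d)

# A tiny, invented FASTA file: two records, the first split over two lines.
printf '>seqA type=protein; length=8;\nMVSD\nFGLP\n>seqB type=protein; length=4;\nMKKL\n' > "$demo/tiny.fasta"

# Each record has exactly one header line, so counting the lines that
# start with ">" counts the sequences.
num_records=$(grep -c '^>' "$demo/tiny.fasta")
echo "records: $num_records"

# The ID is the first word after ">": keep the header lines, drop the ">",
# then keep everything up to the first space.
ids=$(grep '^>' "$demo/tiny.fasta" | sed 's/^>//' | cut -d ' ' -f 1)
echo "$ids"
```

The same `grep -c '^>'` idea could be tried on `dmel-all-translation-r6.02.fasta` to see how many protein records it holds.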
For our next trick, we’ll download some P450-1A1 protein sequences from Uniprot.org. Uniprot.org is a well-known protein database composed of two sections: “TrEMBL” and the much smaller “Swiss-Prot.” While the former contains many sequences with annotations, many of those annotations have been assigned by automated homology searches and have not been reviewed. The latter, Swiss-Prot, contains only sequences whose annotations have been manually reviewed.
For the download, we’ll enter “p450 1A1” into the search field, and we’ll filter the results to only those in Swiss-Prot by clicking on the “Reviewed” link, resulting in 28 matches. Next, we can click on the “Download” button to download a “FASTA (canonical)” (rather than with all isoforms included) file.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_30_uniprot_download.png" />
</div>
The Uniprot website recently underwent a redesign, such that the downloaded file is transferred directly to the web browser, rather than presented as a URL that could be accessed with `wget`. This isn’t a problem, as it gives us a chance to discuss how to transfer files between the remote system and local desktop via SFTP.
Like SSH, SFTP is a common client/server protocol. Provided the server is running on the remote computer (in fact, it uses the same port as SSH, port 22, because SSH provides the secure connection), we just need to install and run an SFTP client on our desktop. There are many SFTP clients available for Microsoft Windows (e.g., Core-FTP), OS X (e.g., Cyberduck), and Linux systems (e.g., `sftp` on the command line or the graphical FileZilla). The client discussed here is called FireFTP, and it is available as an extension for the Mozilla Firefox web browser (itself available for Windows, OS X, and Linux). To get it requires installing and running Firefox, navigating to Tools → Addons, and searching for “FireFTP.”
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_31_fireftp_search.png" />
</div>
Once the plugin is installed (which requires a restart of Firefox), we can access it from the Tools → Developer submenu. Connecting the client to a remote computer requires that we first configure the connection by selecting “Create an account.” The basic required information includes an account name, the host to connect to (e.g., an IP address like `128.196.64.120` or a host name like `files.institution.edu`), as well as our login name and password.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_32_fireftp_accttab1.png" />
</div>
We also need to tell the client which protocol to connect with, which is done on the “Connection” tab; we want SFTP on port 22.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_33_fireftp_accttab2.png" />
</div>
With that accomplished, we can transfer any file back and forth using the green arrows in the interface, where the remote filesystem is shown on the right and the local filesystem is shown on the left. Here’s the result after transferring our `p450s.fasta` file.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_34_fireftp_transfer.png" />
</div>
> #### DOS/Windows and Unix/Linux Newlines {-}
> For the most part, the way text is encoded on Microsoft operating systems (like DOS and Windows) and on Unix-like systems (like Linux and OS X) is similar. But there is one difference: how the ends of lines, or “newline characters,” are represented. In Unix-like systems, a newline is represented by a single 8-bit byte (the “Line Feed” (LF) character): `00001010`. On Microsoft systems, it is represented by a pair of 8-bit bytes (“Carriage Return” (CR) followed by LF): `0000110100001010`. This means that text files created on Microsoft operating systems may not be readable on Unix-like systems, and vice versa.
>
> Fortunately, there are utilities available for converting between these formats. On the command line, the utilities `dos2unix` and `unix2dos` convert to and from Unix-like format, respectively. This isn’t often an issue, as most file transfer programs (FireFTP included) automatically perform the appropriate conversion. The command-line utility `file` can also be used to determine the type of a file, including its newline type.
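
To make the byte-level difference concrete, here is a small sketch that writes the same two lines of text with both kinds of line endings and compares them. The file names are made up for the example, and `tr -d '\r'` is used only as a stand-in for `dos2unix`, which may not be installed everywhere.

```shell
demo=$(mktemp -d)

# The same two lines of text, written with Unix (LF) and DOS (CRLF) endings.
printf 'ACGT\nTTGA\n' > "$demo/unix.txt"
printf 'ACGT\r\nTTGA\r\n' > "$demo/dos.txt"

# od -c shows every byte; the \r characters appear only in dos.txt.
od -c "$demo/dos.txt"

# Each CRLF line ending costs one extra byte compared to LF.
unix_bytes=$(wc -c < "$demo/unix.txt")
dos_bytes=$(wc -c < "$demo/dos.txt")
echo "unix: $unix_bytes bytes, dos: $dos_bytes bytes"

# Deleting every carriage return yields the Unix form of this file.
tr -d '\r' < "$demo/dos.txt" > "$demo/converted.txt"
cmp "$demo/unix.txt" "$demo/converted.txt" && echo "identical after conversion"
```

Running `file` on the two files should also report the difference in line terminators, though the exact wording of its output varies between systems.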
### Putting It All Together {-}
At this point, we’ve obtained the data we wish to process, and we’ve successfully installed the software we intend to run on that data.
<pre id=part1-07-together
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>ls</b>
dmel-all-translation-r6.02.fasta p450s.fasta
[oneils@mbp ~/projects/p450s]$ <b>which muscle</b>
/home/oneils/local/bin/muscle
[oneils@mbp ~/projects/p450s]$ <b>which hmmbuild</b>
/home/oneils/local/bin/hmmbuild
[oneils@mbp ~/projects/p450s]$ <b>which hmmsearch</b>
/home/oneils/local/bin/hmmsearch
</code></pre>
The first step is to run `muscle` on the `p450s.fasta` file to produce a multiple alignment. One quick way to identify how a program like `muscle` should be run (what parameters it takes) is to run it without any parameters. Alternatively, we could try the most common options for getting help: `muscle -h`, `muscle --help`, `muscle --h`, or `muscle -help`.
<pre id=part1-07-muscle-help
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>muscle</b>
MUSCLE v3.8.31 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
Basic usage
muscle -in <inputfile> -out <outputfile>
Common options (for a complete list please see the User Guide):
-in <inputfile> Input file in FASTA format (default stdin)
-out <outputfile> Output alignment in FASTA format (default stdout)
-diags Find diagonals (faster for similar sequences)
-maxiters <n> Maximum number of iterations (integer, default 16)
-maxhours <h> Maximum time to iterate in hours (default no limit)
-html Write output in HTML format (default FASTA)
...
</code></pre>
The most important line of this help text is the usage information: `muscle -in <inputfile> -out <outputfile>`; parameters in angle brackets indicate that the parameter is required (nonrequired parameters often appear in square brackets). Further help text indicates other parameters that we could opt to add. Presumably, they could be placed before or after the input or output specifiers. So, we’ll run `muscle` on our `p450s.fasta` file, and produce a file whose file name indicates its pedigree in some way:
<pre id=part1-07-run-muscle
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>muscle -in p450s.fasta -out p450s.fasta.aln</b>
</code></pre>
Once the command has finished executing, we can view the alignment file with `less -S p450s.fasta.aln`.
<pre id=part1-07-less-result
class="language-txt
line-numbers
linkable-line-numbers">
<code>
<b>>sp|P05176|CP1A1_RABIT Cytochrome P450 1A1 OS=Oryctolagus cuniculus GN=CYP1A1 PE
------------------------------------------------------------
---------------------------------------------------MVSDFGLPT
FISATELLLASAVFCLVFWVAGASKPRVPKGLKRLPGPWGWPLLGHVLTLGK---NPHV-
---ALARLSRRYGDVFQIRLGSTPVVVLSGLDTIKQALVRQGDDFKGRPDLYSFSFVTK-
-----------------------------GQSMIFGSDSGPV---------WAARRRLAQ
NALNSFSVAS--------------------------------------------------
----------------DPASSSSCYLEEHV---SQEAENLISK-FQELMAAVGH------</b>
</code></pre>
With further inspection, we’d see that the sequences have been made the same length by the insertion of gap characters. The next step is to run `hmmbuild` to produce the HMM profile. Again, we’ll run `hmmbuild` without any options to get information on what parameters it needs.
<pre id=part1-07-hmmbuild-help
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>hmmbuild</b>
Incorrect number of command line arguments.
Usage: hmmbuild [-options] <hmmfile_out> <msafile>
where basic options are:
-h : show brief help on version and usage
-n <s> : name the HMM <s>
-o <f> : direct summary output to file <f>, not stdout
-O <f> : resave annotated, possibly modified MSA to file <f>
To see more help on other available options, do:
hmmbuild -h
</code></pre>
The help output for `hmmbuild` is shorter, though the command also notes that we could run `hmmbuild -h` for more detailed information. The usage line, `hmmbuild [-options] <hmmfile_out> <msafile>`, indicates that the last two parameters are required, and are the name of the output file (for the profile HMM) and the multiple sequence alignment input file. The brackets indicate that, before these last two parameters, a number of optional parameters may be given, described later in the help output. In this case, `<hmmfile_out>` and `<msafile>` are positional: the second-to-last argument must specify the output, and the last must specify the input.
<pre id=part1-07-hmmbuild
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>hmmbuild p450s.fasta.aln.hmm p450s.fasta.aln</b>
</code></pre>
After this operation finishes, it may be interesting to take a look at the resulting HMM file with `less -S p450s.fasta.aln.hmm`. Here’s a snippet:
<pre id=part1-07-hmmbuild-less
class="language-txt
line-numbers
linkable-line-numbers">
<code>
<b>CKSUM 797170701
STATS LOCAL MSV -12.2089 0.69650
STATS LOCAL VITERBI -13.2160 0.69650
STATS LOCAL FORWARD -6.4971 0.69650
HMM A C D E F G H I
m->m m->i m->d i->m i->i d->m d->d
COMPO 2.56763 4.18442 2.94826 2.71355 3.22576 2.92226 3.70840 2.94423
2.68504 4.42336 2.77569 2.73176 3.46465 2.40496 3.72606 3.29362
0.66606 1.72211 1.17901 3.05405 0.04832 0.00000 *
1 3.15916 4.51751 4.70912 4.16707 3.23103 4.39232 4.76778 1.44083</b>
</code></pre>
With some documentation reading, we may even be able to decode how the probabilistic profile is represented in this matrix of letters and numbers. As a reminder, our project directory now contains the original sequence file, a multiple-alignment file, and the HMM profile file, as well as the *D. melanogaster* protein file in which we wish to search for the profile.
<pre id=part1-07-ls
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>ls</b>
dmel-all-translation-r6.02.fasta p450s.fasta.aln
p450s.fasta p450s.fasta.aln.hmm
</code></pre>
At this point, we are ready to search for the profile in the *D. melanogaster* protein set with `hmmsearch`. As usual, we’ll first inspect the usage for `hmmsearch`.
<pre id=part1-07-hmmsearch-help
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>hmmsearch</b>
Incorrect number of command line arguments.
Usage: hmmsearch [options] <hmmfile> <seqdb>
where most common options are:
-h : show brief help on version and usage
To see more help on available options, do hmmsearch -h
</code></pre>
This brief help text indicates that `hmmsearch` may take a number of optional parameters (and we’d have to run `hmmsearch -h` to see them), and the last two parameters are required. These last two parameters constitute the HMM profile file we are searching for, as well as the `<seqdb>` in which to search. It doesn’t say what format `<seqdb>` should be, so we’ll try it on our *D. melanogaster* FASTA file and hope for the best (if it fails, we’ll have to read more help text or documentation).
<pre id=part1-07-hmmsearch
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>hmmsearch p450s.fasta.aln.hmm dmel-all-translation-r6.02.fasta</b>
</code></pre>
Note that there was no required option for an output file. Running this command causes quite a lot of information to be printed to the terminal, including lines like:
<pre id=part1-07-hmmsearch-out
class="language-txt
line-numbers
linkable-line-numbers">
<code><b>...
Alignments for each domain:
== domain 1 score: 2.4 bits; conditional E-value: 0.19
p450s.fasta 433 sdekivpivndlfganfdtisvalswslpylvaspeigkklkke 476
+ +++ f a+f+ is l++ +l +p ++ +l++e
FBpp0086933 305 TADDLLAQCLLFFFAGFEIISSSLCFLTHELCLNPTVQDRLYEE 348
555566666667899***************************99 PP
...</b>
</code></pre>
And, when we run `ls`, we find that no output file has been created.
<pre id=part1-07-ls-no-out
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>ls</b>
dmel-all-translation-r6.02.fasta p450s.fasta.aln
p450s.fasta p450s.fasta.aln.hmm
</code></pre>
###### {- #standard_output}
It seems that `hmmsearch`, by default, prints all of its meaningful output to the terminal. Actually, `hmmsearch` is printing its output to the *standard output* stream. Standard output is the primary output mechanism for command-line programs (other than writing files directly). By default, standard output, also known as “standard out” or “stdout,” is printed to the terminal.
###### {- #redirect}
Fortunately, it is possible to *redirect* the standard output stream into a file by indicating this to our shell with a `>` redirect operator, and specifying a file name or path. In this case, we’ll redirect the output of standard out to a file called `p450s_hmmsearch_dmel.txt`.
<pre id=part1-07-hmmsearch-redirect
class="language-txt
line-numbers
linkable-line-numbers">
<code>
[oneils@mbp ~/projects/p450s]$ <b>hmmsearch p450s.fasta.aln.hmm dmel-all-translation-r6.02.fasta > p450s_hmmsearch_dmel.txt</b>
</code></pre>
When this command executes, nothing is printed, and instead our file is created. When using the `>` redirect, the file will be overwritten if it already exists. If, instead, we wished to append to an existing file (or create a new file if there is no file to append to), we could have used the `>>` redirect.
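
The difference between the two redirects can be tried safely with a command that prints very little. In this sketch, `echo` stands in for `hmmsearch`, and `results.txt` is a made-up file name in a scratch directory.

```shell
demo=$(mktemp -d)

# ">" creates the file, or overwrites it if it already exists.
echo "first run" > "$demo/results.txt"
echo "second run" > "$demo/results.txt"
overwritten=$(cat "$demo/results.txt")

# ">>" appends to the file (creating it first if needed).
echo "third run" >> "$demo/results.txt"
line_count=$(wc -l < "$demo/results.txt")

echo "after overwrite: $overwritten"
echo "lines after append: $line_count"
```

Only the second and third lines survive: the second `>` replaced the first line, while `>>` added to what was already there.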
Here are the contents of our final analysis, a simple text file with quite a bit of information, including some nicely formatted row and column data, as displayed by `less -S p450s_hmmsearch_dmel.txt`.
<pre id=part1-07-hmmsearch-output
class="language-txt
line-numbers
linkable-line-numbers">
<code>
<b>Query: p450s.fasta [M=730]
Scores for complete sequences (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Sequence Descript
------- ------ ----- ------- ------ ----- ---- -- -------- --------
1.2e-98 332.1 0.1 2e-98 331.3 0.1 1.2 1 FBpp0081483 type=pr
1.2e-98 332.1 0.1 2e-98 331.3 0.1 1.2 1 FBpp0307731 type=pr
7e-67 227.0 0.0 7e-67 227.0 0.0 1.5 1 FBpp0082768 type=pr
9.5e-67 226.6 0.0 9.5e-67 226.6 0.0 2.1 1 FBpp0297168 type=pr</b>
</code></pre>
### Reproducibility with Scripts {-}
It is highly unlikely that an analysis of this type is performed only once. More often than not, we’ll wish to adjust or replace the input files, compare to different protein sets, or adjust the parameters for the programs run. Thus it makes sense to capture the analysis we just performed as an executable script, perhaps called `runhmmer.sh`.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_49_unix_90_runhmmer_1.png" />
</div>
Note in the above that we’ve broken the long `hmmsearch` line into two by ending it midway with a backslash and continuing it on the next line. The backslash lets `bash` know that more of the command is to be specified on later lines. (The backslash should be the last character on the line, with no spaces or tabs following.) After making this script executable with `chmod`, we could then rerun the analysis by navigating to this directory and running `./runhmmer.sh`.
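
Because the script appears here only as a screenshot, a plain-text sketch may be useful. The version below is reconstructed from the three commands run earlier in this chapter, so the exact wording of the screenshot (including the `#!` line and comments) may differ. It is written out through a here-document so the whole block can be pasted at once, and `bash -n` only checks the syntax; it does not run `muscle` or HMMER.

```shell
# Scratch location for the sketch; in practice the script would live in
# the project directory (~/projects/p450s in this chapter).
demo=$(mktemp -d)

# Quoting 'EOF' keeps the shell from altering the script text,
# including the trailing backslash.
cat > "$demo/runhmmer.sh" <<'EOF'
#!/bin/bash

# Align the query sequences, build a profile HMM, then search with it.
muscle -in p450s.fasta -out p450s.fasta.aln
hmmbuild p450s.fasta.aln.hmm p450s.fasta.aln
hmmsearch p450s.fasta.aln.hmm dmel-all-translation-r6.02.fasta \
  > p450s_hmmsearch_dmel.txt
EOF

# Make it executable, then check the syntax without running anything.
chmod +x "$demo/runhmmer.sh"
bash -n "$demo/runhmmer.sh" && echo "syntax OK"
```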
What if we wanted to change the input file, say, to `argonase-1s.fasta` instead of `p450s.fasta`? We could create a new project directory to work in, copy this script there, and then change all instances of `p450s.fasta` in the script to `argonase-1s.fasta`.
Alternatively, we could use [environment variables](#environment_variable) to structure our script so that this process is easier.
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_50_unix_92_runhmmer_2.png" />
</div>
Now the file names of interest are specified only once, near the top of the script, and from then on the script uses its own identifiers (as environment variables) to refer to them. Reusing this script would be as simple as changing the file names specified in three lines.
We can go a step further. It turns out that shell scripts can take parameters from the command line. The first parameter given to a script on the command line will be automatically stored in a variable accessible to the script called `$1`, the second parameter will be stored in `$2`, and so on. We can thus further generalize our script:
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_51_unix_93_runhmmer_3.png" />
</div>
We could have replaced all instances of `$query` with `$1`, but this organization makes our script easier to read in the future, an important consideration when programming. Now we can run a full analysis by specifying the three relevant file names on the command line, as in: `./runhmmer.sh p450s.fasta dmel-all-translation-r6.02.fasta p450s_hmmsearch_dmel.txt`.
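
The pattern of giving positional parameters readable names can be sketched in text form as follows. This is an illustration rather than the script from the screenshot: the script name `show_plan.sh` and the variable names `db` and `output` are assumptions (only `$query` is named in the text), and the script merely prints the commands it would run so that it can be tried without `muscle` or HMMER installed.

```shell
demo=$(mktemp -d)

cat > "$demo/show_plan.sh" <<'EOF'
#!/bin/bash

# Give the positional parameters readable names.
query=$1
db=$2
output=$3

# Print (rather than run) the three analysis commands.
echo "muscle -in $query -out $query.aln"
echo "hmmbuild $query.aln.hmm $query.aln"
echo "hmmsearch $query.aln.hmm $db > $output"
EOF

# Run the sketch through bash with the three file names from this chapter.
plan=$(bash "$demo/show_plan.sh" p450s.fasta dmel-all-translation-r6.02.fasta p450s_hmmsearch_dmel.txt)
echo "$plan"
```

Removing the `echo` and the surrounding quotes from each of the three lines would turn this dry run into a script that actually performs the analysis.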
This [`runhmmer.sh`](data/runhmmer.sh) is a good candidate for inclusion in our `$HOME/local/bin` so that we can run it from anywhere, though we may want to add lines immediately following the `#!` line, to provide some help text for anyone who attempts to run the script without providing the correct inputs:
<div class="fig center" style="width: 80%">
<img src="images/part1-07-using-bioinfo-software.Rmd.images/I.7_52_unix_94_runhmmer_4.png" />
</div>
The “if block” above will only execute if the number of parameters given (`$#`) is not equal to 3. Although languages like Python provide much nicer facilities for this sort of logic-based execution, the ability to conditionally provide usage information for scripts is important. As usual for `bash`, the interpreter ignores lines that start with `#`.
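
A minimal text sketch of such a check is given below. The script name `check_args.sh` and the wording of the usage message are invented for illustration and may not match the screenshot; the part taken from the text is the test on `$#`.

```shell
demo=$(mktemp -d)

cat > "$demo/check_args.sh" <<'EOF'
#!/bin/bash

# Print help text and stop unless exactly three parameters were given.
if [ $# -ne 3 ]; then
    echo "Usage: check_args.sh <query.fasta> <db.fasta> <output.txt>"
    exit 1
fi

echo "query=$1 db=$2 output=$3"
EOF

# No parameters: the usage line is printed and the exit status is nonzero.
usage_status=0
usage_text=$(bash "$demo/check_args.sh") || usage_status=$?

# Three parameters: the script proceeds past the check.
ok_text=$(bash "$demo/check_args.sh" p450s.fasta dmel-all-translation-r6.02.fasta out.txt)

echo "$usage_text (exit status $usage_status)"
echo "$ok_text"
```

Exiting with a nonzero status after printing the usage text is a common convention, since it lets other scripts detect that the run did not succeed.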
<div class="exercises">
#### Exercises {-}
1. Create a new folder in your `projects` folder called `c_elegans`. Locate the FASTA file for the reference genome of *Caenorhabditis elegans* from http://wormbase.org, and download it to this folder using `wget`. The file you are searching for will be named something like `c_elegans.PRJNA13758.WS244.genomic.fa.gz`. After it is downloaded, decompress it and view it with `less -S`.
2. Install an SFTP client on your desktop, like FireFTP or CyberDuck, and attempt to connect to the same machine you log in to via SFTP. Download a FASTA file of some potentially homologous sequences from Uniprot to your local desktop, and transfer it to your remote `c_elegans` directory.
3. Try running `muscle` and `HMMER` on the sequences you downloaded from uniprot.org against the *C. elegans* genome.
4. If you have access to more than one Unix-based machine (such as an OS X desktop and a remote Linux computer, or two remote Linux computers), read the man page for `scp` with `man scp`, and also read about it online. Try to transfer a file and a directory between machines using `scp` on the command line.
5. Write an executable `bash` script that automates a process of some kind, and install it in your `$HOME/local/bin`. Test it after logging out and back in.
</div>