-
Notifications
You must be signed in to change notification settings - Fork 1
/
README
225 lines (127 loc) · 9.25 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
BPA-2.1.0 National Center for Genome Resources (NCGR)
Date: 2014_02_03
Version: 2.1.0
Contact Information:
Description:
Batch Parallel Assembly (BPA) version 2.1.0 consists of eight modes used for de novo transcriptome assembly from short read data (e.g. 2x50 Illumina Hi-Seq data), as well as read realignment and annotation. The modes included in BPA-2.1.0 are: Unitigs, OLC, Scaffold, Gapclose, Realign, ESTScan, BLAST and HMMscan. Each mode in BPA-2.1.0 is run individually, giving the user flexibility to run only the modes of interest. Modes can be run using data generated from the previous step or with external user files.
For additional information specific to the Marine Microbial Eukaryote Transcriptome Sequencing Project, see README_MMETSP.
Modes:
1) Unitigs: sequence reads are assembled into unitigs with ABySS [1] using a pre-defined range of kmers. Unitigs from all kmer assemblies are combined and redundancies removed using CD-HIT-EST [2].
2) OLC (overlap layout consensus): the OLC assembler CAP3 [3] is used to identify overlaps between unitigs and assemble larger sequence contigs.
3) Scaffold: contigs are paired-end scaffolded using ABySS [1].
4) Gapclose: sequence read pairing information is used in GapCloser [4, part of SOAP de novo package] to walk in on gaps created during scaffolding.
5) Realign: BWA [5] is used to align sequence reads to a given reference.
6) ESTScan: coding sequences are predicted using ESTScan [6,7] with a user provided scoring matrix.
7) BLAST: BLASTx [8] is used to annotate ESTs from hits against selected database(s).
8) HMMscan: protein sequences are functionally characterized using HMMER3 [9] against Pfam-A [10], TIGRFAM [11], and SUPERFAMILY [12] databases.
References:
[1] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, I. Birol. 2009. ABySS: a parallel assembler for short read sequence data. Genome Res., 19: 1117–1123.
[2] Weizhong Li and Adam Godzik. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22:1658-1659.
[3] X. Huang and A. Madan. 1999. CAP3: A DNA sequence assembly program. Genome Res., 9: 868-77.
[4] R. Li, Y. Li, K. Kristiansen, and J. Wang. 2008. SOAP: Short oligonucleotide alignment program. Bioinformatics., 25: 713-714.
[6] C. Iseli, C. V. Jongeneel, and P. Bucher. 1999. ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol, 138-48.
[7] C. Lottaz, C. Iseli, C. V. Jongeneel, and P. Bucher. 2003. Modeling sequencing errors by combining Hidden Markov models. Bioinformatics, 19: ii103-ii112.
[8] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3): 403-410.
[9] Z. Zhang and W. I. Wood. 2003. A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics., 19: 307-308.
[10] Robert D. Finn, John Tate, Jaina Mistry, Penny C. Coggill, Stephen John Sammut, Hans-Rudolf Hotz, Goran Ceric, Kristoffer Forslund, Sean R. Eddy, Erik L. L. Sonnhammer, and Alex Bateman. 2010. The Pfam protein families database. Nucleic Acids Res., 38: D211-222.
[11] Daniel H. Haft, Brendan J. Loftus, Delwood L. Richardson, Fan Yang, Jonathan A. Eisen, Ian T. Paulsen, and Owen White. 2001. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucl. Acids Res., 29(1): 41-43.
[12] Julian Gough, Kevin Karplus, Richard Hughey, and Cyrus Chothia. 2001. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology, 313(4): 903–919.
Installation:
The following dependencies are required with the versions listed or newer. Be sure to export these, as well as the rbpa directory from GitHub, into your PATH before running BPA-2.1.0.
ABySS 1.3.3 (http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases)
CD-HIT V4.5.5 (https://code.google.com/p/cdhit/downloads/list)
CAP3 2007/10/15 (http://seq.cs.iastate.edu/cap3.html)
GapCloser for SOAPdenovo v1.10 (http://soap.genomics.org.cn/index.html#down2)
ESTScan 3.0.3 (http://sourceforge.net/projects/estscan/)
ncbi_blast+ 2.2.26 (ftp://ftp.ncbi.nlm.nih.gov/blast/db/)
HMMER 3.0 (http://hmmer.janelia.org/software/archive)
Usage:
Wrapper:
run-bpa.pl is the wrapper script for the BPA pipeline. ALWAYS USE FULL PATHS IN COMMANDS.
usage: run-bpa.pl <-m mode>
mode: Unitigs, OLC, Scaffold, Gapclose, Realign, BLAST, ESTscan, HMMscan
run-bpa.pl -m mode -h will return the usage for the given mode.
Unitigs:
Unitigs mode must be run on an MPI enabled SGE cluster. Data cannot be gzipped.
Usage: run-bpa.pl -m unitigs <-i "fastq(s)"> <-c> <-k> [-v]
-i) fastq input file(s). If multiple fastq files are provided please utilize quotations (can use wildcard).
-c) number of cores to use on SGE cluster. This implies PE ORTE.
-k) kmer list; text file with one line per value.
-v) minimum kmer coverage [5].
example: run-bpa.pl -m unitigs -i "example.1.fastq example.2.fastq" -c 128 –k kmer.txt
output file(s): abyss/*/abyssrun-unitigs.fa
OLC:
Usage: run-bpa.pl -m OLC <-i "fasta(s)"> [-t threads] [-u "cap3 options"] [-e <= 1.0]
-i) fasta input file(s). If multiple fasta files are provided please utilize quotations (can use wildcard).
-t) threads to be used in CD-HIT-EST.
-u) CAP3 command [-o100 -h50]. Use quotations for multiple options.
-e) percent identity for CD-HIT-EST. If not provided, reduction will not be performed.
example: run-bpa.pl -m OLC -i "/*/abyssrun-3.fa" -t 0 -u "-o 100 -h 20" -e .98
output file: overlap/cap3_final.fna
Scaffold:
Usage: run-bpa.pl -m scaffold <-r "read file(s)"> <-R reference.fa> <-k kmer> [-i maximum mismatch bwa] [-p min pairs] [-o output prefix] [-s seed length] [-t threads]
-r) fastq read file(s). Expected orientation is forward reverse.
-R) fasta scaffold reference. Sequences will be renamed sequentially.
-k) kmer value to use in ABySS. Suggest using read length or value close to read length.
-i) maximum mismatch for Burrows-Wheeler aligner [0.04]
-p) minimum pairing evidence to use for scaffolds [5]
-o) output file prefix [bpa_out]
-s) minimum scaffold seed length [200]
-t) threads [2]
example: run-bpa.pl -m scaffold -r "example.1.fastq example.2.fastq" -R overlap/cap3_final.fna -k 35 -p 3 -s 302 -t 3
output file: scaffold/scaffolds.fna
Gapclose:
Usage: run-bpa.pl -m gapclose <-f forward reads> <-r reverse reads> <-R fasta file to gap close> <-l maximum read length> [-i insert size] [-t threads]
-f) mate one of fastq read files
-r) mate two of fastq read files
-R) fasta reference to gap close
-l) maximum fastq read length
-i) insert size. Must be provided if scaffolding directory is not contained in parent.
-t) threads [1]
example: run-bpa.pl -m gapclose -f example.1.fastq -r example.2.fastq -R scaffold/scaffolds.fna -l 50 -t 4
output file: gapclose/gapclosed.fna
Realign:
Usage: run-bpa.pl -m realign <-f forward reads> [-r reverse reads] <-R reference> [-o output prefix] [-t threads] [-m mismatch]
-f) mate one of fastq read files. Multiple files may be included with quotations; order must match mate two
-r) mate two of fastq read files. Multiple files may be included with quotations; order must match mate one
-R) fasta reference file
-o) output file prefix [BPA]
-t) threads [1]
-m) maximum mismatch in Burrows-Wheeler aligner [0.04]
example: run-bpa.pl -m realign -f example.1.fastq -r example.2.fastq -R /gapclose/gapclosed.fna
output file: BPA.bam
ESTscan:
Usage: run-bpa.pl -m estscan <-i input file> <-s scoring matrix> [-o output prefix] [-l minimum result length]
-i) fasta input file
-s) scoring matrix. See documentation for ESTscan for information on how to create these
-o) output prefix
-l) minimum result length [30]
example: run-bpa.pl -m estscan -i /gapclose/gapclosed.fna -s scoring_matrix.smat -l 150
output file: estscan/motif_seqs.fna (coding sequences as nucleotides)
estscan/protein_motifs_raw.faa (coding sequences as amino acids)
BLAST:
Usage: run-bpa.pl -m blast <-i input file> <-R blast formatted reference> [-o output filename] [-t threads] [-s shell] [-e e-value] [-f out format] [-g genetic code]
-i) fasta input file to be used as query
-R) blast formatted reference prefix
-o) output filename [BPA]
-t) threads [1]
-s) shell or blast command; use quotations for command [blastx.bash]
-e) e-value [1e-5]
-f) blast output format [7]
-g) genetic code to be used for translations [1]
example: run-bpa.pl -m blast -i estscan/motif_seqs.fna -R uniprotkb_swissprot -t 4 -e 1e-20 -f 5
output file: BPA (default)
HMMscan:
Usage: run-bpa.pl -m hmmscan <-i input file> <-d HMMER3 database> [-o output prefix] [-t threads]
-i) peptide input file in fasta format
-d) HMMER3 database to use with hmmscan
-o) output file prefix [BPA]
-t) threads [1]
example: run-bpa.pl -m hmmscan -i estscan/protein_motifs_raw.faa -d superfam.hmm -o superfam -t 4
output files: superfam.hits
superfam.hits.tbl
ACKNOWLEDGEMENTS:
We would like to thank John Crow and Robin Kramer for a lot of their initial designs on this framework while at NCGR.