README

BPA-2.1.0 National Center for Genome Resources (NCGR)

Date: 2014_02_03 
Version: 2.1.0

Contact Information:  
	ctc@ncgr.org
	kschilling@ncgr.org
	sguida@ncgr.org


Description:

Batch Parallel Assembly (BPA) version 2.1.0 consists of eight modes used for de novo transcriptome assembly from short read data (e.g. 2x50 Illumina Hi-Seq data), as well as read realignment and annotation. The modes included in BPA-2.1.0 are: Unitigs, OLC, Scaffold, Gapclose, Realign, ESTScan, BLAST and HMMscan. Each mode in BPA-2.1.0 is run individually, giving the user flexibility to run only the modes of interest. Modes can be run using data generated from the previous step or with external user files.

For additional information specific to the Marine Microbial Eukaryote Transcriptome Sequencing Project, see README_MMETSP.


Modes:

1) Unitigs: sequence reads are assembled into unitigs with ABySS [1] using a pre-defined range of kmers. Unitigs from all kmer assemblies are combined and redundancies removed using CD-HIT-EST [2]. 

2) OLC (overlap layout consensus): the OLC assembler CAP3 [3] is used to identify overlaps between unitigs and assemble larger sequence contigs. 

3) Scaffold: contigs are paired-end scaffolded using ABySS [1]. 

4) Gapclose: sequence read pairing information is used in GapCloser [4, part of SOAP de novo package] to walk in on gaps created during scaffolding.

5) Realign: BWA [5] is used to align sequence reads to a given reference.

6) ESTScan: coding sequences are predicted using ESTScan [6,7] with a user provided scoring matrix. 

7) BLAST: BLASTx [8] is used to annotate ESTs from hits against selected database(s).

8) HMMscan: protein sequences are functionally characterized using HMMER3 [9] against Pfam-A [10], TIGRFAM [11], and SUPERFAMILY [12] databases.


References:

[1] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, I. Birol. 2009.  ABySS: a parallel assembler for short read sequence data. Genome Res., 19: 1117–1123.

[2] Weizhong Li and Adam Godzik. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22:1658-1659.

[3] X. Huang and A. Madan. 1999. CAP3: A DNA sequence assembly program. Genome Res., 9: 868-77.

[4] R. Li, Y. Li, K. Kristiansen, and J. Wang. 2008. SOAP: Short oligonucleotide alignment program. Bioinformatics., 25: 713-714.

[6] C. Iseli, C. V. Jongeneel, and P. Bucher. 1999. ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc Int Conf Intell Syst Mol Biol, 138-48.

[7] C. Lottaz, C. Iseli, C. V. Jongeneel, and P. Bucher. 2003. Modeling sequencing errors by combining Hidden Markov models. Bioinformatics, 19: ii103-ii112. 

[8] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3): 403-410.

[9] Z. Zhang and W. I. Wood. 2003. A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics., 19: 307-308.

[10] Robert D. Finn, John Tate, Jaina Mistry, Penny C. Coggill, Stephen John Sammut, Hans-Rudolf Hotz, Goran Ceric, Kristoffer Forslund, Sean R. Eddy, Erik L. L. Sonnhammer, and Alex Bateman. 2010. The Pfam protein families database. Nucleic Acids Res., 38: D211-222. 

[11] Daniel H. Haft, Brendan J. Loftus, Delwood L. Richardson, Fan Yang, Jonathan A. Eisen, Ian T. Paulsen, and Owen White. 2001. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucl. Acids Res., 29(1): 41-43.

[12] Julian Gough, Kevin Karplus, Richard Hughey, and Cyrus Chothia. 2001. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology, 313(4): 903–919.


Installation: 

The following dependencies are required with the versions listed or newer. Be sure to export these, as well as the rbpa directory from GitHub, into your PATH before running BPA-2.1.0. 

ABySS 1.3.3 (http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases)

CD-HIT V4.5.5 (https://code.google.com/p/cdhit/downloads/list)

CAP3 2007/10/15 (http://seq.cs.iastate.edu/cap3.html)

GapCloser for SOAPdenovo v1.10 (http://soap.genomics.org.cn/index.html#down2)

ESTScan 3.0.3 (http://sourceforge.net/projects/estscan/)

ncbi_blast+ 2.2.26 (ftp://ftp.ncbi.nlm.nih.gov/blast/db/)

HMMER 3.0 (http://hmmer.janelia.org/software/archive)


Usage: 

Wrapper:

	run-bpa.pl is the wrapper script for the BPA pipeline. ALWAYS USE FULL PATHS IN COMMANDS.

        usage: run-bpa.pl <-m mode>

	mode: Unitigs, OLC, Scaffold, Gapclose, Realign, BLAST, ESTscan, HMMscan

	run-bpa.pl -m mode -h will return the usage for the given mode.


Unitigs:

	Unitigs mode must be run on an MPI enabled SGE cluster. Data cannot be gzipped.
	
	Usage: run-bpa.pl -m unitigs <-i "fastq(s)"> <-c> <-k> [-v]

	-i) fastq input file(s). If multiple fastq files are provided please utilize quotations (can use wildcard).
	-c) number of cores to use on SGE cluster. This implies PE ORTE.
	-k) kmer list; text file with one line per value.
	-v) minimum kmer coverage [5].

	example: run-bpa.pl -m unitigs -i "example.1.fastq example.2.fastq" -c 128 –k kmer.txt 
	
	output file(s): abyss/*/abyssrun-unitigs.fa


OLC:

	Usage: run-bpa.pl -m OLC  <-i "fasta(s)"> [-t threads] [-u "cap3 options"] [-e <= 1.0]

	-i) fasta input file(s). If multiple fasta files are provided please utilize quotations (can use wildcard).
	-t) threads to be used in CD-HIT-EST.
	-u) CAP3 command [-o100 -h50]. Use quotations for multiple options.
	-e) percent identity for CD-HIT-EST. If not provided, reduction will not be performed.

	example: run-bpa.pl -m OLC -i "/*/abyssrun-3.fa" -t 0 -u "-o 100 -h 20" -e .98

	output file: overlap/cap3_final.fna
 

Scaffold:

	Usage: run-bpa.pl -m scaffold <-r "read file(s)"> <-R reference.fa> <-k kmer> [-i maximum mismatch bwa] [-p min pairs] [-o output prefix] [-s seed length] [-t threads]

	-r) fastq read file(s). Expected orientation is forward reverse.
	-R) fasta scaffold reference. Sequences will be renamed sequentially.
	-k) kmer value to use in ABySS. Suggest using read length or value close to read length.
	-i) maximum mismatch for Burrows-Wheeler aligner [0.04]
	-p) minimum pairing evidence to use for scaffolds [5]
	-o) output file prefix [bpa_out] 
	-s) minimum scaffold seed length [200]
	-t) threads [2]

	example: run-bpa.pl -m scaffold -r "example.1.fastq example.2.fastq" -R overlap/cap3_final.fna -k 35 -p 3 -s 302 -t 3

	output file: scaffold/scaffolds.fna


Gapclose:

	Usage: run-bpa.pl -m gapclose <-f forward reads> <-r reverse reads> <-R fasta file to gap close> <-l maximum read length> [-i insert size] [-t threads]

	-f) mate one of fastq read files
	-r) mate two of fastq read files
	-R) fasta reference to gap close
	-l) maximum fastq read length
	-i) insert size. Must be provided if scaffolding directory is not contained in parent.
	-t) threads [1]

	example: run-bpa.pl -m gapclose -f example.1.fastq -r example.2.fastq -R scaffold/scaffolds.fna -l 50 -t 4

	output file: gapclose/gapclosed.fna


Realign:

	Usage: run-bpa.pl -m realign <-f forward reads> [-r reverse reads] <-R reference> [-o output prefix] [-t threads] [-m mismatch]

	-f) mate one of fastq read files. Multiple files may be included with quotations; order must match mate two
	-r) mate two of fastq read files. Multiple files may be included with quotations; order must match mate one
	-R) fasta reference file
	-o) output file prefix [BPA]
	-t) threads [1]
	-m) maximum mismatch in Burrows-Wheeler aligner [0.04]

	example: run-bpa.pl -m realign -f example.1.fastq -r example.2.fastq -R /gapclose/gapclosed.fna

	output file: BPA.bam


ESTscan:

        Usage: run-bpa.pl -m estscan <-i input file> <-s scoring matrix> [-o output prefix] [-l minimum result length]

        -i) fasta input file
        -s) scoring matrix. See documentation for ESTscan for information on how to create these
        -o) output prefix
        -l) minimum result length [30]

	example: run-bpa.pl -m estscan -i /gapclose/gapclosed.fna -s scoring_matrix.smat -l 150

	output file: estscan/motif_seqs.fna (coding sequences as nucleotides)
		     estscan/protein_motifs_raw.faa (coding sequences as amino acids)


BLAST:

	Usage: run-bpa.pl -m blast <-i input file> <-R blast formatted reference> [-o output filename] [-t threads] [-s shell] [-e e-value] [-f out format] [-g genetic code]

	-i) fasta input file to be used as query
	-R) blast formatted reference prefix
	-o) output filename [BPA]
	-t) threads [1]
	-s) shell or blast command; use quotations for command [blastx.bash]
	-e) e-value [1e-5]
	-f) blast output format [7]
	-g) genetic code to be used for translations [1]

	example: run-bpa.pl -m blast -i estscan/motif_seqs.fna -R uniprotkb_swissprot -t 4 -e 1e-20 -f 5 

	output file: BPA (default)


HMMscan:

	Usage: run-bpa.pl -m hmmscan <-i input file> <-d HMMER3 database> [-o output prefix] [-t threads]

	-i) peptide input file in fasta format
	-d) HMMER3 database to use with hmmscan
	-o) output file prefix [BPA]
	-t) threads [1]

	example: run-bpa.pl -m hmmscan -i estscan/protein_motifs_raw.faa -d superfam.hmm -o superfam -t 4

	output files: superfam.hits
		      superfam.hits.tbl


ACKNOWLEDGEMENTS:

We would like to thank John Crow and Robin Kramer for a lot of their initial designs on this framework while at NCGR.