lassemaretty edited this page Feb 2, 2015 · 3 revisions

Bayesembler command-line

Required arguments

-b, --bam-file TopHat2 BAM file containing the paired-end reads mapped to the reference genome

Optional arguments

-o, --output-prefix prefix to be used on all output filenames (Note: to use output directory <dir> without any prefix on the output files, use <dir/>)
-p, --num-threads [1] number of threads used for assembly (actual number of threads: <num-threads> + two I/O threads)
-s, --strand-specific [unstranded] data is strand-specific. Use "first" to indicate mate orientation as in the dUTP protocol or "second" if opposite
-c, --confidence-threshold [0.5] exclude candidates with a confidence below <confidence-threshold>
-f, --count-threshold [12] exclude candidates with an expected fragment count below <count-threshold>
-m, --no-pre-mRNA do not include pre-mRNA in the set of candidates
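
As a rough illustration of how the required and optional arguments combine, the following Python sketch assembles an argument list from the options listed above. The executable name `bayesembler`, the helper names, and the file names are assumptions made for this example; only the flags themselves come from this page. Because the page does not say whether "unstranded" is accepted as an explicit value, the sketch passes -s only for "first" or "second".

```python
import shlex


def build_command(bam_file, output_prefix=None, num_threads=1,
                  strand_specific=None, confidence_threshold=None,
                  count_threshold=None, no_pre_mrna=False,
                  executable="bayesembler"):  # executable name is an assumption
    """Assemble an argument list from the options documented on this page."""
    argv = [executable, "-b", bam_file]
    if output_prefix is not None:
        argv += ["-o", output_prefix]
    argv += ["-p", str(num_threads)]
    if strand_specific is not None:
        # Only the two documented values are passed explicitly.
        if strand_specific not in ("first", "second"):
            raise ValueError("strand_specific must be 'first' or 'second'")
        argv += ["-s", strand_specific]
    if confidence_threshold is not None:
        argv += ["-c", str(confidence_threshold)]
    if count_threshold is not None:
        argv += ["-f", str(count_threshold)]
    if no_pre_mrna:
        argv.append("-m")
    return argv


def total_threads(num_threads):
    """Actual thread count: assembly threads plus two I/O threads."""
    return num_threads + 2


# Hypothetical inputs: a TopHat2 BAM file and an output directory "results/".
argv = build_command("accepted_hits.bam", output_prefix="results/",
                     num_threads=4, strand_specific="first")
print(shlex.join(argv))
print("threads in use:", total_threads(4))
```

The trailing slash in `results/` follows the note on --output-prefix: it selects an output directory without adding a filename prefix.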

Advanced arguments

--seed [unix time] seed for pseudo-random number generator (produces reproducible results only when running with one thread)
--output-mode [assembly] output a GTF file of the assembly. Use "full" to output additional information on the assembly, candidates and fragment length distribution
--library-size [0] the total number of sequenced paired-end reads used for FPKM normalisation. Use value of 0 (default) to normalise using the number of mapped paired-end reads
--keep-temp-files keep filtered bam, instance and processsam log-file
--max-candidate-number [100] maximum number of candidates per splice-graph (used for graph pruning)
--dirichlet-parameter [1] abundance prior parameter (i.e. symmetric Dirichlet concentration parameter gamma)
--frag-mean disable internal fragment length mean estimation and use <frag-mean>
--frag-sd disable internal fragment length sd estimation and use <frag-sd>
--gibbs-iteration-scaling [60] scaling factor used to calculate number of Gibbs iterations (burn-in & sample-size) using: burn-in = 1000 + <gibbs-iteration-scaling> * number_of_candidates, sample-size = 10 * burn-in
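
To get a feel for how --gibbs-iteration-scaling affects run length, the small Python sketch below evaluates the two formulas for a given number of candidates. It assumes the scaling factor is the coefficient on number_of_candidates in the burn-in formula; the function name is made up for this example and is not part of the Bayesembler.

```python
def gibbs_iterations(number_of_candidates, scaling=60):
    """Return (burn_in, sample_size) for one splice-graph.

    Assumes: burn-in = 1000 + scaling * number_of_candidates
             sample-size = 10 * burn-in
    """
    burn_in = 1000 + scaling * number_of_candidates
    sample_size = 10 * burn_in
    return burn_in, sample_size


# With the default scaling of 60 and 5 candidates:
burn_in, sample_size = gibbs_iterations(5)
print(burn_in, sample_size)  # 1300 13000
```

Under this reading, a splice-graph at the default --max-candidate-number of 100 would use 7000 burn-in iterations and 70000 samples.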

Output

The Bayesembler outputs the assembly in Gene Transfer Format (GTF) as assembly.gtf. The individual elements of the attribute list in the GTF file are defined as follows:

  • gene_id: Splice-graph id.
  • transcript_id: Candidate id (unique across splice-graphs).
  • transcript_confidence: The fraction of Gibbs iterations in which the transcript was expressed.
  • FPKM: Mean abundance estimate normalized to effective transcript length and library size.
  • FPKM_sd: Standard deviation of the normalized abundance estimate.
  • expected_count: Expected paired-end read count.
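
The attributes above can be read back for downstream filtering. The Python sketch below parses the attribute column of a single GTF line and checks it against the default thresholds of -c and -f. The example line is made up for illustration (it is not real Bayesembler output), and the parser assumes the usual GTF layout of nine tab-separated columns with `key "value";` pairs in the last one.

```python
def parse_attributes(attribute_column):
    """Parse 'key "value"; key "value";' pairs into a dict of strings."""
    attrs = {}
    for item in attribute_column.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def parse_gtf_line(line):
    """Return the attribute dict of one GTF line (nine tab-separated columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    return parse_attributes(fields[8])


# Made-up example line; every value here is hypothetical.
line = ('chr1\tBayesembler\ttranscript\t1000\t5000\t.\t+\t.\t'
        'gene_id "g1"; transcript_id "t1"; transcript_confidence "0.93"; '
        'FPKM "12.5"; FPKM_sd "1.2"; expected_count "240.0";')

attrs = parse_gtf_line(line)
passes = (float(attrs["transcript_confidence"]) >= 0.5
          and float(attrs["expected_count"]) >= 12)
print(attrs["transcript_id"], passes)
```

Stricter cut-offs can be applied the same way after assembly without rerunning the Bayesembler.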