
HowToSetUpJAFFA

nadiadavidson edited this page Nov 7, 2020 · 27 revisions

In this wiki we describe how to install JAFFA and give some basic instructions for running it. JAFFA is designed to be run from the bash command line on Linux. An understanding of bash (and R) is useful for following what the pipeline is doing, but isn't essential.

Installing

  1. Download the JAFFA package and untar it: tar -zxvf JAFFA-version-X.XX.tar.gz (replacing X.XX with the version number)
  2. Download the JAFFA reference files and untar inside the JAFFA-version-X.XX directory: tar -zxvf JAFFA_REFERENCE_FILES......tar.gz
  3. JAFFA version 1 only (this step is not necessary for version 2, as the genome is included in the reference files). Download the human genome, version hg38, from UCSC and unzip it (gunzip hg38.fa.gz). By default, JAFFA expects this file to be in the root of the JAFFA code directory. You can copy it there, create a symbolic link to it (ln -s <path_to_hg38> <path_to_JAFFA_directory>), provide the path to your hg38.fa file in the pipeline file JAFFA_stages.groovy, or pass it when you run the JAFFA command. Note that JAFFA expects the UCSC version of the genome; other versions (e.g. Ensembl) aren't compatible with JAFFA's reference files. This is also true if you are using hg19 or mm10.
  4. Before running JAFFA, several other programs must be installed. To make this easier, we provide a script that automates the installation using wget. Run it in JAFFA's directory. When it has finished, check that every path in the file tools.groovy is filled in.
./install_linux64.sh
  5. If you don't already have it, you will need to install R. Note that the R package IRanges must be installed.
  6. If needed, configure the JAFFA pipeline options for your data. This is often not necessary, as JAFFA can work out of the box with default values. The defaults can be changed either by editing the JAFFA_stages.groovy file or by passing the parameters to bpipe when you run JAFFA.
     • readLayout - change to "single" if you have single-end reads; otherwise paired-end is assumed.
     • genomeFasta - the path to the human genome. If you leave this unchanged, it defaults to the directory of the JAFFA package and uses hg38.
     • fastqInputFormat - tells bpipe how to split the input into samples and groups of read pairs. The default should work if your reads are named like SampleA_1.fastq.gz SampleA_2.fastq.gz SampleB_1.fastq.gz SampleB_2.fastq.gz etc. JAFFA will create one directory for each sample. If this does not happen in the way you expect, you might need to adjust this variable. See the end of this bpipe doc page for more information. You may also need to change this parameter if your reads have the fq extension instead of fastq.
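As an illustration of passing parameters to bpipe at run time, the sketch below only prints a command line rather than running it. It assumes bpipe's -p name=value option for overriding pipeline variables; the JAFFA_DIR location, the genome path, and the helper name jaffa_direct_cmd are hypothetical and should be replaced with your own.

```shell
#!/usr/bin/env bash
# Hypothetical install location; replace with your own JAFFA directory.
JAFFA_DIR="${JAFFA_DIR:-/opt/JAFFA}"

# Print (do not run) a bpipe command that overrides two JAFFA defaults.
# $1 = readLayout value, $2 = path to the genome fasta, rest = input files.
# Assumes bpipe accepts "-p name=value" to set pipeline parameters.
jaffa_direct_cmd() {
    local layout="$1" genome="$2"
    shift 2
    printf '%s' "$JAFFA_DIR/tools/bin/bpipe run -p readLayout=$layout -p genomeFasta=$genome $JAFFA_DIR/JAFFA_direct.groovy"
    printf ' %s' "$@"
    printf '\n'
}

# Example: single-end reads with a genome stored outside the JAFFA directory
# (both paths are made up for illustration).
jaffa_direct_cmd single /data/ref/hg38.fa reads/SampleA.fastq.gz
```

Printing the command first makes it easy to check the overrides before committing to a long run; once it looks right, the printed line can be pasted into the shell.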

Input Type

The input to JAFFA should be either gzipped reads, i.e. files with an ending like ".fastq.gz", or an uncompressed fasta file of contigs with an ending like ".fasta". JAFFA assumes there is one file (single-end) or a pair of files (paired-end) per sample.
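A quick way to check a set of input files against these two input types before starting a run might look like the following. This is a minimal sketch, not part of JAFFA: the function name classify_input is hypothetical, and it recognises only the two endings named above, so files with other endings (for example .fq.gz) are reported as unrecognised.

```shell
#!/usr/bin/env bash
# Classify a file name by the two input types described above.
# Only the literal endings ".fastq.gz" and ".fasta" are recognised here.
classify_input() {
    case "$1" in
        *.fastq.gz) echo "reads" ;;        # gzipped reads
        *.fasta)    echo "contigs" ;;      # uncompressed fasta of contigs
        *)          echo "unrecognised" ;; # anything else, e.g. .fq.gz or .fasta.gz
    esac
}

# Example with made-up file names.
for f in SampleA_1.fastq.gz SampleA_2.fastq.gz contigs.fasta reads.fq.gz; do
    printf '%s\t%s\n' "$f" "$(classify_input "$f")"
done
```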

Running

Create and change into the directory where you intend the output files of JAFFA to be placed. You then have a choice of four JAFFA running modes: Direct, Hybrid, Assembly, and Long. Which mode to use will depend on your read length and error rate.
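Setting up the output directory can be sketched as below. The directory name jaffa_output and the helper name enter_output_dir are placeholders chosen for illustration, not something JAFFA requires.

```shell
#!/usr/bin/env bash
# Create the output directory if needed and change into it.
# $1 = output directory (defaults to the hypothetical name "jaffa_output").
enter_output_dir() {
    local out_dir="${1:-jaffa_output}"
    mkdir -p "$out_dir" && cd "$out_dir" || return 1
    echo "JAFFA output will be written under: $(pwd)"
}
```

Calling enter_output_dir my_run from an interactive shell (or sourcing a script that calls it) leaves you in the new directory, ready to launch bpipe from there.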

When to use which mode?

  • For low-error-rate sequencing with reads of 100bp or longer (the most common case), we recommend the Direct mode, JAFFA_direct.groovy. This includes Illumina sequencing as well as polished or assembled long-read data.
  • For high-error-rate long reads, use the Long mode, JAFFAL.groovy. This includes ONT data and raw PacBio data.
  • For low-error-rate sequencing with reads of 70-95bp, the Hybrid mode, JAFFA_hybrid.groovy, is the most sensitive. However, because it involves assembly, it requires a lot of memory and CPU time. If computational resources are a constraint, we recommend using the Direct mode instead.
  • For low-error-rate short reads of <70bp, you can use the Assembly mode, JAFFA_assembly.groovy. Assembly may be useful if you are interested in the full transcript sequence of fusion genes, as these are reconstructed in this mode.
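The guidance above can be condensed into a small helper that suggests a pipeline file from the read length and error rate. This is only a sketch of the recommendations, with two assumptions of its own: any "high" error rate input is sent to the Long mode regardless of length, and read lengths of 96-99bp, which the list does not cover, are treated like the 70-95bp case. The function name choose_jaffa_mode is hypothetical.

```shell
#!/usr/bin/env bash
# Suggest a JAFFA pipeline file.
# $1 = read length in bp (integer), $2 = error rate: "low" or "high".
choose_jaffa_mode() {
    local read_length="$1" error_rate="$2"
    if [ "$error_rate" = "high" ]; then
        echo "JAFFAL.groovy"            # noisy long reads (ONT, raw PacBio)
    elif [ "$read_length" -ge 100 ]; then
        echo "JAFFA_direct.groovy"      # low error rate, 100bp or longer
    elif [ "$read_length" -ge 70 ]; then
        echo "JAFFA_hybrid.groovy"      # 70-95bp (96-99bp by assumption)
    else
        echo "JAFFA_assembly.groovy"    # low error rate, shorter than 70bp
    fi
}

choose_jaffa_mode 150 low
choose_jaffa_mode 75 low
```

The helper ignores the resource trade-off mentioned above: when memory or CPU time is limited, the Direct mode may still be the better choice for 70-95bp reads.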

Direct

JAFFA will map reads to the known reference transcriptome and extract reads which do not map. It will then search for fusions from amongst the unmapped reads.

<path to JAFFA>/tools/bin/bpipe run <path to JAFFA>/JAFFA_direct.groovy <path_to_directory with fastq files>/*fastq.gz

In this mode, you can also search for fusions in pre-assembled transcriptomes by providing a fasta file as input. In this case, the step that filters for unmapped sequences is skipped.

<path to JAFFA>/tools/bin/bpipe run <path to JAFFA>/JAFFA_direct.groovy <path_to_directory with fasta file>/*.fasta

Long (JAFFAL)

For noisy long reads such as ONT or raw PacBio data, use JAFFAL, which is similar to the Direct pipeline in concept, but uses the accurate ONT aligner minimap2 to maximise sensitivity for fusion detection.

<path to JAFFA>/tools/bin/bpipe run <path to JAFFA>/JAFFAL.groovy <path_to_directory with fastq files>/*fastq.gz

Unzipped .fasta files may also be provided to the pipeline.

Assembly

JAFFA will call Velvet and Oases to assemble the reads. It will then search for fusions from amongst the assembled contigs.

<path to JAFFA>/tools/bin/bpipe run <path to JAFFA>/JAFFA_assembly.groovy <path_to_directory with fastq files>/*.gz

Hybrid

This is a combination of the Direct and Assembly modes. First JAFFA will call Velvet and Oases to assemble the reads. It will then search for fusions from amongst the assembled contigs. Next it will map reads to both the known reference transcriptome and the assembled transcriptome. It will then search for fusions from amongst the unmapped reads.

<path to JAFFA>/tools/bin/bpipe run <path to JAFFA>/JAFFA_hybrid.groovy <path_to_directory with fastq files>/*.gz