Brian Haas edited this page Jul 27, 2020 · 174 revisions

Announcements: Now provides simplified integration for existing STAR-Fusion users.

Overview

The CTAT-Mutations pipeline is a variant calling pipeline focused on detecting variants from RNA sequencing (RNA-seq) data. It integrates GATK Best Practices with downstream steps that annotate and filter variants and that prioritize variants that may be relevant to cancer biology, such as likely somatic mutations. Variant annotation draws on the RADAR and REDIportal databases to identify likely RNA-editing events, dbSNP and gnomAD to annotate common variants, and COSMIC to highlight known cancer mutations. Finally, CRAVAT is used to annotate and prioritize variants according to likely biological impact and relevance to cancer.

The CTAT-Mutations pipeline is one component of the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), complementing other functionality that leverages RNA-seq data for characterizing cancer transcriptomes, including identification of fusion transcripts and detection of copy number variations from tumor single-cell transcriptomes, among other capabilities.

The CTAT-Mutations pipeline aims to make variant discovery from RNA-seq data as easy as possible, requiring only the RNA-seq reads as input and generating summary reports and visualizations to help guide you to the most meaningful findings.

The following flowchart is a simplified visualization of the steps performed by the CTAT-Mutations pipeline.

Installing CTAT-Mutations

The CTAT-Mutations pipeline requires the CTAT-Mutations software and companion genomic data resources. See our instructions for installing CTAT-Mutations for details.

Running the CTAT-Mutations Pipeline

Once the CTAT-Mutations pipeline has been successfully installed along with the required CTAT Genome Library, it can be run with the following command, which requires only the input reads and an output directory name.

   python /path/to/ctat_mutations \
   --left    : Path to the left (i.e., /1) paired-end RNA-Seq FASTQ file.
   --right   : Path to the right (i.e., /2) paired-end RNA-Seq FASTQ file.
   --out_dir : Name of the directory in which CTAT-Mutations outputs will
               be placed.

As inputs, CTAT-Mutations requires RNA-Seq reads in the form of left and right paired-end FASTQ files, along with the name of an output directory where the pipeline products will be stored.
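For scripted or batch runs, the same invocation can be assembled in Python. The following is a minimal sketch, not part of CTAT-Mutations itself: the helper name `build_ctat_mutations_command` is hypothetical, and it uses only the three options documented above (`--left`, `--right`, `--out_dir`). It checks that both FASTQ files exist before returning the argument list.

```python
import shlex
from pathlib import Path


def build_ctat_mutations_command(script, left_fq, right_fq, out_dir):
    """Return the argument list for one CTAT-Mutations run.

    Hypothetical helper: only the documented options --left, --right,
    and --out_dir are used. Raises FileNotFoundError if a FASTQ file
    is missing, so the problem surfaces before the pipeline starts.
    """
    for fastq in (left_fq, right_fq):
        if not Path(fastq).is_file():
            raise FileNotFoundError(f"FASTQ file not found: {fastq}")
    return [
        "python", str(script),
        "--left", str(left_fq),
        "--right", str(right_fq),
        "--out_dir", str(out_dir),
    ]


def format_command(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.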

Example

A small sample data set is available for testing purposes. The pipeline can be run on the sample data set with the following command:

   python /path/to/ctat_mutations \
   --left reads_1.fq \
   --right reads_2.fq \
   --out_dir varcalling.outdir

See our more detailed walk-through tutorial leveraging these data.

Variant filtering and boosting methods

By default, GATK hard filters are applied to capture the most confident variant sites. Optionally, boosting methods can be applied to further improve prediction accuracy. We currently support an implementation of the RVBoost method (which we call RVBLR, for RVBoost-Like R), enabled with the '--boosting_method RVBLR' parameter. Additional machine learning approaches are incorporated for reducing false positive calls, including gradient boosting, stochastic gradient boosting, adaptive boosting, and random forests. Hard filtering and boosting are mutually exclusive within a single run, but you can first run with the default hard filtering and then run again with the boosting option; the boosting run will reuse preexisting outputs from the earlier execution where possible, speeding up the process. See the Output section below for more details.
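The two-pass workflow described above (hard filtering first, then boosting) can be sketched as a pair of argument lists. This is an illustration under stated assumptions: the helper name `plan_filtering_runs` is hypothetical, and it assumes both runs must share the same `--out_dir` for the second run to find and reuse the earlier outputs. Only `--boosting_method RVBLR` is taken from the text above.

```python
def plan_filtering_runs(base_cmd, boosting_method="RVBLR"):
    """Return [hard_filter_cmd, boosting_cmd] for a two-pass run.

    base_cmd is the default (hard-filtering) invocation as an argument
    list. The boosting run appends --boosting_method and otherwise keeps
    every argument unchanged, including --out_dir (assumed to be what
    lets the second pass reuse outputs from the first).
    """
    if "--out_dir" not in base_cmd:
        raise ValueError("base_cmd must include --out_dir")
    hard_filter_cmd = list(base_cmd)  # copy so the caller's list is untouched
    boosting_cmd = list(base_cmd) + ["--boosting_method", boosting_method]
    return [hard_filter_cmd, boosting_cmd]


base = [
    "python", "/path/to/ctat_mutations",
    "--left", "reads_1.fq",
    "--right", "reads_2.fq",
    "--out_dir", "varcalling.outdir",
]
for cmd in plan_filtering_runs(base):
    print(" ".join(cmd))
```

Running the two commands in order, for example with `subprocess.run(cmd, check=True)`, stops the sequence if the hard-filtering pass fails, so the boosting pass never starts from incomplete outputs.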

Output

The output from the CTAT-Mutations pipeline includes variant VCF files, tab-delimited summary reports, and interactive visualizations.

Variant Reports

The primary outputs include variants.HC_init.wAnnot.vcf.gz, which contains the fully annotated HaplotypeCaller variant calls. If boosting or hard cutoffs are applied for filtering, additional corresponding VCFs are provided containing the unfiltered variant calls. A cancer.vcf file and a corresponding, simpler summary file, cancer.tab, contain the set of prioritized cancer-relevant variants detected in the sample. The cancer.vcf (VCF version 4.0) records the genetic variations, their locations, and additional annotation information. The cancer.tab is a tab-delimited file that contains the same variant information in a user-friendly format. Additional outputs are generated by the different stages of the CTAT-Mutations pipeline, and some of these are likely to be of interest as well, for example for exploring RNA-editing or common variants. Documentation is provided for all such output files and formats.
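As a quick sanity check on any of the VCF outputs, the records can be tallied by their FILTER value using only the Python standard library. This is a minimal sketch with hypothetical function names; it assumes nothing beyond the standard VCF layout (header lines start with `#`, fields are tab-separated, and FILTER is the seventh column) and makes no assumption about which annotations CTAT-Mutations writes into INFO.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Open a VCF as text, whether plain (.vcf) or gzip-compressed (.vcf.gz)."""
    path = str(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def tally_filter_column(vcf_path):
    """Count variant records per FILTER value (7th VCF column)."""
    counts = Counter()
    with open_vcf(vcf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip meta-information, the column header, blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # skip malformed records lacking the 8 fixed columns
            counts[fields[6]] += 1
    return counts
```

For example, `tally_filter_column("varcalling.outdir/variants.HC_init.wAnnot.vcf.gz")` would return a `Counter` keyed by FILTER value; the path shown assumes the output file sits directly inside the output directory, which may differ in your installation.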

Variant Visualization

You will also find an HTML page named "igvjs_viewer.html" (based on igv-reports), which allows dynamic navigation of the identified cancer variants and the read evidence supporting their identification. This file can simply be opened in your web browser. An example view is shown below.

[Figure: mut_view2, example view of the igvjs_viewer.html report]

More info for exploring the variant visualization framework is available here.

The igv-reports-based HTML report derives from a collaboration with James Robinson.

CTAT-Mutations Variant detection accuracy

We've assessed the performance of the CTAT-Mutations pipeline using a variety of methods, including the Genome in a Bottle reference data and by applying our pipeline to cancer data sets having matched RNA-seq and exome data.

To examine our performance assessment, please visit our Performance Assessment Report.

User support

Contact us through our Google group: https://groups.google.com/forum/#!forum/trinity_ctat_users

We aim to be responsive with user support. You will generally receive a response within hours, not days or weeks.

Funding

CTAT-Mutations is supported as part of the Trinity CTAT Project, funded by the National Cancer Institute Informatics Technology for Cancer Research.