Home
The Trinity Cancer Transcriptome Analysis Toolkit (CTAT) aims to provide tools for leveraging RNA-Seq to gain insights into the biology of cancer transcriptomes. Bioinformatics tool support is provided for mutation detection, fusion transcript identification, de novo transcript assembly of cancer-specific transcripts, lncRNA classification, and foreign transcript detection (viruses, microbes). CTAT is funded by the National Cancer Institute Informatics Technology for Cancer Research (NCI ITCR) program. Software tools and pipelines developed as components of Trinity CTAT are described below with links to the corresponding open source software, documentation, and tutorials.
The CTAT-Mutations Pipeline is a variant calling pipeline focused on detecting mutations from RNA sequencing (RNA-seq) data. It integrates GATK Best Practices along with downstream steps to annotate, filter, and prioritize cancer mutations. This includes leveraging the RADAR and REDIportal databases for identifying likely RNA-editing events, dbSNP for excluding common variants, and COSMIC to highlight known cancer mutations. Finally, CRAVAT is leveraged to annotate and prioritize variants according to likely biological impact and relevance to cancer.
The Trinity CTAT Mutations Pipeline is available at https://github.com/NCIP/ctat-mutations/wiki
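The filtering and prioritization logic described above can be illustrated with a minimal Python sketch. This is not the pipeline's actual code: the `Variant` class, the `prioritize` function, and the three lookup sets are hypothetical stand-ins for lookups against RNA-editing databases, dbSNP, and COSMIC, and the coordinates in the usage note are toy values.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """A called variant (hypothetical minimal representation)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    labels: list = field(default_factory=list)

    @property
    def key(self):
        return (self.chrom, self.pos, self.ref, self.alt)


def prioritize(variants, rna_editing_sites, common_variants, cosmic_variants):
    """Drop likely RNA-editing sites and common variants, flag known cancer mutations.

    rna_editing_sites: set of (chrom, pos) tuples (assumed lookup format)
    common_variants, cosmic_variants: sets of (chrom, pos, ref, alt) tuples
    """
    kept = []
    for v in variants:
        if (v.chrom, v.pos) in rna_editing_sites:
            continue  # likely RNA-editing event, not a DNA-level mutation
        if v.key in common_variants:
            continue  # common polymorphism, unlikely to be cancer-specific
        if v.key in cosmic_variants:
            v.labels.append("COSMIC")  # highlight known cancer mutation
        kept.append(v)
    # Known cancer mutations first; the stable sort keeps input order otherwise.
    kept.sort(key=lambda v: "COSMIC" not in v.labels)
    return kept
```

Given four toy variants where one sits at an editing site, one is a common variant, and one is in the cancer set, `prioritize` returns the two survivors with the flagged one first. The real pipeline performs annotation and scoring well beyond this set-membership check.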
Detection of cancer fusion transcripts in CTAT is a multi-pronged process involving the use of several alternative individual methods for predicting fusions followed by in silico validation and annotation. Software tools developed as part of CTAT include STAR-Fusion as a highly efficient reference genome read-mapping approach, and TrinityFusion to leverage de novo transcriptome assembly for fusion detection.
All predicted fusions can be subject to in silico validation using our CTAT FusionInspector, which re-evaluates the evidence for fusions predicted by any of the above methods, re-scores the predictions, and uses Trinity to de novo reconstruct likely fusion transcript sequences.
FusionInspector ships with STAR-Fusion above as a companion module, but can also be downloaded and installed separately if needed. FusionInspector can be found at https://github.com/FusionInspector/FusionInspector/wiki
STAR-Fusion and TrinityFusion are published in Genome Biology volume 20, Article number: 213 (2019).
An example Terra workspace is available here.
Certain introns are more likely to be relevant to cancer biology, representing cancer-specific isoforms that may result from alternative splicing or stem from intra-gene genomic deletions. For example, EGFR-vIII, EGFR-IVa, and EGFR-IVb are known oncogenic isoforms of the EGFR gene that are often found in glioblastomas and result from intra-gene deletion of exons that are observed as skipped in expressed isoforms. Another well-known example is a deletion of exon 14 in the MET gene, frequently found in lung cancers. Other splicing patterns that are relevant to cancer biology are evident from comparing large transcriptome data sets of tumor and normal tissues.
Our CTAT-Splicing Module integrates with other components of Trinity CTAT and can be run as a post-process to mutation and/or fusion detection to generate cancer splicing reports.
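To make the idea of exons "observed as skipped" concrete, here is a small Python sketch that reports which annotated exons fall entirely inside an observed intron. It is an illustration only, not how the CTAT-Splicing Module is implemented; the function name `skipped_exons` and the toy exon coordinates are assumptions.

```python
def skipped_exons(exons, intron_start, intron_end):
    """Return 1-based indices of exons lying entirely inside an intron.

    exons: list of (start, end) tuples in transcript order, inclusive coordinates
    intron_start, intron_end: first and last intronic base, inclusive
    """
    return [
        index
        for index, (start, end) in enumerate(exons, start=1)
        if intron_start <= start and end <= intron_end
    ]


# Toy gene with four exons (hypothetical coordinates).
toy_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]

# A splice junction joining exon 1 directly to exon 4 spans exons 2 and 3.
print(skipped_exons(toy_exons, 201, 699))  # [2, 3]

# The canonical intron between exons 1 and 2 skips nothing.
print(skipped_exons(toy_exons, 201, 299))  # []
```

An intron that spans one or more annotated exons is the kind of signal that would then be compared against tumor and normal data to judge cancer relevance.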
Analysis of single cell transcriptome data to better understand cancer heterogeneity is a growing focus of Trinity CTAT. We are working to update our existing computational components to better leverage single cell transcriptome data, including identifying mutations and fusion transcripts that contribute to tumor heterogeneity.
Among these efforts, we developed an application, inferCNV, to identify large-scale copy number variations (CNV) evident from single cell expression data. Many more contributions are expected to follow shortly.
We developed DISCASM to assist in the de novo assembly of cancer-specific transcripts. DISCASM restricts de novo transcriptome assembly to those reads that map discordantly or fail to map to the reference genome sequence. Such transcripts are enriched for those that target regions that are restructured or altogether missing from the human reference genome, such as fusion transcripts or those derived from foreign sources (viruses, microbes). DISCASM can also be installed through conda or the Galaxy ToolShed.
The Trinity CTAT DISCASM Pipeline is available at https://github.com/DISCASM/DISCASM/wiki
DISCASM is a module used by our TrinityFusion software, and has been shown to be capable of reconstructing tumor viruses present within cancer RNA-seq data sets.
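The read-selection idea behind DISCASM (keep only reads that map discordantly or fail to map) can be sketched in Python using standard SAM flag bits. This is a simplified illustration, not DISCASM's implementation: the functions `classify` and `select_read_names` are hypothetical, and treating "paired but not flagged as a proper pair" as discordant is an assumption made for the sketch.

```python
# Standard SAM flag bits (from the SAM format specification).
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4


def classify(flag):
    """Classify one alignment record by its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if (flag & PAIRED) and not (flag & PROPER_PAIR):
        return "discordant"  # assumption: not properly paired means discordant
    return "concordant"


def select_read_names(sam_lines):
    """Collect names of reads to pass on to de novo assembly."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if classify(flag) != "concordant":
            names.add(name)
    return names
```

Reads selected this way would then be extracted from the original FASTQ files and assembled, so that assembly effort concentrates on sequence the reference alignment could not explain.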
For identification and classification of long noncoding RNAs, we employ slncky. From a set of reconstructed transcripts, slncky identifies a high-quality set of lncRNA candidates and searches for conserved lncRNAs using a sensitive noncoding alignment method. Trinity genome-free de novo reconstructed transcripts or genome-based transcript reconstructions can be leveraged as input.
For foreign transcript detection, we use KrakenUniq, applied to RNA-Seq reads and Trinity-reconstructed transcripts. Our efforts here are being carried out in collaboration with the group of Steven Salzberg at JHU.
We developed VirusIntegrationFinder (https://github.com/broadinstitute/CTAT-VirusIntegrationFinder) to identify sites of viral integration in the human genome. This is particularly relevant to human papillomavirus (HPV) but can be used to explore other types of viruses too.
For analysing human and SARS-CoV-2 RNA-Seq bulk or single cell data, we developed the rna_seq_sars_cov_2 workflow.
Features:
- Aligns to human and SARS-CoV-2 genomes
- Quantifies gene expression using RSEM (bulk and Smart-Seq2) or cellranger count (10x single cell)
- Jointly recalibrates base quality scores using reads from human and SARS-CoV-2 genomes
- Calls variants from SARS-CoV-2 and human reads
- Assembles SARS-CoV-2 genome using Trinity
- Aligns assembled genome to reference using minimap2
- Computes a UMAP embedding and clusters, finds differentially expressed genes, and provides visualization using cumulus (10x or Smart-Seq2 single cell data)
For Trinity CTAT applications, we aim to enable installation from a variety of sources:
- Software releases from GitHub (minimum for all projects)
- git cloning the 'master' branch from GitHub, which should reflect the most current release.
- Docker, Singularity
- the Terra cloud computing framework
See each of the separate project repos for tool-specific installation options.
Contact us via our google group: https://groups.google.com/forum/#!forum/trinity_ctat_users
Trinity CTAT is funded by the National Cancer Institute Informatics Technology for Cancer Research (NCI ITCR) program.
Our efforts related to building a Trinity Cancer Transcriptome Analysis Toolkit are described in this YouTube screencast: