Skip to content

Latest commit

 

History

History
343 lines (262 loc) · 28 KB

paper.md

File metadata and controls

343 lines (262 loc) · 28 KB
title title_short tags authors affiliations date bibliography event group authors_short
SnpReportR: A Tool for Clinical Reporting of RNAseq Expression and Variants
SnpReportR
Expression
Precision Medicine
Human Genomics
RNA-Seq
Clinical Report
Genetic testing
Reporting research results
Variants
Genetic test reports
name email orcid affiliation
Ahmad Al Khleifat
0000-0002-7406-9831
1
name email orcid affiliation
Jenny Leopoldina Smith
0000-0003-0402-2779
2
name email orcid affiliation
Brandon M. Blobner
0000-0003-0734-1492
3
name email orcid affiliation
Sierra D. Miller
000-0002-3200-428X
4
name email orcid affiliation
Kymberleigh Pagel
0000-0001-8544-925
5
name email affiliation
Annie Nadkarni
6
name email orcid affiliation
Melanie Gainey
0000-0002-4782-9647
7
name email affiliation
Patrick Campbell
7
name email orcid affiliation
Olaitan I. Awe
0000-0002-4257-3611
8
name email orcid
Manuel Belmadani
0000-0002-5820-5979
name email orcid affiliation
Alan M. Cleary
0000-0002-6567-5346
9
name email orcid affiliation
Nicholas P. Cooley
0000-0002-6029-304X
10
name email orcid affiliation
Shamika Dhuri
0000-0003-3931-5694
11
name email orcid affiliation
Virginie Grosboillot
0000-0002-8249-7182
12
name email affiliation
Brian Haas
13
name email orcid affiliation
Samuel Hokin
0000-0002-9746-2783
9
name email orcid affiliation
Ekaterina Orlova
0000-0002-7129-5796
14
name email orcid affiliation
Meghana Pagadala
0000-0002-7591-6035
15
name email orcid affiliation
Stephen Price
0000-0001-9072-7518
16
name email orcid affiliation
Adelaide Rhodes
0000-0002-1714-1972
17
name email
Janice Kyla Nascimento Smith
name email orcid affiliation
Chaitanya Srinivasan
0000-0002-1487-541X
6
name email affiliation
Barry Zorman
18
name email orcid affiliation
Ben Busby
0000-0001-5267-4988
19
name index
Institute of Psychiatry, Psychology & Neuroscience, King's College London. Maurice Wohl Clinical Neuroscience Institute, 5 Cutcombe Road, Denmark Hill, London, SE5 9RX
1
name index
Clinical Research Division, Fred Hutchinson Cancer Research Center, 1100 Fairview, Ave., Seattle, WA, 98109
2
name index
Division of Gastroenterology, Hepatology and Nutrition, Department of Medicine, University of Pittsburgh, 4200 Fifth Avenue, Pittsburgh, PA 15260
3
name index
National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD 20899-8970
4
name index
Institute for Computational Medicine, Johns Hopkins University, 220 Hackerman Hall, 3400 N. Charles St. Baltimore, MD 21218
5
name index
Carnegie Mellon University, School of Computer Science, 5000 Forbes Ave, Pittsburgh, PA 15213
6
name index
Carnegie Mellon University Libraries, 4909 Frew Street, Pittsburgh, PA, 15213
7
name index
University of Ibadan
8
name index
National Center for Genome Resources, 2935 Rodeo Park Dr E, Santa Fe, NM 87505
9
name index
Department of Biomedical Informatics, University of Pittsburgh
10
name index
Carnegie Mellon University, Mellon College of Science, 5000 Forbes Ave, Pittsburgh, PA 15213
11
name index
Laboratory of Food Microbiology, Institute for Food, Nutrition and Health, ETH Zurich, Zurich, Switzerland
12
name index
Broad Institute 415 Main Street, Cambridge, MA 02142
13
name index
Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, 130 De Soto Street, Pittsburgh, PA, 15261
14
name index
University of California San Diego, 9500 Gilman Dr, La Jolla, CA 92093
15
name index
Carnegie Mellon University, 5000 Main Street, Pittsburgh, PA 15213
16
name index
National Center for Biotechnology Information, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
17
name index
Baylor College of Medicine, Houston, TX 77030
18
name index
DNAnexus, 1975 El Camino, Mountain View, CA
19
7 June 2021
paper.bib
Fukuoka2019
Logic programming group
Ahmad Al Khleifat & Jenny Leopoldina Smith \emph{et al.}

Abstract

With the increasing availability of next-generation sequencing (NGS), patients and non-specialist health care professionals are obtaining their genomic information without sufficient bioinformatics skills to analyze and interpret the data. In January 2021, four teams of scientists, clinicians, and developers from around the world worked collaboratively in a virtual hackathon to create a framework for the automated analysis and interpretation of RNA sequencing data in the clinic. Here, we present SnpReportR: A Tool for Clinical Reporting of RNAseq Expression and Variants aimed for use by clinicians and others without in-depth knowledge of genetics.

Introduction

In clinical practice and biomedical research, next-generation sequencing (NGS) and subsequent identification of genomic variants including single nucleotide variations or polymorphisms, small insertions or deletions, and structural variants is an established method used to investigate the genetic causes and associations of disease. While whole genome and whole exome sequencing are relatively expensive, RNA-sequencing (RNA-seq) is a highly cost-effective and versatile method that assays both gene sequence and expression, thus yielding both genetic and functional signatures in a single technique [@hong2020, @horak2016]. In cancer diagnostics, specifically, RNA-Seq provides a means of detecting nucleotide-level resolution of mutations within transcribed regions of oncogenes and tumor suppressor genes for expressed loci.

While RNA-seq is a powerful tool for characterizing tumors, identifying and interpreting the variants that are relevant to the diagnosis or understanding of disease within a patient remains challenging. The biopsy tissue sampled for the analysis can be composed of a mix of tumor and normal cells that leads to averaged gene expression data from a heterogeneous mixture of cells [@yoshihara2013; @fan2020]. Another practical challenge for using gene expression data is the need for sophisticated tools that may require significant bioinformatics expertise and high-performance computing to carry out the data processing and analysis [@fernald2011]. For those outside the immediate field of genetics such as researchers, hospital staff, the interpretation of findings is particularly challenging. Therefore, it is important to have tools that take into account these gaps in specialized bioinformatics knowledge.

We, therefore, developed SnpReportR, a clinical genetic reporting framework for the automated analysis and interpretation of RNA sequencing data. The software is designed for use by clinicians and other users without an in-depth background in genetics.

Methods

In a coordinated codeathon, teams worked collaboratively with the same dataset to provide a framework for clinical reporting of RNA-Seq research. The gene expression datasets used as test data in this study were obtained from publicly available databases from The DataBase of Kashiwa Encyclopedia of Researchers of multi-Omics data (DBKERO)[https://kero.hgc.jp/]. The raw cell files were reanalysed using HISAT2 and featureCounts to obtain summarised expression values for transcripts. Differential expression was performed by using edgeR. CAVIAR was then used to distinguish causal variants from other signals in the data, such as population stratification.The vcf files produced from the CTAT Mutations pipeline v2.5 are analyzed using vcfR v1.12.0 and the bioconductor package Variant Annotation v1.36.0. Genomic visualizations are completed with Gviz v1.34.1 and the snpReportR tool provides a helper function to create genomic references for use with Gviz . Finally, the tool generates two reports, one is aimed for clinical use and the second aimed for researchers, informing the interpretation of genetic variants pertaining to the gene provided by the user. Both reports provide RNA-Seq summary analysis which includes diseases association, disease gene identification, gene prioritization, tissue expression information, supported with data visualisation and 5 recent publications related to the identified gene. Tissue expression, disease association, publication, and snp data for differentially expressed genes is extracted from HumanMine [@smith2012] using the InterMineR v1.1 R package. These multifaceted and integrated results were used to construct an interactive clinical report, which is sent to the user via gmailr [@hester2019] and the HTML email produced using the Blastula v.0.3.2. The report is generated using a customized Rmarkdown template with parameterized inputs, which allows for easy customization.

Test Data

For demonstration purposes, we identified a dataset that was agreeable for all teams’ efforts and was used solely as an example for showcasing the workflow. This does not represent the kind of patient-level data that would be best suited to this analysis pipeline. The resulting dataset is a lung cancer dataset [@suzuki2019]; ENA accession: PRJDB6952 corresponding to 23 lung cancer cell lines treated with 95 compounds including approved receptor tyrosine kinase inhibitors and epigenetic targeting drugs. High-throughput RNA-seq was conducted with four different drug concentrations at three time points (24,48,72h).

Table 1. Three cell lines and four treatments were included for testing (124 total samples). [@suzuki2019]; ENA accession: PRJDB6952

Cell Line Drug Number of Samples
A549 (+)-JQ1 (Inhibitor_BET (BRD4)) 12
DMSO (Control) 9
Etoposide (Inhibitor_Topo II) 11
Temsirolimus (Inhibitor_mTOR) 11
H1299 (+)-JQ1 (Inhibitor_BET (BRD4)) 11
DMSO (Control) 9
Etoposide (Inhibitor_Topo II) 12
Temsirolimus (Inhibitor_mTOR) 11
II-18 (+)-JQ1 (Inhibitor_BET (BRD4)) 8
DMSO (Control) 8
Etoposide (Inhibitor_Topo II) 11
Temsirolimus (Inhibitor_mTOR) 11

Variant Calling and Annotation

To identify variants, the CTAT Mutations pipeline was used with the FASTQ files mentioned above with the GRCh38 human reference. This pipeline integrates ‘GATK Best Practices’ with additional operations to annotate and filter variants and to prioritize variants that may be relevant to cancer biology. The CTAT pipeline then annotates variants with RADAR [@zhang2020] and RediPortal [@picardi2017] databases for identifying likely RNA-editing events, COSMIC [@tate2019] to highlight known cancer mutations, and dbSNP [@sherry2001] and gnomAD [@karczewski2020] to annotate common variants. OpenCRAVAT [@pagel2020] is subsequently used to annotate and prioritize variants according to likely biological impact and relevance to cancer. The cancer-related annotations from OpenCRAVAT included ClinVar [@shihab2013], CHASMplus [@tokheim2019], MuPIT [@niknafs2013], VEST [@carter2013] and FATHMM [@shihab2013]. We encoded the pipeline using Workflow Description Language (WDL) and deployed the workflow on the DNANexus cloud computing platform as an app. To run the pipeline on DNANexus, create a new workflow and then select the “Trinity CTAT” app from the Tool Library. The tool takes three inputs: left FASTQ, right FASTQ, and the CTAT Genome Library, which can be obtained from the STAR-Fusion [@haas2019] GitHub page.

Analysis of gene expression

To incorporate differential expression results in the clinical report, a subset of the Suzuki et al. (2019, ENA accession: PRJDB695) test data (Table 2) were used. The differential expression testing was performed to obtain results and formatting information only and was not evaluated for biological impact. All analyses were performed in Galaxy for ease of use. Raw RNA-Seq reads were aligned to GRCh38 using HISAT2 (v.2.1.0) [@kim2019], normalized counts were estimated using featureCounts (v.1.6.4) [@liao2014], and differential expression testing was performed using edgeR (v.3.24.1) [@robinson2010].

Table 2. Reduced dataset for differential expression testing.

Run BioProject Cell Line Sample Name Drug Concentration ($\mu$M)
DRR131576 DRR131576 A549 RNA-seq_A549_24h_B07_Etoposide (Inhibitor_Topo II)_1 Etoposide (Inhibitor_Topo II) 1.0
DRR131588 PRJDB6952 A549 RNA-seq_A549_24h_C07_Etoposide (Inhibitor_Topo II)_0.1 Etoposide (Inhibitor_Topo II) 0.1
DRR131599 PRJDB6952 A549 RNA-seq_A549_24h_D07_Etoposide (Inhibitor_Topo II)_0.01 Etoposide (Inhibitor_Topo II) 0.01
DRR132310 PRJDB6952 H1299 RNA-seq_H1299_24h_B07_Etoposide (Inhibitor_Topo II)_1 Etoposide (Inhibitor_Topo II) 1.0
DRR132321 PRJDB6952 H1299 RNA-seq_H1299_24h_C07_Etoposide (Inhibitor_Topo II)_0.1 Etoposide (Inhibitor_Topo II) 0.1
DRR132333 PRJDB6952 H1299 RNA-seq_H1299_24h_D07_Etoposide (Inhibitor_Topo II)_0.01 Etoposide (Inhibitor_Topo II) 0.01

Gene and loci identification

The DSVifier pipeline is comprised of two tracks that are processed in parallel with the existing bioinformatics tools Somalier [@pedersen2020] and CAVIAR [@hormozdiari2014] and is designed to identify the variants correlating with a particular disease. Somalier requires a single merged VCF input file, and produces TSV and HTML file outputsthat describe aspects of the input variants, such as their relatedness and ancestry. Somalier extracts informative sites, evaluates relatedness, and performs quality-control. CAVIAR requires two additional input tab-delimited input files; specifically, Z-file containing GWAS Z-score summary statistics and an LD-file describing the pairwise correlation between pairs of SNPs in the input VCFs. The LD-file can be generated using PLINK [@plink, @purcell2007], and should include dbSNP rsIDs in the first column and their scores in the second column. The LD-file should be a square matrix representing the pairwise comparison of the SNPs in the same order they appear in the Z-file. There are three output files: (1) causal SNPs in the GWAS that provided the Z-scores in the Z-file, (2) the causal posterior probability for each SNP in GWAS, and (3) the colocalization posterior probability (CLPP) for each SNP.

CAVIAR is used to leverage Bayesian statistical methods to predict the SNPs in LD that are likely causal. Indeed, neighboring SNPs have inflated statistical associations to complex traits due to linkage disequilibrium. As a result, it is difficult to discern the true causal SNPs contributing to the trait (in this case, cancer). We can also use eCAVIAR to integrate the expressed variant data with tissue specific eQTL data to predict the tissue specific effect of the SNPs [@hormozdiari2014]. For obtaining expression quantitative trait loci (eQTLs) data, the GTEx (genotype-tissue expression) dataset needs to have sufficient data across the tissues of interest.

Identification of disease-correlated variants

The FUMA web server [@fuma] was used to identify potential lead SNPs by controlling for LD structure [@watanabe2017]. We can overlap these SNPs with the fine-mapped SNPs outputted by CAVIAR to obtain a set of high confidence causal SNPs. Linkage disequilibrium (LD) score regression [@bulik2015] was conducted by using PLINK to identify to what extent these cancer-associated SNPs are enriched in promoter and regulatory regions of the genome specific to particular cell types and tissues. This can uncover which cell types can be potentially targeted for therapies.

Clinical Reporting

In this workflow, users are expected to input a fastq file and either patient or research subject phenotype information into the pipelines above. Annotated vcfs and expression matrices from the pipelines above then feed into the visualization suite below.

Variant prioritization and Annotation Input data | fastq | phenotypes

Figure 1. SnpReportR Pipeline. (0) The pipeline begins with .fastq file inputs. (1) SnpReportR accepts an annotated .vcf file as an input in R. (2) SnpReportR mines data on genes of interest from HumanMine. (3) Annotated data from input .vcf files and data scraped from HumanMine are used to generate detailed clinical reports.

Visualization in an Interactive Report

The report includes searchable and sortable tables to help visualize the data in tabular format and allows the user to quickly find genes of interest using the Javascript backed datatable DT v0.17 R package. The query results from InterMineR include expression values in transcripts per million (TPM) for genes identified with SNVs/indels by the CTAT Mutations pipeline. The expression patterns are displayed as barplots of the expression scores and are made interactive with the use of plotly R v4.9.3. Expression of these genes in normal tissues can help elucidate whether the expressed variant may be targetable, for example, as it may contain neoantigens. Next, the type of the SNVs (coding, intronic, missense, ect) in the genes is summarized in a donut plot to show the relative frequency of each mutation type identified in the patient sample using ggplot2 v3.3.3. Then the position of the SNV/indel is plotted with Gviz v1.34.1 and the bioconductor transcript database (TxDB) object to show the location of the variant within the transcript isoforms and provides an ideogram to define the chromosome location and karyotype band. The genes with variants identified are then examined for expression using the RNAseq normalized counts data using boxplots with ggplotly. Differentially expressed genes are reported in tables to determine if any SNVs are highly expressed compared to a control group and visualized interactively with ggplotly as a volcano plot.

Results

Pipeline Analysis

The RNAseq variant calling and annotation pipeline creates VCF and HTML formatted outputs that describe the impact and clinical relevance of each expressed variant. These annotations are used to prioritize, filter, and analyze factors of individual variants including their presence within global populations, relevance to cancer, and impact on RNA-editing sites.

Differential expression testing produced a list of significant differentially expressed genes with associated log2 fold change values and p values. These results were used to give an example of the format and types of information that would be expected from differential expression testing so that it can be easily integrated into clinical reports.

The pipeline leads to HTML interactive graphs giving the ancestry and relatedness between the samples. The output will also include a list of the variants likely associated with the disease.

While able to be executed as a standalone program, DSVifier can build upon the outputs of the CTAT Mutations pipeline to generate data-rich annotated .vcf files, and it is our intention for these programs to be used together, as parts of a complete analysis pipeline. After using each of the aforementioned programs, the resulting annotated .vcf files then serve as inputs for the final program in the analysis pipeline, SnpReportR, an R package to generate detailed RNA-seq analysis reports from our analysis pipeline for both clinicians and researchers.

Annotation

SnpReportR utilises multiple databases and links variants to genes and annotates gene impact, allele frequency, and the overlap with clinically-relevant SNVs. All data, including are available for download in a tab-delimited file. Each variant has been extensively annotated and aggregated in a customizable table using OMIM/OMIA Variant, dbSNP - Variant Allele origin and allele frequency known or predicted RNA-editing site, Repeat family from UCSC Genome Browser Repeatmasker Annotations Homopolymer adjacency Entropy around the variant Splice adjacency FATHMM pathogenicity prediction COSM.

Report generation

Finally, a user-friendly customisable report is generated. SnpReportR provides a detailed HTML output that describes variants within an inputted VCF file, shown in Figure 1. SnpReportR creates two separate reports, the first is aimed at patients and non-specialist clinicians, and the second report provides more in-depth information for genetic researchers. The HTML report is generated using a VCF file for the CTAT Mutations pipeline and includes annotations on the genes identified to be most likely to be disease causing. That is, variant detection from RNAseq provides numerous snvs and small indels, but most are not associated with pathogenic potential. The VCF filtering completed during the report generation identifies only those genes with pathogenic potential and with moderate to high impact on the genes’ protein product (eg. frameshift or early termination). The most likely pathogenic candidates are then further annotated for the information included in Table 3.

In the gene level summary of the most pathogenic variants identified, each column in the dynamic table can be sorted and searched dynamically, and all data used by the app is available for download in tab-delimited files. By default, allele frequency is reported based on dbVar and gnomAD genomes and exomes. SnpReportR utilises multiple databases and links variants to genes and annotates gene impact, allele frequency, and the overlap with clinically-relevant variants. In addition, the report includes extensive variant annotation from OpenCRAVAT including known and predicted RNA-editing sites, repeat family from UCSC Genome Browser, homopolymer adjacency, entropy around the variant, splice adjacency, FATHMM pathogenicity prediction, and COSMIC annotation.

Table 3: Description of annotations provided by snpReportR output.

Annotation Description of annotation
gene-level summary dynamic table with the annotated variants that impact the gene
Gene variant resources dynamic table with the annotated variants that impact the gene
Gene variant visualizations Plots of the genomic location of each variant and the frequency variant types, including missense, synonymous, and non-coding regions

1- Patients, non-specialist clinicians

The report was designed to use a patient friendly language. The R package provides opportunities to customize the header and include a user’s institution or logo.

Figure 2. SNPReportR interactive report, SNVs. SnpReportR generates detailed, accessible genetic results reports. The report includes information on SNV pathogenicity, impact, and enrichment in genes.

Figure 3. SnpReportR interactive report, tissue expression. SnpReportR reports contain detailed information on tissue expression of genes of interest. The figure is supported by plotly R package, which allows a user to hover over the bar plots and receive additional information about the tissue expression.

2- Genetic researchers

Figure 4. SnpReportR interactive report, expressed variants. (A) SnpReportR reports include the location of SNVs in the gene body data for expressed variants, as well as location of the SNV in the gene transcripts and chromosome using Gviz. (B) Expression of the gene variants is displayed as a box plot between two conditions in the DE analysis. The DE gene lists are in the second tab “DE genes by condition” and can be searched, filtered, and sorted.

Figure 5. SnpReportR interactive report, publications. SnpReportR reports include recent publications in which genes of interest are mentioned.

Discussion

SnpReportR allows researchers to perform integrated analysis of NGS data by identifying significant correlations in the genome that may be genetically-important informers of disease or pharmacological effects [@fernandez2019]. Specifically, SnpReportR can be used to uncover information regarding the relevance of one or more variants, whether it be in the context of population or disease [@pedersen2020], and yield statistical summary of the variants likely associated with a disease [@hormozdiari2014]. Importantly, SnpReportR outputs two reports that will help bridge the gap in expertise among the various health care professionals: a comprehensive genetic report aimed for genetic researchers, as well as a report designed for non-specialist clinicians and patients. A first priority for the next iteration of this project is an API where researchers can search for expressed variants across patients.

Software availability

  1. Expressed Variant Impact - Source code available at https://github.com/collaborativebioinformatics/expressed-variant-impact
  2. SnpReportR - Source code available at https://github.com/collaborativebioinformatics/expressed-variant-reporting
  3. DSVifier - Source code available at https://github.com/collaborativebioinformatics/DSVifier
  4. Differential Expression and Variant Association - Source code available at https://github.com/collaborativebioinformatics/Differential_Expression_and_Variant_Association

Competing interests

Ben Busby is a full time employee of DNAnexus.

Grant information

Ahmad Al Khleifat is funded by The Motor Neurone Disease Association and NIHR Maudsley Biomedical Research Centre.

Alan M. Cleary and Sam Hokin are funded by USDA-ARS Cooperative Agreement #5030-21000-069-02-S.

Jenny Leopoldina Smith is funded by the Fred Hutchinson Cancer Research Center, Seattle, WA.

The work of Adelaide Rhodes was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Brandon Michael Blobner is supported by NIH grant 5T32DK063922-18.

Acknowledgements

We would like to thank the OpenCravatGroup at Johns Hopkins University, Carnegie Mellon University Libraries, and DNAnexus Inc. for helping with organizing and hosting this event. DNAnexus Inc. also sponsored cloud computing resources. Patrick Cambell and Hannah Gunderman of CMU Libraries provided manuscript preparation support.

References