title

title_short

Abstract

With the increasing availability of next-generation sequencing (NGS), patients and non-specialist health care professionals are obtaining their genomic information without sufficient bioinformatics skills to analyze and interpret the data. In January 2021, four teams of scientists, clinicians, and developers from around the world worked collaboratively in a virtual hackathon to create a framework for the automated analysis and interpretation of RNA sequencing data in the clinic. Here, we present SnpReportR: A Tool for Clinical Reporting of RNAseq Expression and Variants aimed for use by clinicians and others without in-depth knowledge of genetics.

Introduction

In clinical practice and biomedical research, next-generation sequencing (NGS) and subsequent identification of genomic variants including single nucleotide variations or polymorphisms, small insertions or deletions, and structural variants is an established method used to investigate the genetic causes and associations of disease. While whole genome and whole exome sequencing are relatively expensive, RNA-sequencing (RNA-seq) is a highly cost-effective and versatile method that assays both gene sequence and expression, thus yielding both genetic and functional signatures in a single technique [@hong2020, @horak2016]. In cancer diagnostics, specifically, RNA-Seq provides a means of detecting nucleotide-level resolution of mutations within transcribed regions of oncogenes and tumor suppressor genes for expressed loci.

While RNA-seq is a powerful tool for characterizing tumors, identifying and interpreting the variants that are relevant to the diagnosis or understanding of disease within a patient remains challenging. The biopsy tissue sampled for the analysis can be composed of a mix of tumor and normal cells that leads to averaged gene expression data from a heterogeneous mixture of cells [@yoshihara2013; @fan2020]. Another practical challenge for using gene expression data is the need for sophisticated tools that may require significant bioinformatics expertise and high-performance computing to carry out the data processing and analysis [@fernald2011]. For those outside the immediate field of genetics such as researchers, hospital staff, the interpretation of findings is particularly challenging. Therefore, it is important to have tools that take into account these gaps in specialized bioinformatics knowledge.

We, therefore, developed SnpReportR, a clinical genetic reporting framework for the automated analysis and interpretation of RNA sequencing data. The software is designed for use by clinicians and other users without an in-depth background in genetics.

Methods

In a coordinated codeathon, teams worked collaboratively with the same dataset to provide a framework for clinical reporting of RNA-Seq research. The gene expression datasets used as test data in this study were obtained from publicly available databases from The DataBase of Kashiwa Encyclopedia of Researchers of multi-Omics data (DBKERO)[https://kero.hgc.jp/]. The raw cell files were reanalysed using HISAT2 and featureCounts to obtain summarised expression values for transcripts. Differential expression was performed by using edgeR. CAVIAR was then used to distinguish causal variants from other signals in the data, such as population stratification.The vcf files produced from the CTAT Mutations pipeline v2.5 are analyzed using vcfR v1.12.0 and the bioconductor package Variant Annotation v1.36.0. Genomic visualizations are completed with Gviz v1.34.1 and the snpReportR tool provides a helper function to create genomic references for use with Gviz . Finally, the tool generates two reports, one is aimed for clinical use and the second aimed for researchers, informing the interpretation of genetic variants pertaining to the gene provided by the user. Both reports provide RNA-Seq summary analysis which includes diseases association, disease gene identification, gene prioritization, tissue expression information, supported with data visualisation and 5 recent publications related to the identified gene. Tissue expression, disease association, publication, and snp data for differentially expressed genes is extracted from HumanMine [@smith2012] using the InterMineR v1.1 R package. These multifaceted and integrated results were used to construct an interactive clinical report, which is sent to the user via gmailr [@hester2019] and the HTML email produced using the Blastula v.0.3.2. The report is generated using a customized Rmarkdown template with parameterized inputs, which allows for easy customization.

Test Data

For demonstration purposes, we identified a dataset that was agreeable for all teams’ efforts and was used solely as an example for showcasing the workflow. This does not represent the kind of patient-level data that would be best suited to this analysis pipeline. The resulting dataset is a lung cancer dataset [@suzuki2019]; ENA accession: PRJDB6952 corresponding to 23 lung cancer cell lines treated with 95 compounds including approved receptor tyrosine kinase inhibitors and epigenetic targeting drugs. High-throughput RNA-seq was conducted with four different drug concentrations at three time points (24,48,72h).

Table 1. Three cell lines and four treatments were included for testing (124 total samples). [@suzuki2019]; ENA accession: PRJDB6952

Cell Line	Drug	Number of Samples
A549	(+)-JQ1 (Inhibitor_BET (BRD4))	12
	DMSO (Control)	9
	Etoposide (Inhibitor_Topo II)	11
	Temsirolimus (Inhibitor_mTOR)	11
H1299	(+)-JQ1 (Inhibitor_BET (BRD4))	11
	DMSO (Control)	9
	Etoposide (Inhibitor_Topo II)	12
	Temsirolimus (Inhibitor_mTOR)	11
II-18	(+)-JQ1 (Inhibitor_BET (BRD4))	8
	DMSO (Control)	8
	Etoposide (Inhibitor_Topo II)	11
	Temsirolimus (Inhibitor_mTOR)	11

Variant Calling and Annotation

To identify variants, the CTAT Mutations pipeline was used with the FASTQ files mentioned above with the GRCh38 human reference. This pipeline integrates ‘GATK Best Practices’ with additional operations to annotate and filter variants and to prioritize variants that may be relevant to cancer biology. The CTAT pipeline then annotates variants with RADAR [@zhang2020] and RediPortal [@picardi2017] databases for identifying likely RNA-editing events, COSMIC [@tate2019] to highlight known cancer mutations, and dbSNP [@sherry2001] and gnomAD [@karczewski2020] to annotate common variants. OpenCRAVAT [@pagel2020] is subsequently used to annotate and prioritize variants according to likely biological impact and relevance to cancer. The cancer-related annotations from OpenCRAVAT included ClinVar [@shihab2013], CHASMplus [@tokheim2019], MuPIT [@niknafs2013], VEST [@carter2013] and FATHMM [@shihab2013]. We encoded the pipeline using Workflow Description Language (WDL) and deployed the workflow on the DNANexus cloud computing platform as an app. To run the pipeline on DNANexus, create a new workflow and then select the “Trinity CTAT” app from the Tool Library. The tool takes three inputs: left FASTQ, right FASTQ, and the CTAT Genome Library, which can be obtained from the STAR-Fusion [@haas2019] GitHub page.

Analysis of gene expression

To incorporate differential expression results in the clinical report, a subset of the Suzuki et al. (2019, ENA accession: PRJDB695) test data (Table 2) were used. The differential expression testing was performed to obtain results and formatting information only and was not evaluated for biological impact. All analyses were performed in Galaxy for ease of use. Raw RNA-Seq reads were aligned to GRCh38 using HISAT2 (v.2.1.0) [@kim2019], normalized counts were estimated using featureCounts (v.1.6.4) [@liao2014], and differential expression testing was performed using edgeR (v.3.24.1) [@robinson2010].

Table 2. Reduced dataset for differential expression testing.

Run	BioProject	Cell Line	Sample Name	Drug	Concentration ($\mu$M)
DRR131576	DRR131576	A549	RNA-seq_A549_24h_B07_Etoposide (Inhibitor_Topo II)_1	Etoposide (Inhibitor_Topo II)	1.0
DRR131588	PRJDB6952	A549	RNA-seq_A549_24h_C07_Etoposide (Inhibitor_Topo II)_0.1	Etoposide (Inhibitor_Topo II)	0.1
DRR131599	PRJDB6952	A549	RNA-seq_A549_24h_D07_Etoposide (Inhibitor_Topo II)_0.01	Etoposide (Inhibitor_Topo II)	0.01
DRR132310	PRJDB6952	H1299	RNA-seq_H1299_24h_B07_Etoposide (Inhibitor_Topo II)_1	Etoposide (Inhibitor_Topo II)	1.0
DRR132321	PRJDB6952	H1299	RNA-seq_H1299_24h_C07_Etoposide (Inhibitor_Topo II)_0.1	Etoposide (Inhibitor_Topo II)	0.1
DRR132333	PRJDB6952	H1299	RNA-seq_H1299_24h_D07_Etoposide (Inhibitor_Topo II)_0.01	Etoposide (Inhibitor_Topo II)	0.01

Gene and loci identification

The DSVifier pipeline is comprised of two tracks that are processed in parallel with the existing bioinformatics tools Somalier [@pedersen2020] and CAVIAR [@hormozdiari2014] and is designed to identify the variants correlating with a particular disease. Somalier requires a single merged VCF input file, and produces TSV and HTML file outputsthat describe aspects of the input variants, such as their relatedness and ancestry. Somalier extracts informative sites, evaluates relatedness, and performs quality-control. CAVIAR requires two additional input tab-delimited input files; specifically, Z-file containing GWAS Z-score summary statistics and an LD-file describing the pairwise correlation between pairs of SNPs in the input VCFs. The LD-file can be generated using PLINK [@plink, @purcell2007], and should include dbSNP rsIDs in the first column and their scores in the second column. The LD-file should be a square matrix representing the pairwise comparison of the SNPs in the same order they appear in the Z-file. There are three output files: (1) causal SNPs in the GWAS that provided the Z-scores in the Z-file, (2) the causal posterior probability for each SNP in GWAS, and (3) the colocalization posterior probability (CLPP) for each SNP.

CAVIAR is used to leverage Bayesian statistical methods to predict the SNPs in LD that are likely causal. Indeed, neighboring SNPs have inflated statistical associations to complex traits due to linkage disequilibrium. As a result, it is difficult to discern the true causal SNPs contributing to the trait (in this case, cancer). We can also use eCAVIAR to integrate the expressed variant data with tissue specific eQTL data to predict the tissue specific effect of the SNPs [@hormozdiari2014]. For obtaining expression quantitative trait loci (eQTLs) data, the GTEx (genotype-tissue expression) dataset needs to have sufficient data across the tissues of interest.

Identification of disease-correlated variants

The FUMA web server [@fuma] was used to identify potential lead SNPs by controlling for LD structure [@watanabe2017]. We can overlap these SNPs with the fine-mapped SNPs outputted by CAVIAR to obtain a set of high confidence causal SNPs. Linkage disequilibrium (LD) score regression [@bulik2015] was conducted by using PLINK to identify to what extent these cancer-associated SNPs are enriched in promoter and regulatory regions of the genome specific to particular cell types and tissues. This can uncover which cell types can be potentially targeted for therapies.

Clinical Reporting

In this workflow, users are expected to input a fastq file and either patient or research subject phenotype information into the pipelines above. Annotated vcfs and expression matrices from the pipelines above then feed into the visualization suite below.

Variant prioritization and Annotation Input data | fastq | phenotypes

Figure 1. SnpReportR Pipeline. (0) The pipeline begins with .fastq file inputs. (1) SnpReportR accepts an annotated .vcf file as an input in R. (2) SnpReportR mines data on genes of interest from HumanMine. (3) Annotated data from input .vcf files and data scraped from HumanMine are used to generate detailed clinical reports.

Visualization in an Interactive Report

The report includes searchable and sortable tables to help visualize the data in tabular format and allows the user to quickly find genes of interest using the Javascript backed datatable DT v0.17 R package. The query results from InterMineR include expression values in transcripts per million (TPM) for genes identified with SNVs/indels by the CTAT Mutations pipeline. The expression patterns are displayed as barplots of the expression scores and are made interactive with the use of plotly R v4.9.3. Expression of these genes in normal tissues can help elucidate whether the expressed variant may be targetable, for example, as it may contain neoantigens. Next, the type of the SNVs (coding, intronic, missense, ect) in the genes is summarized in a donut plot to show the relative frequency of each mutation type identified in the patient sample using ggplot2 v3.3.3. Then the position of the SNV/indel is plotted with Gviz v1.34.1 and the bioconductor transcript database (TxDB) object to show the location of the variant within the transcript isoforms and provides an ideogram to define the chromosome location and karyotype band. The genes with variants identified are then examined for expression using the RNAseq normalized counts data using boxplots with ggplotly. Differentially expressed genes are reported in tables to determine if any SNVs are highly expressed compared to a control group and visualized interactively with ggplotly as a volcano plot.

Results

Pipeline Analysis

The RNAseq variant calling and annotation pipeline creates VCF and HTML formatted outputs that describe the impact and clinical relevance of each expressed variant. These annotations are used to prioritize, filter, and analyze factors of individual variants including their presence within global populations, relevance to cancer, and impact on RNA-editing sites.

Differential expression testing produced a list of significant differentially expressed genes with associated log2 fold change values and p values. These results were used to give an example of the format and types of information that would be expected from differential expression testing so that it can be easily integrated into clinical reports.

The pipeline leads to HTML interactive graphs giving the ancestry and relatedness between the samples. The output will also include a list of the variants likely associated with the disease.

While able to be executed as a standalone program, DSVifier can build upon the outputs of the CTAT Mutations pipeline to generate data-rich annotated .vcf files, and it is our intention for these programs to be used together, as parts of a complete analysis pipeline. After using each of the aforementioned programs, the resulting annotated .vcf files then serve as inputs for the final program in the analysis pipeline, SnpReportR, an R package to generate detailed RNA-seq analysis reports from our analysis pipeline for both clinicians and researchers.

Annotation

SnpReportR utilises multiple databases and links variants to genes and annotates gene impact, allele frequency, and the overlap with clinically-relevant SNVs. All data, including are available for download in a tab-delimited file. Each variant has been extensively annotated and aggregated in a customizable table using OMIM/OMIA Variant, dbSNP - Variant Allele origin and allele frequency known or predicted RNA-editing site, Repeat family from UCSC Genome Browser Repeatmasker Annotations Homopolymer adjacency Entropy around the variant Splice adjacency FATHMM pathogenicity prediction COSM.

Report generation

Finally, a user-friendly customisable report is generated. SnpReportR provides a detailed HTML output that describes variants within an inputted VCF file, shown in Figure 1. SnpReportR creates two separate reports, the first is aimed at patients and non-specialist clinicians, and the second report provides more in-depth information for genetic researchers. The HTML report is generated using a VCF file for the CTAT Mutations pipeline and includes annotations on the genes identified to be most likely to be disease causing. That is, variant detection from RNAseq provides numerous snvs and small indels, but most are not associated with pathogenic potential. The VCF filtering completed during the report generation identifies only those genes with pathogenic potential and with moderate to high impact on the genes’ protein product (eg. frameshift or early termination). The most likely pathogenic candidates are then further annotated for the information included in Table 3.

In the gene level summary of the most pathogenic variants identified, each column in the dynamic table can be sorted and searched dynamically, and all data used by the app is available for download in tab-delimited files. By default, allele frequency is reported based on dbVar and gnomAD genomes and exomes. SnpReportR utilises multiple databases and links variants to genes and annotates gene impact, allele frequency, and the overlap with clinically-relevant variants. In addition, the report includes extensive variant annotation from OpenCRAVAT including known and predicted RNA-editing sites, repeat family from UCSC Genome Browser, homopolymer adjacency, entropy around the variant, splice adjacency, FATHMM pathogenicity prediction, and COSMIC annotation.

Table 3: Description of annotations provided by snpReportR output.

Annotation	Description of annotation
gene-level summary	dynamic table with the annotated variants that impact the gene
Gene variant resources	dynamic table with the annotated variants that impact the gene
Gene variant visualizations	Plots of the genomic location of each variant and the frequency variant types, including missense, synonymous, and non-coding regions

1- Patients, non-specialist clinicians

The report was designed to use a patient friendly language. The R package provides opportunities to customize the header and include a user’s institution or logo.

Figure 2. SNPReportR interactive report, SNVs. SnpReportR generates detailed, accessible genetic results reports. The report includes information on SNV pathogenicity, impact, and enrichment in genes.

Figure 3. SnpReportR interactive report, tissue expression. SnpReportR reports contain detailed information on tissue expression of genes of interest. The figure is supported by plotly R package, which allows a user to hover over the bar plots and receive additional information about the tissue expression.

2- Genetic researchers

Figure 4. SnpReportR interactive report, expressed variants. (A) SnpReportR reports include the location of SNVs in the gene body data for expressed variants, as well as location of the SNV in the gene transcripts and chromosome using Gviz. (B) Expression of the gene variants is displayed as a box plot between two conditions in the DE analysis. The DE gene lists are in the second tab “DE genes by condition” and can be searched, filtered, and sorted.

Figure 5. SnpReportR interactive report, publications. SnpReportR reports include recent publications in which genes of interest are mentioned.

Discussion

SnpReportR allows researchers to perform integrated analysis of NGS data by identifying significant correlations in the genome that may be genetically-important informers of disease or pharmacological effects [@fernandez2019]. Specifically, SnpReportR can be used to uncover information regarding the relevance of one or more variants, whether it be in the context of population or disease [@pedersen2020], and yield statistical summary of the variants likely associated with a disease [@hormozdiari2014]. Importantly, SnpReportR outputs two reports that will help bridge the gap in expertise among the various health care professionals: a comprehensive genetic report aimed for genetic researchers, as well as a report designed for non-specialist clinicians and patients. A first priority for the next iteration of this project is an API where researchers can search for expressed variants across patients.

Software availability

Expressed Variant Impact - Source code available at https://github.com/collaborativebioinformatics/expressed-variant-impact
SnpReportR - Source code available at https://github.com/collaborativebioinformatics/expressed-variant-reporting
DSVifier - Source code available at https://github.com/collaborativebioinformatics/DSVifier
Differential Expression and Variant Association - Source code available at https://github.com/collaborativebioinformatics/Differential_Expression_and_Variant_Association

Competing interests

Ben Busby is a full time employee of DNAnexus.

Grant information

Ahmad Al Khleifat is funded by The Motor Neurone Disease Association and NIHR Maudsley Biomedical Research Centre.

Alan M. Cleary and Sam Hokin are funded by USDA-ARS Cooperative Agreement #5030-21000-069-02-S.

Jenny Leopoldina Smith is funded by the Fred Hutchinson Cancer Research Center, Seattle, WA.

The work of Adelaide Rhodes was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Brandon Michael Blobner is supported by NIH grant 5T32DK063922-18.

Acknowledgements

We would like to thank the OpenCravatGroup at Johns Hopkins University, Carnegie Mellon University Libraries, and DNAnexus Inc. for helping with organizing and hosting this event. DNAnexus Inc. also sponsored cloud computing resources. Patrick Cambell and Hannah Gunderman of CMU Libraries provided manuscript preparation support.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

paper.md

paper.md

Abstract

Introduction

Methods

Test Data

Variant Calling and Annotation

Analysis of gene expression

Gene and loci identification

Identification of disease-correlated variants

Clinical Reporting

Visualization in an Interactive Report

Results

Pipeline Analysis

Annotation

Report generation

1- Patients, non-specialist clinicians

2- Genetic researchers

Discussion

Software availability

Competing interests

Grant information

Acknowledgements

References

Files

paper.md

Latest commit

History

paper.md

File metadata and controls

Abstract

Introduction

Methods

Test Data

Variant Calling and Annotation

Analysis of gene expression

Gene and loci identification

Identification of disease-correlated variants

Clinical Reporting

Visualization in an Interactive Report

Results

Pipeline Analysis

Annotation

Report generation

1- Patients, non-specialist clinicians

2- Genetic researchers

Discussion

Software availability

Competing interests

Grant information

Acknowledgements

References