Skip to content
Brian Haas edited this page Sep 26, 2024 · 23 revisions

CTAT - Virus Integration Finder (ctat-VIF)

VirusIntegrationFinder (ctat-VIF) is a module of the Trinity Cancer Transcriptome Analysis Toolkit and used to identify virus insertion sites within the human genome. VIF leverages the STAR aligner to identify chimeric NGS read alignments involving the human genome and a database of viral sequences.

ctat-VIF operates to:

  • capture evidence of virus-matching reads (consistent with an infection)
  • identify and quantify evidence for viral insertion sites in the human genome
  • provide interactive visualizations for evidence of virus-mapped reads and virus insertion sites

Obtaining and Installing VirusIntegrationFinder (ctat-VIF)

See VIF Installation Instructions wiki page for details.

Running ctat-VIF

ctat-VIF requires as input Illumina NGS reads, which can be DNA-seq or RNA-seq, and a CTAT genome lib supplemented with viral genomes.

The simplest execution would be:

ctat-vif --left_fq reads_1.fastq \
         --right_fq reads_2.fastq \
         --sample_id my_sample_name \
         -O output_directory \
         --genome_lib_dir /path/to/ctat_genome_lib_dir 

run 'ctat-vif --help' for full usage info.

STAR alignment step involved here generally requires around 45 GB RAM.

ctat-VIF Outputs

ctat-VIF includes tab-delimited summaries of counts of reads corresponding to the virus insertion sites, counts of reads mapping to the different viral genome targets, and various plots and interactive genome views for interrogating these data. Outputs are detailed below, all based on the included sample data and involve human papillomavirus (HPV).

Virus insertion site candidates and read counts

Output file 'vif.insertion_site_candidates.tsv' contains the following insertion site candidates in tab-delimited format and described below:

chrA   coordA     orientA  chrB   coordB     orientB  prelim.primary_brkpt_type  prelim.total  split  span  total
chr11  117344940  +        HPV16  1215       +        Split                      117           33     79    112
chr18  73320805   +        HPV16  7409       -        Split                      168           31     72    103
chr6   64935774   +        HPV16  4034       -        Split                      177           30     70    100
HPV16  4029       +        chr3   937922     +        Split                      1             0      0     0
HPV16  1215       -        chr11  108577428  +        Split                      1             0      0     0
chr2   32018732   +        HPV16  1215       +        Split                      1             0      0     0
chr1   218562612  +        HPV16  1215       +        Split                      1             0      0     0

The coordinates and orientation of the virus fragment are indicated. The 'prelim' columns represent preliminary counts of reads (see algorithmic details below), and the final tally of evidence reads are in the final columns.

The split reads represent those individual reads that align partly to the human genome and partly to the viral genome. Spanning reads are paired-ends where one read maps entirely to the human genome and the other entirely to the viral genome and bridge the insertion site.

The total counts are plotted as 'vif.insertion_site_candidates.genome_plot.png' and as shown below:

genome_wide_abundance.png

Of course, if there are few reads supporting an insertion site, the evidence is very weak and it could be just noise. High read counts are more likely to correspond to valid insertion sites - but examine the reads in the interactive genomics viewer to ensure (shown below).

Virus Insertion Viewer

An interactive igv-reports html file 'vif.html' is included for navigating the read alignment evidence supporting each insertion site. Simply open the html file in your web browser and begin navigating.

At the top of the page you'll have the data table above. Just click on an entry of interest and it will provide the corresponding region in the web-based IGV panel like so:

VIF_insertion_example

The region of the human chromosome at the putative site of virus insertion is shown along with the reads aligning as split or spanning reads and providing evidence.

beware of low complexity alignments that could be red-herrings... Such regions will be flagged in a future software release.

Virus Infection Evidence Viewer

Even if there isn't strong evidence for virus insertions, we can still detect reads aligning to the target virus sequences and provide interactive visualizations.

File 'vif.virus_read_counts_summary.tsv' contains a tab-delimited summary of read counts corresponding to each viral genome:

virus  seqlen  mapped  chim_reads
HPV16  7906    4546    466

The 'vif.virus.igvjs.html' provides an interactive igv-reports view, tabulating the counts of reads per virus above, and yielding the IGV view for the entry you select, such as below:

virus_genome_view

Note, the IGV html is limited to showing about 100x coverage of reads across the genome. This avoids overloading the web browser with an otherwise massive html file. To examine higher coverage and differences in coverage across the genome that exceed this cap, please load the viral-aligned bam file into desktop IGV and interrogate further.

Questions, comments, etc?

Contact us via our google group: https://groups.google.com/forum/#!forum/trinity_ctat_users

Funding

Trinity CTAT is funded by the National Cancer Institute Informatics Technology for Cancer Research

Our efforts related to building a Trinity Cancer Transcriptome Analysis Toolkit are described in this Youtube screencast:

Trinity CTAT on Youtube

Acknowledgements

VIF was developed in collaboration with Cristina Montagna, Jack Lenz, and Anne Van Arsdale at Albert Einstein College of Medicine, along with Brian Haas and Joshua Gould at the Broad Institute, and Alexander Dobin at Cold Spring Harbor Laboratory.