Copy number variation is an important and abundant source of variation in the human genome and has been associated with a number of diseases, especially cancer. Massively parallel next-generation sequencing allows copy number profiling at fine resolution. Such efforts, however, have met with mixed success, with setbacks arising partly from the lack of reliable analytical methods for the diverse and unique challenges posed by the myriad experimental designs and study goals in genetic studies. In cancer genomics, detection of somatic copy number changes and profiling of allele-specific copy number (ASCN) are complicated by experimental biases and artifacts as well as by normal cell contamination and cancer subclone admixture. Furthermore, careful statistical modeling is needed to reconstruct tumor phylogeny from both somatic ASCN changes and single nucleotide variants. Here we describe a flexible computational pipeline, MARATHON (copy nuMber vARiAtion and Tumor pHylOgeNy), which integrates multiple related statistical software packages for copy number profiling and downstream analyses in disease genetic studies.
Urrutia E, Chen H, Zhou Z, Zhang NR, Jiang Y. Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny. Bioinformatics, 34(12), 2126-2128, 2018.
If you have any questions or problems when using MARATHON, you can: (i) open a new issue here; (ii) post in our Google user group https://groups.google.com/d/forum/marathon_genomics or email us at [email protected]; (iii) email the maintainers of the corresponding packages -- the contact information is shown under Developers & Maintainers. The first two contact options are preferred and we will try our best to reply as soon as possible.
A Docker image is available here. The image provides an RStudio GUI built on rocker/tidyverse, with MARATHON and all of its dependent packages and datasets pre-installed. Note that pulling the image can take a while, because it includes the human reference genome as well as the toy sequencing dataset. Instructions for using Docker can be found here.
docker pull lzeppelini/marathon
Install all packages in the latest version of R.
install.packages(c("falcon", "falconx", "devtools"))
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("WES.1KG.WUGSC", "GenomeInfoDbData", "GenomeInfoDb", "VariantAnnotation"))
devtools::install_github(c("yuchaojiang/CODEX/package", "yuchaojiang/CODEX2/package", "yuchaojiang/Canopy/package", "zhouzilu/iCNV", "yuchaojiang/MARATHON/package"))
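After running the commands above, it can be useful to confirm that every component loads. The following is a minimal sketch, not part of MARATHON itself: the helper `missing_pkgs` is hypothetical, and the package names are assumed to match the repository and CRAN names used in the installation commands.

```r
# Packages installed by the commands above.
# Assumption: each package name matches its repository or CRAN name.
marathon_pkgs <- c("falcon", "falconx", "CODEX", "CODEX2",
                   "Canopy", "iCNV", "MARATHON")

# Hypothetical helper: return the subset of `pkgs` whose namespace
# cannot be loaded in the current R library.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

not_found <- missing_pkgs(marathon_pkgs)
if (length(not_found) > 0) {
  message("Not installed: ", paste(not_found, collapse = ", "))
} else {
  message("All MARATHON components are installed.")
}
```

If any package is reported as missing, rerunning the corresponding installation line on its own usually makes the underlying error easier to read.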
The possible analysis scenarios are listed in Table 1. Figure 1 outlines the relationships among the software packages: CODEX and CODEX2 perform read depth normalization for total copy number profiling; iCNV takes the read depth normalized by CODEX/CODEX2 and combines it with allele-specific read counts and microarray data to detect CNVs; FALCON and FALCON-X perform ASCN analysis; and Canopy takes input from FALCON/FALCON-X to reconstruct tumor phylogeny.
Figure 1. A flowchart outlining the procedures for profiling CNV, ASCN, and reconstructing tumor phylogeny. CNVs with common and rare population frequencies can be profiled by CODEX and CODEX2, with and without negative control samples. iCNV integrates sequencing and microarray data for CNV detection. ASCNs can be profiled by FALCON and FALCON-X using allelic read counts at germline heterozygous loci. Canopy infers tumor phylogeny using somatic SNVs and ASCNs.
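The data flow described above can be written down as a small lookup structure. This is only an illustrative sketch of the relationships stated in the text: it calls no MARATHON function, and the names `feeds_into` and `upstream_of` are hypothetical.

```r
# Edges of the data flow described above: each named tool passes its
# output to the tool listed as its value. Only relationships stated in
# the text are encoded here.
feeds_into <- list(
  CODEX      = "iCNV",    # normalized read depth
  CODEX2     = "iCNV",    # normalized read depth
  FALCON     = "Canopy",  # allele-specific copy number
  "FALCON-X" = "Canopy"   # allele-specific copy number
)

# Hypothetical helper: list the tools whose output a given tool receives.
upstream_of <- function(tool) {
  is_source <- vapply(feeds_into,
                      function(targets) tool %in% targets,
                      logical(1))
  names(feeds_into)[is_source]
}

upstream_of("Canopy")
upstream_of("iCNV")
```

Which sequence of tools applies to a given study still depends on the analysis scenario in Table 1; the sketch only records which tool hands data to which.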
Table 1. Analysis scenarios and pipeline design. The last column shows the sequence of software that should be used for each analysis scenario. * By “normal” we mean samples that are not derived from tumor tissue, which are not expected to carry chromosome-level copy number changes.
An R notebook with a step-by-step demonstration and rich display is available here. The corresponding Rmd script is available here.
Please cite MARATHON as well as all the dependent packages that you use.
- MARATHON: Urrutia et al. 2018 Bioinformatics
  Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny (GitHub)
- CODEX: Jiang et al. 2015 Nucleic Acids Research
  A Normalization and Copy Number Variation Detection Method for Whole Exome Sequencing (Bioconductor, GitHub)
- CODEX2: Jiang et al. 2018 Genome Biology
  Full-spectrum copy number variation detection by high-throughput DNA sequencing (GitHub)
- iCNV: Zhou et al. 2017 Bioinformatics
  Integrated copy number variation detection toolset (GitHub)
- FALCON: Chen et al. 2015 Nucleic Acids Research
  Finding Allele-Specific Copy Number in Next-Generation Sequencing Data (CRAN)
- FALCON-X: Chen et al. 2017 Annals of Applied Statistics
  Finding Allele-Specific Copy Number in Whole-Exome Sequencing Data (CRAN)
- Canopy: Jiang et al. 2016 PNAS
  Accessing Intra-Tumor Heterogeneity and Tracking Longitudinal and Spatial Clonal Evolutionary History by Next-Generation Sequencing (CRAN, GitHub)
- Gene Urrutia (gene dot urrutia at gmail dot com)
  Innovation, Hill-Rom Corp.
- Yuchao Jiang (yuchaoj at email dot unc dot edu)
  Department of Biostatistics & Department of Genetics, UNC-Chapel Hill
- Hao Chen (hxchen at ucdavis dot edu)
  Department of Statistics, UC Davis
- Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
  Genomics and Computational Biology Graduate Group, UPenn
- Nancy R. Zhang (nzh at wharton dot upenn dot edu)
  Department of Statistics, UPenn