Skip to content

Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls

License

Notifications You must be signed in to change notification settings

jingquanlim/quandico

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

46 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

quandico

Abstract

The tool quandico applies statistical methods to detect copy number variations (CNV) in a sample by comparing the read counts from next generation sequencing (NGS) performed after PCR-enrichment of regions of interest, typically a set of genes with known or expected relevance for the sample, e.g. genes that play a role in cancer. Counts from a normal control (ideally matched normal from the same individual, e.g. healthy tissue) are required.

A detailed description can be found in the publication of this tool: Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls: BMC Bioinformatics 2015, 16:17.

Implementation

The main script is written in R and there are two helper scripts and one main driver script written in Perl.

Workflow

  1. PCR-enrichment of target regions, sequencing and read-mapping (SAM/BAM file)
  2. Extract counts for every PCR amplicon (qgetcounts)
  3. Cluster read-counts into regions (qcluster)
  4. Compare counts per cluster and perform CNV calling (quandico)

Steps 2, 3 and 4 can be performed with one single command line call of the Perl driver script quandico. Step 4 alone can be performed inside R.

Requirements

Running quandico requires R with some commonly available packages from CRAN. Please visit The R-project homepage for advice how to install R on your system. Dependencies for quandico should be installed automatically when installing the package.

Accessory applications such as a command-line driver script (quandico), the script to extract counts of mapped reads (qgetcounts) and the region clustering script (qcluster) require Perl. Perl is already installed on most Linux systems, please check Perl.org for details. If you are using Windows we recommend Strawberry Perl. All dependencies should automatically be installed from CPAN.

Installation

Fast

Download the install.sh shell script for Unix/Linux. Check the configuration in the first section of that script, adjust parameters and then start it in a clean directory, e.g.:

$ mkdir quandico
$ cd quandico
$ wget https://github.com/reineckef/quandico/raw/master/install.sh
$ sh install.sh

This is not well tested and may fail on various systems for various reasons.

Detailed

Install these R packages, either by using R> install.packages('name') inside an R session or by running the install command from a command line: $ R CMD INSTALL 'name':

  • MASS
  • ggplot2
  • grid
  • gridExtra
  • naturalsort
  • scales

The string n.m will be used for the version number (major.minor) of quandico. This was 1.14 by the time of writing.

  • Download the latest release and install the packaged R code of quandico (file quandico_n.m.tar.gz):

$ R CMD INSTALL quandico_n.m.tar.gz

optional but recommended

  • Download the latest release and install the Perl module QUANDICO (file QUANDICO-vn.m.tar.gz) using make on Linux, and dmake on Windows:
$ tar xvfz QUANDICO-vn.m.tar.gz
$ cd QUANDICO-vn.m
$ perl Makefile.PL
$ [d]make
$ [d]make test
$ [d]make install # run this as sudo/root for system install

Alternatively, you can use cpanm from App::cpanminus to install to module directly from the downloaded archive. We recommend to use --verbose mode:

$ cpanm --verbose QUANDICO-vn.m.tar.gz # run this as sudo/root for system install

To start from mapped reads in SAM/BAM format, the Perl script qgetcounts will call samtools (version 1.1 or later) to extract the counts. Therefore, samtools needs to be installed, please visit Samtools for advice.

Example Data

Example data files to test the tools are available on Google drive. To start from mapped reads (in SAM or BAM files, please note that this step additionally requires samtools version >= 1.1).

Please note, these files can also be downloaded using wget (sometimes these links do not work, for unclear reason):

$ wget --no-check-certificate https://googledrive.com/host/0BzLnl09R3GITQjBUZUFVcy1BNFk -O CNA902Y.bed
$ wget --no-check-certificate https://googledrive.com/host/0BzLnl09R3GITMzNyakhveTh3UVE -O M62_NA13019.bam
$ wget --no-check-certificate https://googledrive.com/host/0BzLnl09R3GITSnU1TlVRSjRXRHM -O M62_NA12878.bam

To skip count extraction and start with clustering, the extracted counts of these samples are also available:

Please note, these files can also be downloaded using wget (sometimes these links do not work, for unclear reason):

$ wget --no-check-certificate https://googledrive.com/host/0BzLnl09R3GITUDZ0aXFBd2pDR0k -O M62_NA13019.tsv
$ wget --no-check-certificate https://googledrive.com/host/0BzLnl09R3GITWU9xTndtZE5iOEE -O M62_NA12878.tsv

A file with gene names and coordinates is required if clusters should be named using gene names. For human assemblies GRCh37 (hg19) and GRCh38 (hg38), these file can be downloaded from UCSC:

Clustered count files ready for processing with the final step are these:

Please note, these files can also be downloaded using wget (sometimes these links do not work, for unclear reason):

$ wget --no-check-certificate https://googledrive.com/host/0BzLnl09R3GITYTduNDY4azJNZXM -O M62_NA13019.clustered
$ wget --no-check-certificate https://googledrive.com/host/0BzLnl09R3GITWm1FS0duczVlejQ -O M62_NA12878.clustered

Running

You can run all steps separately:

# extract counts
$ qgetcounts -i M62_NA13019.bam -a CNA902Y.bed > M62_NA13019.tsv
$ qgetcounts -i M62_NA12878.bam -a CNA902Y.bed > M62_NA12878.tsv

# cluster the counts
$ qcluster -i M62_NA13019.tsv [--names refGene.txt] > M62_NA13019.clustered
$ qcluster -i M62_NA12878.tsv [--names refGene.txt] > M62_NA12878.clustered

# call copy numbers
$ quandico --no-cluster \
   -s data=M62_NA13019.clustered \   # file with clustered counts
   -r data=M62_NA12878.clustered \   # file with clustered counts
   -s x=2 -s y=0 -r x=2 -r y=0   \   # sample (-s) and reference (-r) are female
   [--cp names=refGene.txt]          # optional for naming of clusters

Alternatively, all this can be done using one single command:

$ quandico -s map=M62_NA13019.bam -s x=2 -s y=0 \ # sample
           -r map=M62_NA12878.bam -r x=2 -r y=0 \ # reference
           -a CNA902Y.bed                       \ # amplicons
           -d results -b 13019_vs_12878         \ # output location and name
           [--cp names=refGene.txt]               # optional cluster names

Example Output

In the folder 'ExampleOutput' the result files obtained by running the above mentioned steps with sample NA13019 and reference NA12878 from sequencing run M62 are presented. These are:

An image from one page (number 19) from the detailed PDF shows a loss (top) and a gain (bottom) event in that sample on chromsome X:

Example page

Help

The Perl command-line tools present a help and usage information when run without arguments (or by explicitly using -h or --help).

The R function provides help (after installation) by doing:

R> library(quandico)
R> ?quandico

Issues

Please report bugs to the author Frank Reinecke or file them on quandico's home on GitHub.

About

Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • R 56.0%
  • Perl 37.5%
  • Shell 6.5%