Skip to content

Benchmarking

pongorlorinc edited this page May 2, 2019 · 6 revisions

For benchmarking and validation purposes, we processed published ATAC-seq data, quantified peak coverages, and generated scaled coverage tracks. To provide comparisons, we used BEDtools for validation of performance and raw read counts, and developed coverage-track generation benchmarks with IGVTools and Deeptools (bamCoverage).

Input dataset

Data from ATAC-seq was used:

"SLFN11 Blocks Stressed Replication Forks Independently of ATR. Mol Cell, 2018"

Pubmed and GEO page.

Sample processing was done as explained here.

Performance and Validation of Peak Quantification

In the six processed ATAC-seq samples, a total of 32,819 peaks were identified (transposase-accessible chromatin regions).

Performance

Since BAMscale has multi-thread capabilities, we quantified peaks using 1, 4, and 8 threads. BEDtools' multicov is a single threaded program, and was run once. BAMscale is >2-fold quicker than BEDtools using 4 threads, with similar performance using one thread. It is important to note, that BEDtools generates raw read counts only, while BAMscale outputs three different normalized tables as well (FPKM, TPM and library size). We suggest using 4 threads, since higher thread counts have minimal impact on performance.

Validation

Correlation of raw read counts between BEDTools and BAMscale was above 0.999 in all cases.

Performance and validation of scaled coverage track generation

Next we compared performance of coverage track generation to IGVTools and Deeptools bamCoverage programs. IGVTools can only generate unscaled tracks, with no read filter functionality. Deeptools "bamCoverage" on the other hand is capable of filtering reads and output scaled coverage tracks.

Performance

BAMscale using a single thread outperforms IGVtools, and outperforms deeptools (using 4 threads) by ~7-fold using a single thread. We see some performance increase by adding additional threads to BAMscale, but >4 threads is not suggested.

Comparison

Results from the three methods output related results, with some minor differences in track height, due to differences in binning technique and scaling.

Pre-processing ATAC-seq data

1) Aligning fastq Files

Raw sequencing reads were downloaded using the fasterq-dump program sratoolkit, and aligned using the BWA mem aligner using default alignment settings (but multithreaded) to the hg19 human genome. Reads aligned with BWA were directly converted from SAM format to BAM format using samtools.

bwa mem -t 8 <hg19.idx> <PAIR1.fq> <PAIR2.fq> \
      | samtools view -Sbh \
      > out.unsorted.bam

2) Sorting Reads and Marking Duplicate Reads (alignments)

The aligned and unsorted reads were then sorted by samtools. Sorted reads were then indexed, duplicate reads were marked using picard-tools, and indexed one additional time.

samtools sort -o out.sorted.bam out.unsorted.bam

samtools index out.sorted.bam

java -Xmx8g -jar picardtools.jar MarkDuplicates \
      I=out.sorted.bam \
      O=out.sorted.dedup.bam \
      M=out.sorted.dedup.metrics

samtools out.sorted.dedup.bam

3) Peak Calling

Peaks were called using the MACS2 peak caller. Since the data was paired-end, the "BAMPE" flag was set.

macs2 callpeak -t out.sorted.dedup.bam --outdir outdir \
      -f BAMPE -g hs --cutoff-analysis -q 0.01 \

4) Generating Uniform Peak Sets

Called peaks files were merged to create one BED file for all samples (using the unix cat command). Since many peak positions are common across samples, peaks were first sorted based on their position, then merged into a unique set of peaks using bedtools.

cat *.narrowPeak \
      | bedtools sort \
      | bedtools merge \
      > merged_peaks.bed   

5) Quantifying Peaks

Peaks were quantified using the BAMscale "cov" function. The output of this step are tables with: raw read counts, FPKM, TPM and library-size scaled peak-scores. For comparison, we calculated raw read counts using the BEDTools "multicov" program.

./BAMscale cov --bed merged_peaks.bed \
      --bam SAMPLE1.sorted.dedup.bam \
      --bam SAMPLE1.sorted.dedup.bam \
      --bam SAMPLEn.sorted.dedup.bam

bedtools multicov -bed merged_peaks.bed \
      -p \
      -bams SAMPLE1.sorted.dedup.bam SAMPLE2.sorted.dedup.bam SAMPLE3.sorted.dedup.bam \
      > Raw_coverages.bedtools.txt

6) Generating Scaled Coverage Tracks with BAMscale

Scaled coverage tracks for each BAM file were generated with the BAMscale "scale" function. The default settings were utilized, meaning the coverages in bins were scaled to the number of aligned bases and divided by the gene length. As comparison, we generated coverage tracks with IGVtools (using the hg19 chromSize file) and deeptools bamcoverage

./BAMscale scale --bam SAMPLE1.sorted.dedup.bam

igvtools count -z 5 SAMPLE1.sorted.dedup.bam SAMPLE1.sorted.dedup.bam.tdf hg19.chrom.sizes

bamCoverage --ignoreDuplicates --binSize 5 --effectiveGenomeSize 2864785220 \
    -o SAMPLE1.sorted.dedup.bam.bw \
    --normalizeUsing RPGC \
    -b SAMPLE1.sorted.dedup.bam 
Clone this wiki locally