Skip to content

Commit

Permalink
Paper (#18)
Browse files Browse the repository at this point in the history
* work on evaluation

* simulated regions

* add new test

* adjust plotting
  • Loading branch information
brentp authored Jan 28, 2019
1 parent dd361bf commit 98967ec
Show file tree
Hide file tree
Showing 15 changed files with 9,150 additions and 132 deletions.
2 changes: 2 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ v0.1.2 (dev)
+ reduce logging output.
+ support un-indexed VCFs without contig definitions.
+ use only PASS SNVs from --snps
+ set DHFFC flank distance to 1000 default and allow setting via `DUPHOLD_FLANK` env var and only use non-zero values in DHFFC flank. These 2 changes should improve DHFFC for samples with sparse coverage (e.g. for genomes without good assemblies).


v0.1.1
======
Expand Down
26 changes: 16 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,24 +18,18 @@ single sample with:
+ **DHBFC**: fold-change for the variant depth *relative to bins in the genome with similar GC-content*.
+ **DHFFC**: fold-change for the variant depth *relative to **F**lanking regions*.

If a SNP/Indel VCF/BCF is given, `duphold` will annotate each DEL/DUP call with (see below for more detail on what it does):

+ **DHET**: counts of SNP heterozygotes in the SV supporting: [0] a normal heterozygote, [1] a triploid heterozygote.
for a DUP, we expect most hets to have an allele balance closer to 0.33 or 0.67 than to 0.5. A good heterozygous
DUP will have larger values of [1], than [0], though it's possible there are no HETs in small events.

+ **DHHU**: counts of [0] Hom-ref, [1] Hom-alt, [2] Unknown variants in the event. A heterozygous deletion may have more hom-alt SNP calls.
A homozygous deletion may have only unknown SNP calls.

It also adds **GCF** to the INFO field indicating the fraction of G or C bases in the variant.

After annotating with `duphold`, a sensible way to filter to high-quality variants is:

```
bcftools view -i '(SVTYPE = "DEL" & FMT/DHFFC[0] < 0.7) | (SVTYPE = "DUP" & FMT/DHFFC[0] > 1.3)" $svvcf
bcftools view -i '(SVTYPE = "DEL" & FMT/DHFFC[0] < 0.7) | (SVTYPE = "DUP" & FMT/DHBFC[0] > 1.3)" $svvcf
```

In our evaluations, `DHFFC` works best for deletions and `DHBFC` works slightly better for duplications.
For genomes/samples with more variable coverage, `DHFFC` should be the most reliable.


## SNP/Indel annotation

Expand All @@ -50,6 +44,18 @@ then query this data structure for each SV and count the number of heterozygotes
or a triploid HET (allele balance close to 0.33 or 0.67) into `DHET`. It will store the number of Hom-Ref, Hom-Alt, Unnkown calls in
`DHHU`.

When a SNP/Indel VCF/BCF is given, `duphold` will annotate each DEL/DUP call with:

+ **DHET**: counts of SNP heterozygotes in the SV supporting: [0] a normal heterozygote, [1] a triploid heterozygote.
for a DUP, we expect most hets to have an allele balance closer to 0.33 or 0.67 than to 0.5. A good heterozygous
DUP will have larger values of [1], than [0], though it's possible there are no HETs in small events.

+ **DHHU**: counts of [0] Hom-ref, [1] Hom-alt, [2] Unknown variants in the event. A heterozygous deletion may have more hom-alt SNP calls.
A homozygous deletion may have only unknown SNP calls.

In practice, this has not proven useful for us. The depth changes are more informative.


## Performance

### Speed
Expand Down
2 changes: 1 addition & 1 deletion docker/build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ base=$(pwd)
git clone -b devel --depth 1000 git://github.com/nim-lang/nim nim
cd nim
# before unchecked was removed:
git checkout cc5b8c6
#git checkout cc5b8c6
sh build_all.sh
export PATH=$(base)/nim/bin:$PATH

Expand Down
102 changes: 102 additions & 0 deletions paper/dupdeldist.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
import cyvcf2
import sys
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
sns.set_style('white')

SIZE_MIN = 0
SIZE_MAX = sys.maxint

colors = sns.color_palette()

metrics = {"DUP": [[], [], []],
"DEL": [[], [], []]}

metric_dup = "DHFFC"
metric_del = "DHFFC"

gcs = {"DUP":[], "DEL": []}

for v in cyvcf2.VCF(sys.argv[1], gts012=True):
if v.FILTER is not None: continue
svtype = v.INFO.get("SVTYPE")
if not svtype in ("DEL", "DUP"): continue

size = v.end - v.start
if size < SIZE_MIN: continue
if size >= SIZE_MAX: continue

gt = v.gt_types[0]

gcs[svtype].append(v.INFO.get("GCF"))
#if v.INFO.get("GCF") == 0.0: continue

metric = metric_dup if svtype == "DUP" else metric_del

val = v.format(metric)[0][0]
if np.isnan(val):
val = -1.0
if gt == 0: continue
metrics[svtype][gt].append(min(3, val))

fix, ax = plt.subplots(2, 2)
ax[0, 0].hist(metrics["DUP"][0], 40, label="0/0", alpha=0.7, color=colors[0])
ax[0, 0].hist(metrics["DUP"][1], 40, label="0/1", alpha=0.7, color=colors[1])
ax[0, 0].hist(metrics["DUP"][2], 40, label="1/1", alpha=0.7, color=colors[2])
ax[0, 0].set_ylabel("Count")
ax[0, 0].set_xlim(0, 2.5)
ax[0, 0].set_xlabel(metric_dup)
ax[0, 0].text(0.6, 0.24, "Duplications", transform=ax[0, 0].transAxes)
ax[0, 0].legend()

ax[1, 0].hist(metrics["DEL"][0], 40, label="0/0", alpha=0.7, color=colors[0])
ax[1, 0].hist(metrics["DEL"][1], 40, label="0/1", alpha=0.7, color=colors[1])
ax[1, 0].hist(metrics["DEL"][2], 40, label="1/1", alpha=0.7, color=colors[2])
ax[1, 0].text(0.6, 0.24, "Deletions", transform=ax[1, 0].transAxes)
ax[1, 0].set_xlim(0, 2.5)
ax[1, 0].set_xlabel(metric_del)
ax[1, 0].set_ylabel("Count")

from sklearn.metrics import auc, roc_curve

L = ["xx", "0/1", "1/1"]

print "> %d..%d" % (SIZE_MIN, SIZE_MAX)
for i, ev in enumerate(("DUP", "DEL")):

for alts in (1, 2):
truth = [0] * len(metrics[ev][0]) + [1] * len(metrics[ev][alts])
scores = metrics[ev][0] + metrics[ev][alts]
if ev == "DEL":
scores = [-s for s in scores]
try:
fpr, tpr, rscores = roc_curve(truth, scores)
except ValueError:
fpr = np.array([0, 1])
tpr = np.array([0, 1])
rscores = np.array([0, 1])
if ev == "DEL":
rscores = -rscores
idx = np.searchsorted(rscores, 0.7)
else:
idx = np.searchsorted(-rscores, -1.3)
ax[i, 1].plot(fpr, tpr, color=colors[alts])
ax[i, 1].plot([0, 1], [0, 1], '--', color='gray')
ax[i, 1].text(0.4, 0.04 + (2 - alts) * 0.1, "0/0 vs %s AUC: %.2f"
% (L[alts], auc(fpr, tpr)),
color=colors[alts])
print "%s\t%s\t%.2f\t%d" % (ev, L[alts], auc(fpr, tpr), len(truth))
ax[i, 1].plot([fpr[idx]], [tpr[idx]], marker='o', color=colors[alts])
ax[i, 1].text(0.4, 0.04 + 2 * 0.1, "Duplications" if ev == "DUP" else "Deletions")

ax[0, 1].set_xlabel("1 - Sensitivity")
ax[1, 1].set_xlabel("1 - Sensitivity")
ax[1, 1].set_ylabel("Specificity")
ax[0, 1].set_ylabel("specificity")

plt.tight_layout()
plt.savefig("figure1.png", dpi=600)
plt.savefig("figure1.eps", dpi=600)
plt.show()
plt.close()
97 changes: 55 additions & 42 deletions paper/evaluation.sh
Original file line number Diff line number Diff line change
Expand Up @@ -7,26 +7,30 @@ set -euo pipefail

#wait

ref=~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa
ref=/data/human/g1k_v37_decoy.fa
CRAM=/data/human/hg002.cram
truth=/data/human/HG002_SVs_Tier1_v0.6.vcf.gz
truth_del=/data/human/HG002_SVs_Tier1_v0.6.DEL.vcf.gz

bcftools view -i 'SVTYPE="DEL"' -O z -o $truth_del --apply-filters "PASS,." $truth
tabix $truth_del
<<DONE
pip install progressbar2 python-levenshtein "intervaltree<2.1.0" pyvcf pyfaidx pysam python-dateutil
~/Projects/src/bwa/bwa mem -R '@RG\tID:HG002\tSM:HG002\tPL:ILLUMINA\tPU:HG002\tLB:HG002' -t 32 $ref hg002_R1.fastq.gz hg002_R2.fastq.gz \
| samblaster \
| samtools sort -@ 4 -m 15G --output-fmt CRAM --reference $ref -o hg002.cram
bcftools view -i SVTYPE="DEL" -O z -o HG002_SVs_Tier1_v0.6.DEL.vcf.gz HG002_SVs_Tier1_v0.6.vcf.gz
tabix HG002_SVs_Tier1_v0.6.DEL.vcf.gz
bcftools view -i 'REPTYPE="DUP"' -O z -o HG002_SVs_Tier1_v0.6.DUP.vcf.gz HG002_SVs_Tier1_v0.6.vcf.gz
tabix HG002_SVs_Tier1_v0.6.DUP.vcf.gz
DONE

samtools index hg002.cram
smoove call -F --genotype -o lf-dev2/ -x -p 5 -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa -n HG002 hg002.cram
duphold -v lf-dev2/HG002-smoove.genotyped.vcf.gz -o lf-dev2/HG002-smoove.genotyped.duphold.vcf.gz -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa -t 4 -b hg002.cram
#docker run -v "$(pwd):/work" -v $(pwd)/src:/src/ -v $(dirname $CRAM):$(dirname $CRAM) -v $(dirname $ref):$(dirname $ref) brentp/smoove:v0.2.3 smoove call -d -F --genotype -o lf-dev2/ -x -p 5 -f $ref -n HG002 $CRAM

bcftools view -i 'SVTYPE="DEL"' lf-dev2/HG002-smoove.genotyped.duphold.vcf.gz -O z -o lf-dev2/HG002-smoove.genotyped.DEL.vcf.gz
bcftools view -i 'SVTYPE="DEL"' lf-dev2/HG002-smoove.genotyped.vcf.gz -O z -o lf-dev2/HG002-smoove.genotyped.DEL.vcf.gz
tabix lf-dev2/HG002-smoove.genotyped.DEL.vcf.gz
DONE

ev=DEL
filt="< 0.7"
Expand All @@ -37,72 +41,81 @@ rm -rf $eval/*

sizemax=15000000
sizemin=300
bed=/data/human/HG002_SVs_Tier1_v0.6.bed


python ~/Projects/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b HG002_SVs_Tier1_v0.6.$ev.vcf.gz -c lf-dev2/HG002-smoove.genotyped.$ev.vcf.gz -o $eval/unfiltered/ --passonly --pctsim=0 -r 20 --giabreport -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa --no-ref --includebed HG002_SVs_Tier1_v0.6.bed -O 0.6
python ~/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b $truth_del -c lf-dev2/HG002-smoove.genotyped.$ev.vcf.gz -o $eval/unfiltered/ --passonly --pctsim=0 -r 20 --giabreport -f $ref --no-ref --includebed $bed -O 0.6

bcftools view -i "(FMT/DHBFC[0] $filt)" lf-dev2/HG002-smoove.genotyped.$ev.vcf.gz -O z -o lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHBFC.vcf.gz
tabix lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHBFC.vcf.gz
python ~/Projects/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b HG002_SVs_Tier1_v0.6.$ev.vcf.gz -c lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHBFC.vcf.gz -o $eval/dhbfc/ --passonly --pctsim=0 -r 20 --giabreport -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa --no-ref --includebed HG002_SVs_Tier1_v0.6.bed -O 0.6
python ~/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b $truth_del -c lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHBFC.vcf.gz -o $eval/dhbfc/ --passonly --pctsim=0 -r 20 --giabreport -f $ref --no-ref --includebed $bed -O 0.6

bcftools view -i "(FMT/DHFFC[0] $filt)" lf-dev2/HG002-smoove.genotyped.$ev.vcf.gz -O z -o lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHFFC.vcf.gz
tabix lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHFFC.vcf.gz
python ~/Projects/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b HG002_SVs_Tier1_v0.6.$ev.vcf.gz -c lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHFFC.vcf.gz -o $eval/dhffc/ --passonly --pctsim=0 -r 20 --giabreport -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa --no-ref --includebed HG002_SVs_Tier1_v0.6.bed -O 0.6
python ~/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b $truth_del -c lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHFFC.vcf.gz -o $eval/dhffc/ --passonly --pctsim=0 -r 20 --giabreport -f $ref --no-ref --includebed $bed -O 0.6

bcftools view -i "(FMT/DHFC[0] $filt)" lf-dev2/HG002-smoove.genotyped.$ev.vcf.gz -O z -o lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHFC.vcf.gz
tabix lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHFC.vcf.gz
python ~/Projects/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b HG002_SVs_Tier1_v0.6.$ev.vcf.gz -c lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHFC.vcf.gz -o $eval/dhfc/ --passonly --pctsim=0 -r 20 --giabreport -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa --no-ref --includebed HG002_SVs_Tier1_v0.6.bed -O 0.6
python ~/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b $truth_del -c lf-dev2/HG002-smoove.genotyped.$ev.duphold-DHFC.vcf.gz -o $eval/dhfc/ --passonly --pctsim=0 -r 20 --giabreport -f $ref --no-ref --includebed $bed -O 0.6

python figure1.py eval-DEL/{unfiltered,dhfc,dhbfc,dhffc}/summary.txt
python paper/table1.py eval-DEL/{unfiltered,dhfc,dhbfc,dhffc}/summary.txt
exit

#python figure1.py eval-DUP/{unfiltered,dhfc,dhbfc,dhffc}/summary.txt
export SAMPLOT_COVERAGE_ONLY=FALSE
rm -rf ~/public_html/samplot-fps/
~/Projects/src/samplot/src/samplot_vcf.sh -O png -o ~/public_html/samplot-fps/ -v $eval/dhffc/fp.vcf -r ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa -S ~/Projects/src/samplot/src/samplot.py hg002.cram
~/src/samplot/src/samplot_vcf.sh -O png -o ~/public_html/samplot-fps/ -v $eval/dhffc/fp.vcf -r $ref -S ~/src/samplot/src/samplot.py $CRAM
export SAMPLOT_COVERAGE_ONLY=FALSE
rm -rf ~/public_html/samplot-tp/
~/Projects/src/samplot/src/samplot_vcf.sh -O pdf -o ~/public_html/samplot-tp/ -v $eval/dhffc/tp-call.vcf -r ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa -S ~/Projects/src/samplot/src/samplot.py hg002.cram

### DUPS
#wget http://labshare.cshl.edu/shares/schatzlab/www-data/fsedlaze/Sniffles/GiaB/all_reads.fa.giab_h002_ngmlr-0.2.3_mapped.bam.sniffles1kb_auto_noalts.vcf.gz

bcftools view -h all_reads.fa.giab_h002_ngmlr-0.2.3_mapped.bam.sniffles1kb_auto_noalts.vcf.gz | grep -v '#CHROM' > sniffles.dups.vcf
zgrep $'INFO\tFORMAT' all_reads.fa.giab_h002_ngmlr-0.2.3_mapped.bam.sniffles1kb_auto_noalts.vcf.gz | perl -pe 's/(.*)\t.+$/\1\tHG002/' >> sniffles.dups.vcf
bcftools view --apply-filters PASS all_reads.fa.giab_h002_ngmlr-0.2.3_mapped.bam.sniffles1kb_auto_noalts.vcf.gz | awk '$5 == "<DUP>"' >> sniffles.dups.vcf
gsort sniffles.dups.vcf ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa.fai | bgzip -c > sniffles.dups.vcf.gz
tabix -f sniffles.dups.vcf.gz

duphold -v sniffles.dups.vcf.gz -o duphold.annotated.dups.vcf.gz -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa -t 6 -b hg002.cram

unset SAMPLOT_COVERAGE_ONLY
~/Projects/src/samplot/src/samplot_vcf.sh -O png -o ~/public_html/samplot-dups/ -v duphold.annotated.dups.vcf.gz -r ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa -S ~/Projects/src/samplot/src/samplot.py hg002.cram

~/src/samplot/src/samplot_vcf.sh -O pdf -o ~/public_html/samplot-tp/ -v $eval/dhffc/tp-call.vcf -r $ref -S ~/src/samplot/src/samplot.py $CRAM


### MISSED (False Negative)
eval=missed
bcftools view -i '(FMT/DHFFC[0] > 0.85)' lf-dev2/HG002-smoove.genotyped.DEL.vcf.gz -O z -o lf-dev2/HG002-smoove.genotyped.DEL.duphold-no-support.vcf.gz
tabix lf-dev2/HG002-smoove.genotyped.DEL.duphold-no-support.vcf.gz
rm -rf $eval-no-support
python ~/Projects/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b HG002_SVs_Tier1_v0.6.DEL.vcf.gz -c \
python ~/src/truvari/truvari.py --sizemax $sizemax -s $sizemin -S $((sizemin - 30)) -b $truth_del -c \
lf-dev2/HG002-smoove.genotyped.DEL.duphold-no-support.vcf.gz -o $eval-no-support --passonly --pctsim=0 \
-r 20 --giabreport -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa --no-ref \
--includebed HG002_SVs_Tier1_v0.6.bed -O 0.95
-r 20 --giabreport -f $ref --no-ref \
--includebed $bed -O 0.95

rm -rf ~/public_html/samplot-tp-missed
~/Projects/src/samplot/src/samplot_vcf.sh -o ~/public_html/samplot-tp-missed/ -v $eval-no-support/tp-call.vcf \
-r ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa \
-S ~/Projects/src/samplot/src/samplot.py hg002.cram
~/src/samplot/src/samplot_vcf.sh -o ~/public_html/samplot-tp-missed/ -v $eval-no-support/tp-call.vcf \
-r $ref \
-S ~/src/samplot/src/samplot.py $CRAM

### DUPS

duphold -v dups.vcf.gz -o duphold.annotated.dups.vcf.gz -f $ref -t 6 -b $CRAM


## speed

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf.gz
zcat ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf.gz | cut -f 1-10 | perl -pe 's/HG00096/HG002/' | bgzip -c > ALL.wgs.merged-svs.vcf.gz

time duphold -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa -t 3 -b hg002.cram -o all.vcf -v ALL.wgs.merged-svs.vcf.gz

time duphold -f $ref -t 3 -b $CRAM -o all.vcf -v ALL.wgs.merged-svs.vcf.gz

for n in 100 1000 5000 10000 20000 40000 60000; do
time duphold -f ~/bcbio/genomes/Hsapiens/g1k_v37_decoy/seq/g1k_v37_decoy.fa -t 3 -b hg002.cram -o all.vcf -v ALL.wgs.merged-svs.vcf.gz.$n.vcf.gz
time duphold -f $ref -t 3 -b $CRAM -o all.vcf -v ALL.wgs.merged-svs.vcf.gz.$n.vcf.gz
done


######## distribution of del, dup calls for truth:

struth=paper-results/giab-with-fake.vcf.gz
nim c -d:release paper/giab_ins_to_dup.nim
# find what appear to be DUP calls
./paper/giab_ins_to_dup /data/human/HG002_SVs_Tier1_v0.6.vcf.gz /data/human/g1k_v37_decoy.fa | bgzip -c > $struth
# insert fake, 0/0 calls.
nim c -r paper/insert_regions.nim $struth /data/human/HG002_SVs_Tier1_v0.6.bed t.vcf.gz /data/human/g1k_v37_decoy.fa
mv t.vcf.gz $struth

mkdir -p paper-results/
for f in 500 5000 50000; do
DUPHOLD_FLANK=$f ./src/duphold -f /data/human/g1k_v37_decoy.fa -v $struth -t 3 -o paper-results/giab.duphold.flank-$f.vcf.gz -b /data/human/hg002.cram &
done
wait

for step in 100 250 5000; do
echo "DUPHOLD_GC_STEP=$step ./src/duphold -f /data/human/g1k_v37_decoy.fa -v $struth -t 3 -o paper-results/giab.duphold.flank-step-$step.vcf.gz -b /data/human/hg002.cram"
done | gargs -p 3 "{}"
57 changes: 0 additions & 57 deletions paper/figure1.py

This file was deleted.

Loading

0 comments on commit 98967ec

Please sign in to comment.