diff --git a/.gitignore b/.gitignore index 4cecece..75ba058 100644 --- a/.gitignore +++ b/.gitignore @@ -2,3 +2,7 @@ data/ *.bcf src/duphold +*.bam* +*.cram* +syndip/ +*.vcf* diff --git a/README.md b/README.md index cbfb9f1..b5da78e 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ single sample with: + **DHBFC**: fold-change for the variant depth *relative to bins in the genome with similar GC-content*. + **DHD**: rapid change in depth at one of the break-points (1 for higher (DUP). 0 for no or conflicting changes. -1 for drop (DUP), 2 or -2 for both break points) -If a SNP/Indel VCF is given, `duphold` will annotate each DEL/DUP call with: +If a SNP/Indel VCF/BCF is given, `duphold` will annotate each DEL/DUP call with (see below for more detail on what it does): + **DHET**: counts of SNP heterozygotes in the SV supporting: [0] a normal heterozygote, [1] a triploid heterozygote. for a DUP, we expect most hets to have an allele balance closer to 0.33 or 0.67 than to 0.5. A good heterozygous @@ -29,6 +29,20 @@ If a SNP/Indel VCF is given, `duphold` will annotate each DEL/DUP call with: It also adds **GCF** to the INFO field indicating the fraction of G or C bases in the variant. + +## SNP/Indel annotation + +**NOTE** it is strongly recommended to use BCF for the `--snp` argument as otherwise VCF parsing will be a bottleneck. + ++ A DEL call with many HETs is unlikely to be valid. ++ A DUP call that has many HETs that have a 0.5 allele balance is unlikely to be valid. + +When the user specifies a `--snp` VCF, `duphold` finds the appropriate sample in that file and extracts high (> 20) quality, bi-allelic +SNP calls. For each chromosome, it will store a minimal (low-memory representation) in a sorted data-structure for fast access. It will +then query this data structure for each SV and count the number of heterozygotes supporting a diploid HET (allele balance close to 0.5) +or a triploid HET (allele balance close to 0.33 or 0.67) into `DHET`. It will store the number of Hom-Ref, Hom-Alt, Unnkown calls in +`DHHU`. + ## Performance ### Speed @@ -48,10 +62,13 @@ coming soon. ## Usage ``` -duphold -t 4 -v $svvcf -b $cram -f $fasta -o $output.bcf -duphold --threads 4 --vcf $svvcf --bam $cram --fasta $fasta --output $output.bcf +duphold -s $gatk_vcf -t 4 -v $svvcf -b $cram -f $fasta -o $output.bcf +duphold --snp $gatk_bcf --threads 4 --vcf $svvcf --bam $cram --fasta $fasta --output $output.bcf ``` +`--snp` can be a multi-sample VCF/BCF. `duphold` will be much faster with a BCF, especially if +the snp/indel file contains many (>20 or so) samples. + the threads are decompression threads so increasing up to about 4 works. ## Examples diff --git a/nim.cfg b/nim.cfg index b81c897..0c5211f 100644 --- a/nim.cfg +++ b/nim.cfg @@ -1 +1,2 @@ threads:on +--passC:"-flto" diff --git a/src/duphold.nim b/src/duphold.nim index 2ceac60..75e9429 100644 --- a/src/duphold.nim +++ b/src/duphold.nim @@ -665,7 +665,7 @@ Options: -v --vcf path to sorted SV VCF/BCF -b --bam path to indexed BAM/CRAM -f --fasta indexed fasta reference. - -s --snp optional path to snp/indel VCF with which to annotate SVs. + -s --snp optional path to snp/indel VCF/BCF with which to annotate SVs. BCF is highly recommended as it's much faster to parse. -t --threads number of decompression threads. [default: 4] -o --output output VCF/BCF (default is VCF to stdout) [default: -] -d --drop drop all samples from a multi-sample --vcf *except* the sample in --bam. useful for parallelization by sample followed by merge.