This section of the tutorial deals with filtering and comparing sets of variants. It assumes you have completed Parts 2, 4a, and 4b, so that you have three VCF files, one from each of the three variant calling approaches, in the directory structure established in the tutorial.
Steps here will use the following software packages: bcftools, vt, and vcflib.
Each major step below has an associated bash script tailored to the UConn CBC Xanadu cluster with appropriate headers for the Slurm job scheduler. The code can easily be modified to run interactively, or in other contexts.
Once you call variants on a given set of samples, you face a couple of issues:
- To varying degrees, most variant callers have high sensitivity, but low specificity, outputting many false positive results. Thus, depending on your application, raw variant sets may need to be filtered.
- You may want to compare variant sets generated by different callers, or connect them with those found in a database. Different variant callers, however, may output variants in slightly different ways due to the treatment of haplotype alleles and the ambiguity of representation inherent to some complex variants. In order to ensure apples-to-apples comparisons, you may need to break down haplotype variants into their constituent pieces and normalize their representation in your VCF file.
In this section of the tutorial, we'll cover ways you can filter and compare sets of variants.
First we'll create a new directory to hold the results we'll generate. Make sure you're in the directory vc_workshop and type:
mkdir -p filtered_vcfs
cd filtered_vcfs
There are two main approaches to dealing with the possibility that called variation is artifactual. The most common is called hard filtering. In hard filtering, you make a binary choice: first set thresholds on summary statistics describing your variants, then exclude from analysis any variants that fail those thresholds. Most variant calling tools output a wide array of statistics that can be used for this purpose. As we saw in Part 2, these statistics are often found in the INFO field (column 8 of the VCF), but a key quantity is the variant quality (column 6 of the VCF, QUAL). The variant quality accounts for all the factors the variant caller uses to judge the evidence of variation at the site and summarizes them in a single phred-scaled quality score. The main weaknesses of this approach are 1) that there is generally no threshold or combination of thresholds that perfectly discriminates true positives from false positives and 2) certain kinds of variants (e.g. rare ones) may be systematically less likely to pass filters, resulting in a biased set.
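To unpack "phred-scaled": a quality Q corresponds to an estimated error probability of 10^(-Q/10), so a threshold of QUAL < 20 flags sites with more than a 1-in-100 chance of being wrong. A quick one-liner to check the arithmetic:

```shell
# phred Q -> probability the call is wrong: P = 10^(-Q/10)
awk 'BEGIN { q = 20; printf "%.4f\n", 10^(-q/10) }'
# → 0.0100
```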
The second approach is to use a statistical model to directly incorporate the uncertainty in whether variation at a site is real, and in the genotypes of individuals at that site, into downstream analyses. These approaches do not make binary decisions on variants or genotypes, but essentially average the results of downstream analyses over all possibilities, weighted by their probabilities, conditional on the model. The main weakness of this approach (aside from tools not being available in every case) is that it is conditioned on the variant calling model accurately characterizing the uncertainty.
Which of these approaches you take will depend on the sensitivity of your analyses to the presence of false positives and false negatives, and on whether tools are available to account for uncertainty. It has been shown that for some applications in population genetics, hard filtering can cause biases (see Korneliussen et al.), and there are likely to be other such cases.
Here we will demonstrate a hard filtering approach.
In this example, we'll use bcftools to set a simple threshold on the quality score. We could exclude all the low quality variants from the new file, but in this instance we'll simply use the FILTER field (column 7 in the VCF) to flag them as either PASS or LowQual:
bcftools filter -s LowQual -e '%QUAL<20' ../variants_freebayes/chinesetrio_fb.vcf.gz | \
bgzip -c > fb_filter.vcf.gz
In this case -s means "soft filter": values are set in the FILTER field and all the variants are left in the file. %QUAL<20 is our filtering threshold (see the bcftools documentation for how to specify other expressions).
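For illustration only, the core of that QUAL < 20 soft filter could be sketched in awk over an uncompressed VCF (input.vcf is a placeholder; bcftools additionally handles compressed input, existing FILTER annotations, and a full expression language that this sketch ignores):

```shell
# Toy QUAL-based soft filter: header lines pass through untouched; for
# records, set FILTER (column 7) to LowQual when QUAL (column 6) < 20,
# otherwise to PASS. Overwrites any existing FILTER value.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/  { print; next }
     { $7 = ($6 < 20 ? "LowQual" : "PASS"); print }' input.vcf
```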
In the end we get results like this:
chr20 34398628 . A C 0 LowQual AB=0;ABP=0;AC=0;AF=0;AN=6;AO=26;CIGAR=1X;DP=169;DPB=169;DPRA=0;EPP=43.4331;EPPR=3.07324;GTI=0;LEN=1;MEANALT=2;MQM=52.6538;MQMR=59.4203;NS=3;NUMALT=1;ODDS=50.5069;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=260;QR=4556;RO=138;RPL=2;RPP=43.4331;RPPR=4.01736;RPR=24;RUN=1;SAF=0;SAP=59.4686;SAR=26;SRF=113;SRP=124.865;SRR=25;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0/0:65:50,15:50:1718:15:152:0,-5.78839,-141.074 0/0:62:51,10:51:1723:10:93:0,-10.0033,-146.776 0/0:42:37,1:37:1115:1:15:0,-10.0897,-99.0743
chr20 34399330 . G T 5725.11 PASS AB=0.487603;ABP=3.17181;AC=5;AF=0.833333;AN=6;AO=216;CIGAR=1X;DP=280;DPB=280;DPRA=0;EPP=12.0581;EPPR=8.83536;GTI=0;LEN=1;MEANALT=1.33333;MQM=59.287;MQMR=60;NS=3;NUMALT=1;ODDS=44.2501;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=6932;QR=1912;RO=63;RPL=112;RPP=3.6537;RPPR=5.80219;RPR=104;RUN=1;SAF=79;SAP=36.829;SAR=137;SRF=20;SRP=21.2438;SRR=43;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0/1:121:62,59:62:1897:59:1915:-135.887,0,-134.53 1/1:100:0,99:0:0:99:3068:-276.127,-29.802,0 1/1:59:1,58:1:15:58:1949:-174.125,-16.2608,0
Here you can see one variant where the FILTER field has been set to "LowQual" and the other has been set to "PASS". There are a number of other software tools such as vcftools and vcflib that can also be used to accomplish this kind of filtering.
We can quickly check how many variants we've filtered like this:
module load bcftools
bcftools view -H fb_filter.vcf.gz | cut -f 7 | sort | uniq -c
bcftools view -H gatk_filter.vcf.gz | cut -f 7 | sort | uniq -c
bcftools view -H bcf_filter.vcf.gz | cut -f 7 | sort | uniq -c
cut isolates column 7, and the sort | uniq -c pair of commands sorts the output and then counts the occurrences of each unique value.
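The sort | uniq -c idiom works on any column; a toy run on made-up FILTER values shows the shape of the output (the counts are what matter; leading whitespace varies by platform):

```shell
# uniq -c only collapses adjacent duplicate lines, so sort must come first.
printf 'PASS\nLowQual\nPASS\nPASS\n' | sort | uniq -c
# → 1 LowQual, 3 PASS
```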
When you run these commands you will see dramatically different counts of variants between the methods. This is a result of two factors:
- Haplotype callers can return alleles spanning multiple nucleotides in a single VCF record, while bcftools typically returns a separate record for each variable nucleotide.
- Haplotype callers are better at weeding out errors, so they avoid returning a great deal of spurious variation.
There are many ways we can inspect these results (short of doing more lab work to validate individual variants) to get an idea of the quality of our call set. Here are a few:
- Test for Hardy-Weinberg equilibrium. If your samples come from a single population, variants that are out of equilibrium within it may be problematic.
- Test for mendelian errors. If you have a pedigree for your samples (as we do in this example data set), you can ask how frequently, or at which loci, mendelian inheritance is violated.
- Evaluate bulk statistics like the transition/transversion ratio for SNPs. ts/tv in humans should be around 2.1. The value expected from random error is 0.5. A set riddled with errors will have a lower than expected ts/tv ratio.
- Compare against a known set of variants. The reliability and expected similarity to the focal samples of the set of knowns both factor into how this comparison is interpreted.
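On the ts/tv point: transitions are A&lt;-&gt;G and C&lt;-&gt;T, 4 of the 12 possible ordered substitutions, which is why random error pushes the ratio toward 4/8 = 0.5. A minimal awk classifier over biallelic SNP records (REF in column 4, ALT in column 5; input.vcf is a placeholder for any uncompressed VCF) might look like this sketch:

```shell
# Classify biallelic SNPs as transitions (A<->G, C<->T) or transversions
# and report the ts/tv ratio. Header lines and non-SNP records are skipped.
awk 'BEGIN { FS = "\t" }
     /^#/ { next }
     length($4) == 1 && length($5) == 1 {
         pair = ($4 < $5 ? $4 $5 : $5 $4)   # order-independent key, e.g. "AG"
         if (pair == "AG" || pair == "CT") ts++; else tv++
     }
     END { printf "ts=%d tv=%d ts/tv=%.2f\n", ts, tv, (tv ? ts / tv : 0) }' input.vcf
```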
Here we'll briefly look at the second and third of these (mendelian errors and bulk statistics) using vt.
Let's try vt peek to summarize the output from freebayes:
module load vt
vt peek -f 'QUAL > 50' fb_filter.vcf.gz
You'll see a whole lot of output: it counts up the different types of variants, giving the ts/tv ratio for SNPs and the insertion/deletion ratio for indels. You can see that our SNP ts/tv is 1.8, lower than we expect. We can reverse the ">" to look at the low quality variants:
vt peek -f 'QUAL < 50' fb_filter.vcf.gz
If we have a look at the bcftools and GATK variant sets:
vt peek -f 'QUAL > 50' bcf_filter.vcf.gz
vt peek -f 'QUAL > 50' gatk_filter.vcf.gz
We see that they both have many more 'high quality' variants, but ts/tv is considerably lower for both. We could set more filters to improve these variant call sets.
Let's also have a look at mendelian violations, again using vt:
# create a pedigree file:
echo 'giabct son dad mom 1 -9' >ct.ped
echo 'giabct dad 0 0 1 -9' >>ct.ped
echo 'giabct mom 0 0 2 -9' >>ct.ped
cat ct.ped | sed 's/ /\t/g' >ct2.ped && mv ct2.ped ct.ped
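For a trio like this one, the PED file (columns: family_id, individual_id, father_id, mother_id, sex, phenotype, with sex coded 1 = male, 2 = female and -9 = missing phenotype) can also be written in a single printf with explicit tab separators, avoiding the sed pass:

```shell
# One record per individual: the son with both parents named, and the two
# founder parents with father/mother set to 0.
printf 'giabct\tson\tdad\tmom\t1\t-9\ngiabct\tdad\t0\t0\t1\t-9\ngiabct\tmom\t0\t0\t2\t-9\n' > ct.ped
```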
# run vt
vt profile_mendelian -f 'QUAL > 50' -p ct.ped fb_filter.vcf.gz
vt profile_mendelian -f 'QUAL > 50' -p ct.ped gatk_filter.vcf.gz
vt profile_mendelian -f 'QUAL > 50' -p ct.ped bcf_filter.vcf.gz
Again, we see that for biallelic loci filtered at Q > 50, freebayes has the lowest mendelian error rate, while bcftools has the highest.
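The core of a mendelian-consistency check is small: a child's genotype is consistent if one allele can be drawn from each parent. A toy shell function for biallelic, unphased genotypes (a sketch only, not vt's implementation) illustrates the logic:

```shell
# mendel_ok DAD MOM KID: return 0 if the child could inherit one allele
# from each parent. Genotypes are "0/1"-style biallelic, unphased calls.
mendel_ok() {
    dad=$1; mom=$2; kid=$3
    k1=${kid%/*}; k2=${kid#*/}
    for d in "${dad%/*}" "${dad#*/}"; do
        for m in "${mom%/*}" "${mom#*/}"; do
            # the child's two alleles, in either order, one per parent
            if { [ "$k1" = "$d" ] && [ "$k2" = "$m" ]; } || \
               { [ "$k1" = "$m" ] && [ "$k2" = "$d" ]; }; then
                return 0
            fi
        done
    done
    return 1
}

mendel_ok 0/1 0/0 0/1 && echo consistent   # one allele from each parent
mendel_ok 0/1 0/0 1/1 || echo violation    # mom cannot contribute a 1
```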
You may wish to explore the results from different variant callers to understand their performance. In that case, you might want to compare variant call sets from different approaches, or compare them to a known true set of variants. As mentioned above, the representation of variants may differ between variant callers. One way to handle this is to break down haplotype variants into their constituent parts and/or normalize them. Here we'll use the tool vcfallelicprimitives from vcflib to break down haplotype variants, and vt to compare the different sets (if you wanted to extract the intersection of, or the variants unique to, a pair of files, you could use bcftools isec or vcflib's vcfintersect).
As an example, vcfallelicprimitives takes this single VCF record, produced by freebayes:
chr20 33089668 . AGT GGC,AGC 3627.68 PASS AB=0.430168,0.601942;ABP=10.5923,12.3076;AC=3,2;AF=0.5,0.333333;AN=6;AO=77,62;CIGAR=1X1M1X,2M1X;DP=179;DPB=179.667;DPRA=0,0;EPP=30.1114,8.05372;EPPR=3.5385;GTI=0;LEN=3,1;MEANALT=2.66667,3;MQM=60,60;MQMR=60;NS=3;NUMALT=2;ODDS=82.337;PAIRED=1,1;PAIREDR=1;PAO=1.5,0.5;PQA=45,14;PQR=0;PRO=0;QA=2603,2183;QR=1197;RO=37;RPL=35,34;RPP=4.39215,4.27115;RPPR=3.5385;RPR=42,28;RUN=1,1;SAF=37,29;SAP=3.26411,3.57068;SAR=40,33;SRF=20;SRP=3.5385;SRR=17;TYPE=complex,snp GT:DP:AD:RO:QR:AO:QA:GL 1/2:67:0,25,41:0:0:25,41:859,1416:-186.837,-115.271,-106.842,-72.4793,0,-59.535 0/1:76:37,38,0:37:1197:38,0:1253,0:-90.3955,0,-85.3462,-101.534,-96.7853,-198.072 1/2:36:0,14,21:0:0:14,21:491,767:-102.821,-62.8774,-58.663,-40.1512,0,-33.8295
and returns these two records:
chr20 33089668 . A G 3627.68 PASS AC=3;AF=0.5;LEN=1;TYPE=snp GT 1|0 0|1 1|0
chr20 33089670 . T C 3627.68 PASS AC=5;AF=0.833333;LEN=1;TYPE=snp GT 1|1 0|1 1|1
The two records match the way these sites are represented by bcftools and gatk. The INFO field, genotype likelihood and allele coverage information are stripped out in this process because the data they contain may no longer apply to the genotypes in the new VCF records (though there is an option to retain that information).
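The position arithmetic behind that decomposition is easy to illustrate: each offset where REF and ALT disagree becomes its own SNP record at pos + offset. A toy awk sketch (vcfallelicprimitives itself also handles indels, genotypes, and multiple ALT alleles):

```shell
# AGT -> GGC at 33089668 differs at offsets 0 and 2, so the MNP breaks
# into SNPs at positions 33089668 (A->G) and 33089670 (T->C).
awk 'BEGIN {
    pos = 33089668; ref = "AGT"; alt = "GGC"
    for (i = 1; i <= length(ref); i++) {
        r = substr(ref, i, 1); a = substr(alt, i, 1)
        if (r != a) print pos + i - 1, r, a
    }
}'
# → 33089668 A G
# → 33089670 T C
```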
To ask how the variant callsets compare, we can do:
module load vcflib
vt partition -f 'QUAL > 50' <(vcfallelicprimitives fb_filter.vcf.gz) bcf_filter.vcf.gz
vt partition -f 'QUAL > 50' <(vcfallelicprimitives gatk_filter.vcf.gz) bcf_filter.vcf.gz
vt partition -f 'QUAL > 50' <(vcfallelicprimitives gatk_filter.vcf.gz) <(vcfallelicprimitives fb_filter.vcf.gz)
Here the code <(vcfallelicprimitives fb_filter.vcf.gz) is called a process substitution. It allows us, essentially, to pipe the results of the command that breaks up the haplotype alleles to vt without having to write a new VCF file to disk.
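Process substitution is plain bash, not anything vt-specific: `<(cmd)` exposes the output of cmd as a file-like path. A standalone demo with comm, which prints the lines common to two sorted inputs (run under bash; this syntax is not POSIX sh):

```shell
# Intersect two made-up sorted lists of variant keys; neither "file"
# handed to comm ever touches the disk.
comm -12 <(printf 'chr20:100\nchr20:200\n') <(printf 'chr20:200\nchr20:300\n')
# → chr20:200
```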
We can see that intersecting any pair of variant calls results in a very consistent 1.75 ts/tv.
If we want to look at the intersection of all three VCF files:
vt multi_partition -f 'QUAL > 50' <(vcfallelicprimitives gatk_filter.vcf.gz) <(vcfallelicprimitives fb_filter.vcf.gz) bcf_filter.vcf.gz
We can see there is a core set of variants that are called by all three methods, and for these, dramatically lowering or raising the quality score threshold does not impact their number.
In this case, where variant callers disagree quite strongly on a fairly large number of variants, it may be worth further exploring the variants where there is disagreement, focusing on statistics other than the quality score.
For reference, another commonly used tool for evaluating VCF files is vcfeval.
A discussion of the issue of variant normalization can be found here.