This section of the tutorial deals with filtering and comparing sets of variants. It assumes you have completed Parts 2, 4a, and 4b, so that you have three VCF files, one from each of the three variant calling approaches, in the directory structure established in the tutorial.
Steps here will use the following software packages: bcftools, vt, and vcflib.
Each major step below has an associated bash script tailored to the UConn CBC Xanadu cluster with appropriate headers for the Slurm job scheduler. The code can easily be modified to run interactively, or in other contexts.
Once you call variants on a given set of samples, you face a couple of issues:
- To varying degrees, most variant callers have high sensitivity, but low specificity, outputting many false positive results. Thus, depending on your application, raw variant sets may need to be filtered.
- You may want to compare variant sets generated by different callers, or connect them with those found in a database. Different variant callers, however, may output variants in slightly different ways due to the treatment of haplotype alleles and the ambiguity of representation inherent to some complex variants. In order to ensure apples-to-apples comparisons, you may need to break down haplotype variants into their constituent pieces and normalize their representation in your VCF file.
In this section of the tutorial, we'll cover ways you can filter and compare sets of variants.
First we'll create a new directory to hold the results we'll generate. Make sure you're in the directory vc_workshop and type:
mkdir -p filtered_vcfs
cd filtered_vcfs
There are two main approaches to dealing with the possibility that called variation is artifactual. The most common is called hard filtering. In hard filtering, you make a binary choice: first set thresholds on summary statistics describing your variants, then exclude from analysis any variants that fail those thresholds. Most variant calling tools output a wide array of statistics that can be used for this purpose. As we saw in Part 2, these statistics are often found in the INFO field (column 8 of the VCF), but a key quantity is the variant quality (column 6 of the VCF, QUAL). The variant quality accounts for all the factors the variant caller uses to judge the evidence of variation at the site and summarizes them in a single phred-scaled quality score. The main weaknesses of this approach are 1) that there is generally no threshold or combination of thresholds that perfectly discriminates true positives from false positives and 2) certain kinds of variants (e.g. rare ones) may be systematically less likely to pass filters, resulting in a biased set.
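To unpack "phred-scaled": a quality Q corresponds to an estimated error probability of 10^(-Q/10), so a threshold of QUAL < 20 flags sites with more than a 1-in-100 chance of being wrong. A quick one-liner to check the arithmetic:

```shell
# phred Q -> probability the call is wrong: P = 10^(-Q/10)
awk 'BEGIN { q = 20; printf "%.4f\n", 10^(-q/10) }'
# → 0.0100
```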
The second approach is to use a statistical model to directly incorporate the uncertainty in whether variation at a site is real, and in the genotypes of individuals at that site, into downstream analyses. These approaches do not make binary decisions on variants or genotypes, but essentially average the results of downstream analyses over all possibilities, weighted by their probabilities, conditional on the model. The main weakness of this approach (aside from tools not being available in every case) is that it is conditioned on the variant calling model accurately characterizing the uncertainty.
Which of these approaches you take will depend on the sensitivity of your analyses to the presence of false positives and false negatives, and on whether tools are available to account for uncertainty. It has been shown that for some applications in population genetics, hard filtering can cause biases (see Korneliussen et al.), and there are likely to be other such cases.
Here we will demonstrate a hard filtering approach.
In this example, we'll use bcftools to set a simple threshold on the quality score. We could exclude all the low quality variants from the new file, but in this instance we'll simply use the FILTER field (column 7 in the VCF) to flag them as either PASS or LowQual:
bcftools filter -s LowQual -e '%QUAL<20' ../variants_freebayes/chinesetrio_fb.vcf.gz | \
bgzip -c > fb_filter.vcf.gz
In this case -s means "soft filter": values are set in the FILTER field and all the variants are left in the file. %QUAL<20 is our filtering threshold (see the bcftools documentation for how to specify other expressions).
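For illustration only, the core of that QUAL < 20 soft filter could be sketched in awk over an uncompressed VCF (input.vcf is a placeholder; bcftools additionally handles compressed input, existing FILTER annotations, and a full expression language that this sketch ignores):

```shell
# Toy QUAL-based soft filter: header lines pass through untouched; for
# records, set FILTER (column 7) to LowQual when QUAL (column 6) < 20,
# otherwise to PASS. Overwrites any existing FILTER value.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/  { print; next }
     { $7 = ($6 < 20 ? "LowQual" : "PASS"); print }' input.vcf
```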
In the end we get results like this:
chr20 34398628 . A C 0 LowQual AB=0;ABP=0;AC=0;AF=0;AN=6;AO=26;CIGAR=1X;DP=169;DPB=169;DPRA=0;EPP=43.4331;EPPR=3.07324;GTI=0;LEN=1;MEANALT=2;MQM=52.6538;MQMR=59.4203;NS=3;NUMALT=1;ODDS=50.5069;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=260;QR=4556;RO=138;RPL=2;RPP=43.4331;RPPR=4.01736;RPR=24;RUN=1;SAF=0;SAP=59.4686;SAR=26;SRF=113;SRP=124.865;SRR=25;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0/0:65:50,15:50:1718:15:152:0,-5.78839,-141.074 0/0:62:51,10:51:1723:10:93:0,-10.0033,-146.776 0/0:42:37,1:37:1115:1:15:0,-10.0897,-99.0743
chr20 34399330 . G T 5725.11 PASS AB=0.487603;ABP=3.17181;AC=5;AF=0.833333;AN=6;AO=216;CIGAR=1X;DP=280;DPB=280;DPRA=0;EPP=12.0581;EPPR=8.83536;GTI=0;LEN=1;MEANALT=1.33333;MQM=59.287;MQMR=60;NS=3;NUMALT=1;ODDS=44.2501;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=6932;QR=1912;RO=63;RPL=112;RPP=3.6537;RPPR=5.80219;RPR=104;RUN=1;SAF=79;SAP=36.829;SAR=137;SRF=20;SRP=21.2438;SRR=43;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0/1:121:62,59:62:1897:59:1915:-135.887,0,-134.53 1/1:100:0,99:0:0:99:3068:-276.127,-29.802,0 1/1:59:1,58:1:15:58:1949:-174.125,-16.2608,0
Here you can see one variant where the FILTER field has been set to "LowQual" and the other has been set to "PASS". There are a number of other software tools such as vcftools and vcflib that can also be used to accomplish this kind of filtering.
We can quickly check how many variants we've filtered like this:
module load bcftools
bcftools view -H fb_filter.vcf.gz | cut -f 7 | sort | uniq -c
bcftools view -H gatk_filter.vcf.gz | cut -f 7 | sort | uniq -c
bcftools view -H bcf_filter.vcf.gz | cut -f 7 | sort | uniq -c
cut isolates column 7, and the sort | uniq -c pair of commands sorts the output and then counts the occurrences of each unique value.
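The sort | uniq -c idiom works on any column; a toy run on made-up FILTER values shows the shape of the output (the counts are what matter; leading whitespace varies by platform):

```shell
# uniq -c only collapses adjacent duplicate lines, so sort must come first.
printf 'PASS\nLowQual\nPASS\nPASS\n' | sort | uniq -c
# → 1 LowQual, 3 PASS
```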
When you run these commands you will see dramatically different counts of variants between the methods. This is a result of two factors:
- Haplotype callers can return alleles spanning multiple nucleotides in a single VCF record, while bcftools typically returns a separate record for each variable nucleotide.
- Haplotype callers are better at weeding out errors, so they avoid returning a great deal of spurious variation.
There are many ways we can inspect these results (short of doing more lab work to validate individual variants) to get an idea of the quality of our call set. Here are a few:
- Test for Hardy-Weinberg equilibrium. If your samples come from a single population, variants that are out of equilibrium within it may be problematic.
- Test for mendelian errors. If you have a pedigree for your samples (as we do in this example data set), you can ask how frequently, or at which loci, mendelian inheritance is violated.
- Evaluate bulk statistics like the transition/transversion ratio for SNPs. ts/tv in humans should be around 2.1. The value expected from random error is 0.5. A set riddled with errors will have a lower than expected ts/tv ratio.
- Compare against a known set of variants. The reliability and expected similarity to the focal samples of the set of knowns both factor into how this comparison is interpreted.
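On the ts/tv point: transitions are A&lt;-&gt;G and C&lt;-&gt;T, 4 of the 12 possible ordered substitutions, which is why random error pushes the ratio toward 4/8 = 0.5. A minimal awk classifier over biallelic SNP records (REF in column 4, ALT in column 5; input.vcf is a placeholder for any uncompressed VCF) might look like this sketch:

```shell
# Classify biallelic SNPs as transitions (A<->G, C<->T) or transversions
# and report the ts/tv ratio. Header lines and non-SNP records are skipped.
awk 'BEGIN { FS = "\t" }
     /^#/ { next }
     length($4) == 1 && length($5) == 1 {
         pair = ($4 < $5 ? $4 $5 : $5 $4)   # order-independent key, e.g. "AG"
         if (pair == "AG" || pair == "CT") ts++; else tv++
     }
     END { printf "ts=%d tv=%d ts/tv=%.2f\n", ts, tv, (tv ? ts / tv : 0) }' input.vcf
```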
Here we'll briefly look at the second and third of these (mendelian errors and bulk statistics) using vt.
Let's try vt peek to summarize the output from freebayes:
module load vt
vt peek -f 'QUAL > 50' fb_filter.vcf.gz
You'll see a whole lot of output: it counts up the different types of variants, giving the ts/tv ratio for SNPs and the insertion/deletion ratio for indels. You can see that our SNP ts/tv is 1.8, lower than we expect. We can reverse the ">" to look at the low quality variants:
vt peek -f 'QUAL < 50' fb_filter.vcf.gz
If we have a look at the bcftools and GATK variant sets:
vt peek -f 'QUAL > 50' bcf_filter.vcf.gz
vt peek -f 'QUAL > 50' gatk_filter.vcf.gz
We see that they both have many more 'high quality' variants, but ts/tv is considerably lower for both. We could set more filters to improve these variant call sets.
Let's also have a look at mendelian violations, again using vt:
# create a pedigree file:
echo 'giabct son dad mom 1 -9' >ct.ped
echo 'giabct dad 0 0 1 -9' >>ct.ped
echo 'giabct mom 0 0 2 -9' >>ct.ped
cat ct.ped | sed 's/ /\t/g' >ct2.ped && mv ct2.ped ct.ped
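For a trio like this one, the PED file (columns: family_id, individual_id, father_id, mother_id, sex, phenotype, with sex coded 1 = male, 2 = female and -9 = missing phenotype) can also be written in a single printf with explicit tab separators, avoiding the sed pass:

```shell
# One record per individual: the son with both parents named, and the two
# founder parents with father/mother set to 0.
printf 'giabct\tson\tdad\tmom\t1\t-9\ngiabct\tdad\t0\t0\t1\t-9\ngiabct\tmom\t0\t0\t2\t-9\n' > ct.ped
```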
# run vt
vt profile_mendelian -f 'QUAL > 50' -p ct.ped fb_filter.vcf.gz
vt profile_mendelian -f 'QUAL > 50' -p ct.ped gatk_filter.vcf.gz
vt profile_mendelian -f 'QUAL > 50' -p ct.ped bcf_filter.vcf.gz
Again, we see that for biallelic loci filtered at Q > 50, freebayes has the lowest mendelian error rate, while bcftools has the highest.
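The core of a mendelian-consistency check is small: a child's genotype is consistent if one allele can be drawn from each parent. A toy shell function for biallelic, unphased genotypes (a sketch only, not vt's implementation) illustrates the logic:

```shell
# mendel_ok DAD MOM KID: return 0 if the child could inherit one allele
# from each parent. Genotypes are "0/1"-style biallelic, unphased calls.
mendel_ok() {
    dad=$1; mom=$2; kid=$3
    k1=${kid%/*}; k2=${kid#*/}
    for d in "${dad%/*}" "${dad#*/}"; do
        for m in "${mom%/*}" "${mom#*/}"; do
            # the child's two alleles, in either order, one per parent
            if { [ "$k1" = "$d" ] && [ "$k2" = "$m" ]; } || \
               { [ "$k1" = "$m" ] && [ "$k2" = "$d" ]; }; then
                return 0
            fi
        done
    done
    return 1
}

mendel_ok 0/1 0/0 0/1 && echo consistent   # one allele from each parent
mendel_ok 0/1 0/0 1/1 || echo violation    # mom cannot contribute a 1
```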
You may wish to explore the results from different variant callers to understand their performance. In that case, you might want to compare variant call sets from different approaches, or compare them to a known true set of variants. As mentioned above, the representation of variants may differ between variant callers. One way to handle this is to break down haplotype variants into their constituent parts and/or normalize them. Here we'll use the tool vcfallelicprimitives from vcflib to break down haplotype variants, and vt to compare the different sets (if you wanted to extract the intersection of, or the variants unique to, a pair of files, you could use bcftools isec or vcflib's vcfintersect).
As an example, vcfallelicprimitives takes this single VCF record, produced by freebayes:
chr20 33089668 . AGT GGC,AGC 3627.68 PASS AB=0.430168,0.601942;ABP=10.5923,12.3076;AC=3,2;AF=0.5,0.333333;AN=6;AO=77,62;CIGAR=1X1M1X,2M1X;DP=179;DPB=179.667;DPRA=0,0;EPP=30.1114,8.05372;EPPR=3.5385;GTI=0;LEN=3,1;MEANALT=2.66667,3;MQM=60,60;MQMR=60;NS=3;NUMALT=2;ODDS=82.337;PAIRED=1,1;PAIREDR=1;PAO=1.5,0.5;PQA=45,14;PQR=0;PRO=0;QA=2603,2183;QR=1197;RO=37;RPL=35,34;RPP=4.39215,4.27115;RPPR=3.5385;RPR=42,28;RUN=1,1;SAF=37,29;SAP=3.26411,3.57068;SAR=40,33;SRF=20;SRP=3.5385;SRR=17;TYPE=complex,snp GT:DP:AD:RO:QR:AO:QA:GL 1/2:67:0,25,41:0:0:25,41:859,1416:-186.837,-115.271,-106.842,-72.4793,0,-59.535 0/1:76:37,38,0:37:1197:38,0:1253,0:-90.3955,0,-85.3462,-101.534,-96.7853,-198.072 1/2:36:0,14,21:0:0:14,21:491,767:-102.821,-62.8774,-58.663,-40.1512,0,-33.8295
and returns these two records:
chr20 33089668 . A G 3627.68 PASS AC=3;AF=0.5;LEN=1;TYPE=snp GT 1|0 0|1 1|0
chr20 33089670 . T C 3627.68 PASS AC=5;AF=0.833333;LEN=1;TYPE=snp GT 1|1 0|1 1|1
The two records match the way these sites are represented by bcftools and gatk. The INFO field, genotype likelihood and allele coverage information are stripped out in this process because the data they contain may no longer apply to the genotypes in the new VCF records (though there is an option to retain that information).
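The position arithmetic behind that decomposition is easy to illustrate: each offset where REF and ALT disagree becomes its own SNP record at pos + offset. A toy awk sketch (vcfallelicprimitives itself also handles indels, genotypes, and multiple ALT alleles):

```shell
# AGT -> GGC at 33089668 differs at offsets 0 and 2, so the MNP breaks
# into SNPs at positions 33089668 (A->G) and 33089670 (T->C).
awk 'BEGIN {
    pos = 33089668; ref = "AGT"; alt = "GGC"
    for (i = 1; i <= length(ref); i++) {
        r = substr(ref, i, 1); a = substr(alt, i, 1)
        if (r != a) print pos + i - 1, r, a
    }
}'
# → 33089668 A G
# → 33089670 T C
```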
To ask how the variant callsets compare, we can do:
module load vcflib
vt partition -f 'QUAL > 50' <(vcfallelicprimitives fb_filter.vcf.gz) bcf_filter.vcf.gz
vt partition -f 'QUAL > 50' <(vcfallelicprimitives gatk_filter.vcf.gz) bcf_filter.vcf.gz
vt partition -f 'QUAL > 50' <(vcfallelicprimitives gatk_filter.vcf.gz) <(vcfallelicprimitives fb_filter.vcf.gz)
Here the code <(vcfallelicprimitives fb_filter.vcf.gz) is called a process substitution. It allows us, essentially, to pipe the results of the command that breaks up the haplotype alleles to vt without having to write a new VCF file to disk.
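Process substitution is plain bash, not anything vt-specific: `<(cmd)` exposes the output of cmd as a file-like path. A standalone demo with comm, which prints the lines common to two sorted inputs (run under bash; this syntax is not POSIX sh):

```shell
# Intersect two made-up sorted lists of variant keys; neither "file"
# handed to comm ever touches the disk.
comm -12 <(printf 'chr20:100\nchr20:200\n') <(printf 'chr20:200\nchr20:300\n')
# → chr20:200
```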
We can see that intersecting any pair of variant calls results in a very consistent 1.75 ts/tv.
If we want to look at the intersection of all three VCF files:
vt multi_partition -f 'QUAL > 50' <(vcfallelicprimitives gatk_filter.vcf.gz) <(vcfallelicprimitives fb_filter.vcf.gz) bcf_filter.vcf.gz
We can see there is a core set of variants that are called by all three methods, and for these, dramatically lowering or raising the quality score threshold does not impact their number.
In this case, where variant callers disagree quite strongly on a fairly large number of variants, it may be worth further exploring the variants where there is disagreement, focusing on statistics other than the quality score.
For reference, another commonly used tool for evaluating VCF files is vcfeval.
A discussion of the issue of variant normalization can be found here.