parasitehunter edited this page Sep 23, 2014 · 13 revisions

Once we have our VCF file, we need to mask troublesome regions - hypervariable/highly paralogous gene families (from Dan Neafsey's Nature Genetics paper) and tandem repeats (identified with Tandem Repeats Finder) - and filter out low-quality SNPs. Fortunately, the GATK SelectVariants walker is a robust tool that accepts multiple (sometimes poorly formatted) interval lists to exclude and can select records using JEXL (Java Expression Language) expressions. Here is some generic code showing how I am running this walker (as of 18 Sept 2014):

java -jar GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R ref.fasta \
        -XL trfExclude.intervals \
        -XL neafseyExclude.intervals \
        --variant sample.vcf \
        -o sample.filtered.vcf
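
The exclusion lists passed with -XL are easy to get subtly wrong when they start life in another format. As a minimal sketch, assuming the masked regions are first available as a tab-delimited BED file (0-based start, end-exclusive; this is an assumption, not necessarily how the lists above were built), the helper below rewrites each record as a GATK-style interval string (contig:start-stop, 1-based and inclusive). The function name bed_to_intervals and the file name trf.bed are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of GATK): convert BED records to
# GATK-style interval strings of the form contig:start-stop.
# BED starts are 0-based and ends are exclusive, so only the start
# needs shifting by one to get 1-based inclusive coordinates.
bed_to_intervals() {
    awk -F '\t' '
        # Skip comments, UCSC header lines, and lines with too few columns
        /^#/ || /^track/ || /^browser/ || NF < 3 { next }
        # Emit contig:start-stop with the start shifted to 1-based
        { printf "%s:%d-%d\n", $1, $2 + 1, $3 }
    ' "$1"
}
```

Usage would look like `bed_to_intervals trf.bed > trfExclude.intervals`. GATK is also documented as reading files with a .bed extension directly, so this conversion is only worth doing if a single interval format across all exclusion lists is preferred.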

Region Masking

See these links for more information about how I performed hypervariable gene masking and tandem repeat masking.

Quality Filtering

In their 2012 analysis of 227 P. falciparum isolates from around the world, Manske et al. used well-considered filtering criteria for keeping or discarding SNPs, which are described in their supplement. Here are my thoughts on how we should adapt their methods:

  1. Rare Allele Filtering:
  2. something else here