SelectVariants
Once we have our VCF file, we need to mask troublesome regions and filter low-quality SNPs. The regions to mask are hypervariable/highly paralogous gene families (from Dan Neafsey's Nature Genetics paper) and tandem repeats (determined using TandemRepeatFinder). Fortunately, the GATK SelectVariants walker is a robust tool that can handle multiple (sometimes poorly formatted) lists of intervals to skip, and it can select variants using JEXL (Java Expression Language) expressions. Here is some generic code for how I am running this walker (as of 18 Sept 2014):
java -jar GenomeAnalysisTK.jar \
-T SelectVariants \
-R ref.fasta \
-XL trfExclude.intervals \
-XL neafseyExclude.intervals \
--variant sample.vcf \
-o sample.filtered.vcf
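The command above only masks intervals. As a rough sketch of how the low-quality SNP filtering could be folded into the same call, the snippet below adds `-selectType SNP` and a `-select` JEXL expression. The `QD > 2.0` threshold and the output file name are placeholders I made up for illustration, not the criteria actually used here. The snippet only prints the command unless `RUN=1` is set, so it can be reviewed first.

```shell
# Sketch only: the QD threshold and output name are hypothetical placeholders.
# Build the GATK call as positional parameters so quoting of the JEXL
# expression ("QD > 2.0") survives intact.
set -- java -jar GenomeAnalysisTK.jar \
  -T SelectVariants \
  -R ref.fasta \
  -XL trfExclude.intervals \
  -XL neafseyExclude.intervals \
  --variant sample.vcf \
  -selectType SNP \
  -select "QD > 2.0" \
  -o sample.snps.filtered.vcf

# Show the command for review; execute it only when RUN=1 is set.
echo "$*"
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Each `-XL` flag adds another exclusion list, so further masks can be appended the same way.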
See these links for more information about how I performed hypervariable gene masking and tandem repeat masking.
In their 2012 analysis of 227 P. falciparum isolates from around the world, Manske et al. used carefully considered criteria for keeping or discarding SNPs, which are described in their supplement. Here are my thoughts on how we should adapt their methods:
- Rare Allele Filtering:
- something else here