virMasking
Hypervariable and highly paralogous genes should be filtered from our genome. DN identified six especially complex gene families: msp3, pvfam-c, vir, SERA, msp7, and pvstp.
To filter these genes (and their associated intervals) from our VCF files, I searched PlasmoDB on 2014-09-08 using the following approaches:
- Search for "merozoite" and basket all msp3- and msp7-related hits (n=22).
- Search "SERA" and "serine-repeat" and basket all "serine-repeat" hits (n=13).
- Search "vir" and "variable surface" and basket all "vir", "vir-like", and "vir-related" hits (n=398).
- Search "fam-c" and basket all "fam-c" hits (n=7).
- I cannot figure out what DN meant by "PvSTP" in his paper. However, perhaps it stands for "serine/threonine phosphatase". Search for "serine/threonine" and basket everything with "serine/threonine phosphatase" (n=6). There are actually many more "serine/threonine kinases", but I'm sticking with phosphatases for now.
Next, I excluded all non-chromosomal genes (with contig names AAKM*, n=254), and filtered my VCF against these intervals with GATK's SelectVariants -XL interval.list.
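As a rough sketch of the step above: drop the genes on AAKM* contigs, turn the rest into `contig:start-end` lines, and assemble a SelectVariants call. The gene records, chromosome names, and file names (`ref.fasta`, `input.vcf`, `masked.vcf`, `GenomeAnalysisTK.jar`) are made-up placeholders, and the command assumes GATK 3-style arguments, so adjust for whatever version is installed.

```python
# Hypothetical gene records exported from PlasmoDB: (contig, start, end), 1-based inclusive.
genes = [
    ("Pv_Sal1_chr01", 1000, 2500),
    ("AAKM01000012", 300, 900),       # unplaced contig, should be dropped
    ("Pv_Sal1_chr10", 50000, 52000),
]


def drop_unplaced(records, prefix="AAKM"):
    """Keep only records whose contig name does not start with the prefix."""
    return [r for r in records if not r[0].startswith(prefix)]


def interval_lines(records):
    """Format records as GATK-style 'contig:start-end' strings."""
    return [f"{contig}:{start}-{end}" for contig, start, end in records]


chromosomal = drop_unplaced(genes)
lines = interval_lines(chromosomal)

# Assumed GATK 3-style invocation; -XL excludes the listed intervals.
# This list is only built here, not executed.
cmd = [
    "java", "-jar", "GenomeAnalysisTK.jar",
    "-T", "SelectVariants",
    "-R", "ref.fasta",
    "-V", "input.vcf",
    "-XL", "interval.list",
    "-o", "masked.vcf",
]

print("\n".join(lines))
print(" ".join(cmd))
```

The `lines` list is what would be written to `interval.list`, one interval per line.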
We also want to do the same for 3D7, so exclude the VARs, STEVORs, and RIFINs:
- Search for "var"
- Search for "stevor"
- Search for "rifin"
- Go through and validate each result by hand, removing any that don't belong to these families.
For both SAL1 and 3D7, we want to get this list of highly paralogous genes into the GATK .intervals format.
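One possible way to get from a gene list to an intervals file is sketched below. It assumes the genes are already available as (contig, start, end) tuples with 1-based inclusive coordinates, and it merges overlapping or adjacent genes so the file stays short. The merging is my own convenience step, not something GATK needs, and the contig names are invented for the example.

```python
from collections import defaultdict


def merge_intervals(records):
    """Merge overlapping or directly adjacent (contig, start, end) records."""
    by_contig = defaultdict(list)
    for contig, start, end in records:
        by_contig[contig].append((start, end))

    merged = []
    # Contigs are sorted by name here, which may differ from reference order.
    for contig in sorted(by_contig):
        spans = sorted(by_contig[contig])
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:          # overlaps or touches the current span
                cur_end = max(cur_end, end)
            else:
                merged.append((contig, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((contig, cur_start, cur_end))
    return merged


def format_intervals(records):
    """Render records as the text of a .intervals file, one 'contig:start-end' per line."""
    return "".join(f"{c}:{s}-{e}\n" for c, s, e in records)


# Hypothetical paralogous genes on two made-up contigs.
example = [
    ("chrA", 150, 300),
    ("chrA", 100, 200),
    ("chrA", 400, 500),
    ("chrB", 10, 20),
]
merged_example = merge_intervals(example)
intervals_text = format_intervals(merged_example)
print(intervals_text, end="")
```

Writing `intervals_text` to a file such as `paralogs.intervals` (a hypothetical name) gives something that can be passed to `-XL`.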
In addition to masking the paralogous genes, we need to mask the subtelomeric regions. I came to this conclusion from looking at some of our early Pv genotype calls, which looked great everywhere except in the telomeres. For both species, I selected all the hypervariable gene families in PlasmoDB and then looked at them in the "Genome View". For those telomeres with lots of paralogous family members, I just removed that telomere entirely, all the way in until the paralogs thinned out.
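Once the cut points have been picked by eye, the subtelomeric masks can be written out in the same `contig:start-end` form. This is only a sketch: the chromosome name, length, and boundaries below are invented placeholders, not the values actually chosen in the Genome View.

```python
def subtelomere_mask(contig, length, left_end=None, right_start=None):
    """Return mask records for one chromosome.

    left_end:    last base of the left subtelomeric mask (None = no left mask)
    right_start: first base of the right subtelomeric mask (None = no right mask)
    Coordinates are 1-based and inclusive.
    """
    records = []
    if left_end is not None:
        records.append((contig, 1, left_end))
    if right_start is not None:
        records.append((contig, right_start, length))
    return records


# Hypothetical chromosome: 1,000,000 bp, paralogs thin out after 80 kb on the
# left and before 950 kb on the right.
mask = subtelomere_mask("chrA", 1_000_000, left_end=80_000, right_start=950_000)
mask_lines = [f"{c}:{s}-{e}" for c, s, e in mask]
print("\n".join(mask_lines))
```

These lines can be appended to the paralogous-gene intervals so that a single file covers both kinds of mask.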