-
Notifications
You must be signed in to change notification settings - Fork 3
depthFinder
christianparobek edited this page Nov 20, 2014
·
4 revisions
We need to determine depth over SNPs that occur in coding regions vs depth over SNPs that occur in non-coding regions. Why? Manske et al. found that in their 825 P. falciparum samples they had much greater depth over their coding SNPs vs their non-coding SNPs (~4x difference). We suspect the difference will not be as pronounced for us. However, we need to look.
grep "exon\|##" PlasmoDB-10.0_PvivaxSal1.gff | grep -v "gene" | head -n -1 > PlasmoDB-10.0_PvivaxSal1_exons.gff
# Make an exon-only GFF file (includes header)
# Grep out any gene entries that slipped in because they had "exon" in their description
# Remove the last line of the file, since it starts with '##'
grep "PASS\|^#" combined.qual.vcf > combined.pass.vcf
# combined.qual.vcf has ~250K SNP entries with filtering flags
# We need to select only the 'PASS'ing entries (and the header)
bedtools intersect -a combined.pass.vcf -b PlasmoDB-10.0_PvivaxSal1_exons.gff > combined.exons.vcf
# Get only VCF entries in combined.pass.vcf that are in exons
bedtools intersect -v -a combined.pass.vcf -b PlasmoDB-10.0_PvivaxSal1_exons.gff > combined.nonexons.vcf
# Get only VCF entries in combined.pass.vcf that are outside exons
When you add the number of lines in the exon file to the lines in the non-exon file, it sums to greater than the number of lines in the original VCF
. This is because in the GFF
file there are overlapping exons on opposite strands, and this makes a small handful of SNPs (~36) appear in the within-exon-SNP file twice.