Skip to content
christianparobek edited this page Nov 20, 2014 · 4 revisions

We need to determine depth over SNPs that occur in coding regions vs depth over SNPs that occur in non-coding regions. Why? Manske et al. found that in their 825 P. falciparum samples they had much greater depth over their coding SNPs vs their non-coding SNPs (~4x difference). We suspect the difference will not be as pronounced for us. However, we need to look.

grep "exon\|##" PlasmoDB-10.0_PvivaxSal1.gff | grep -v "gene" | head -n -1 > PlasmoDB-10.0_PvivaxSal1_exons.gff
    # Make an exon-only GFF file (includes header)
    # Grep out any gene entries that slipped in because they had "exon" in their description
    # Remove the last line of the file, since it starts with '##' 
grep "PASS\|^#" combined.qual.vcf > combined.pass.vcf
    # combined.qual.vcf has ~250K SNP entries with filtering flags
    # We need to select only the 'PASS'ing entries (and the header)
bedtools intersect -a combined.pass.vcf -b PlasmoDB-10.0_PvivaxSal1_exons.gff > combined.exons.vcf
    # Get only VCF entries in combined.pass.vcf that are in exons
bedtools intersect -v -a combined.pass.vcf -b PlasmoDB-10.0_PvivaxSal1_exons.gff > combined.nonexons.vcf
    # Get only VCF entries in combined.pass.vcf that are outside exons

When you add the number of lines in the exon file to the lines in the non-exon file, it sums to greater than the number of lines in the original VCF. This is because in the GFF file there are overlapping exons on opposite strands, and this makes a small handful of SNPs (~36) appear in the within-exon-SNP file twice.

Clone this wiki locally