Note: If you are looking for the code to accompany this paper: Reyes-Herrera et al. (2023). Genome sequence data reveal at least two distinct incursions of the Tropical Race 4 variant of Fusarium wilt into South America. Phytopathology 113: 90–97, then please go to https://github.com/davidjstudholme/fusarium_TR4_genomics.
The purpose of this script is as part of a simple pipeline for parsing a set of SAMtools mpileup files to identify SNPs and tabulate the allele-frequencies at each SNP across a set of biological samples.
The scenario is that we have performed genomic re-sequencing on a set of several biological samples.
The genomic sequence reads have been aligned against a reference genome sequence using a tool such as BWA to generate a .bam
file.
We want to identify single-nucleotide positions in the genome that show variation between samples (i.e. SNPs) and we want to estimate the allele frequencies in each sample at each of those SNP sites.
- The samples could be individuals, for example each could be an individual plant or animal. If these individuals are diploid, then we expect that the allele frequencies will always be 0, 0.5 or 1. If the individuals are triploid then we would expect 0, 0.33, 0.67, or 1. And so on for other levels of ploidy. There is a discrete set of possible values for allele frequency.
- Alternatively, the samples could be populations rather than individuals; for example, they could be microbial cultures containing many millions of individuals of various different genotypes. In this case, the allele frequency depends on the relative abundance of each genotype within the population. There is a continuous set of possible values for allele frequency.
So, let's assume that we have one .bam
file for each sample (genomic sequence reads aligned against reference genome genome.fasta
).
The first step is to identify candidate SNPs using SAMtools version 1.6 and
BCFtools version 1.6:
for alignmentFile in *.bam; do samtools mpileup -u -f genome.fasta $alignmentFile > $alignmentFile.bcf; done
for alignmentFile in *.bam; do bcftools call -m -v -Ov $alignmentFile.bcf > $alignmentFile.vcf; done
So, now we have a set of .vcf
files containing candidate SNPs. Let's again use version 1.6 of BCFtools to filter these to keep only high-confidence ones:
for alignmentFile in *.bam; do bcftools filter --SnpGap 100 --include '(REF="A" | REF="C" | REF="G" | REF="T") & %QUAL>=35 & MIN(DP)>=5 & INDEL=0' $alignmentFile.vcf > $alignmentFile.filtered.vcf; done
This bcftools filter
step eliminates indels with low-confidence single-nucleotide variant calls.
It also eliminates candidate SNVs within 10 base pairs of an indel, since alignment artefacts are relatively common in the close vicinity of indels.
Our filtered.vcf
files now contain a catalogue of relatively high-confidence SNPs.
We also need the alignments in SAMtools' mpileup format. The SNPsFromPileups.pl
script will extract the allele frequencies from these files. These files are potentially very large and parsing every line is potentially slow. Therefore, the filtered.vcf
files are used to reduce the search space. That is, SNPsFromPileups.pl
ignores any genomic positions that are not listed in at least one of the filtered.vcf
files. (Note that we could have used the -l
option in samtools mpileup
if the filtered.vcf
were converted into .bed
format).
To generate the mpileup files (.pileup
):
for alignmentFile in *.bam; do samtools mpileup -f genome.fasta $alignmentFile > $alignmentFile.pileup; done
Now we are ready to generate the table of allele frequencies (considering only sites where read-coverage is at least 10x):
perl get_snps_from_pileups.pl 10 *.filtered.vcf *.pileup > snps.csv