This is a Snakemake workflow for running SINGER (MCMC sampling of ancestral recombination graphs) in parallel (e.g. across chunks of sequence). The genome is discretized into chunks, and SINGER is run on each chunk with parameters adjusted to account for missing sequence and recombination rate heterogeneity. Chunks are then merged into a single tree sequence per chromosome and MCMC replicate. At the end, diagnostic plots are produced that compare summary statistics to their expectations given the ARG topology, and pair coalescence rates are calculated from the tree sequences and plotted.
Please cite SINGER if you use this pipeline (note that I'm not one of the authors of SINGER).
Using git, mamba, and pip:
git clone https://github.com/nspope/singer-snakemake my-singer-run && cd my-singer-run
mamba install -c bioconda snakemake
python3 -m pip install -r requirements.txt
snakemake --cores=20 --configfile=config/example_config.yaml
The input files for each chromosome are:
- chromosome_name.vcf.gz gzip'd VCF that can be used as SINGER input, either diploid and phased or haploid with an even number of samples
- chromosome_name.mask.bed (optional) bed file containing inaccessible intervals
- chromosome_name.hapmap (optional) recombination map in the format described in the documentation for msprime.RateMap.read_hapmap
- chromosome_name.meta.csv (optional) csv containing metadata for each sample in the VCF, which will be inserted into the output tree sequences. The first row should be the field names, with subsequent rows for every sample in the VCF.
See example/*.
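The metadata layout described above (a header row of field names, then one row per VCF sample) can be checked before launching the workflow. The following is a minimal sketch and not part of the pipeline: the function `check_metadata`, the sample names, and the `id` and `population` fields are all hypothetical, and it only verifies the row count, assuming that rows are listed in the same order as the samples in the VCF.

```python
import csv
import io


def check_metadata(csv_text, vcf_samples):
    """Parse metadata and check that there is one row per VCF sample."""
    # DictReader treats the first row as the field names
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) != len(vcf_samples):
        raise ValueError(
            f"expected {len(vcf_samples)} metadata rows, found {len(rows)}"
        )
    return rows


# Hypothetical sample names and metadata fields, for illustration only
samples = ["ind0", "ind1", "ind2"]
meta = "id,population\nind0,A\nind1,A\nind2,B\n"
rows = check_metadata(meta, samples)
print(rows[2]["population"])
```

In practice the text would be read from chromosome_name.meta.csv and the sample names from the VCF header.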
A template for the configuration file is in config/example_config.yaml:
# --- example_config.yaml ---
input-dir: "example" # directory with input files per chromosome, named "chrom.vcf.gz", "chrom.hapmap", "chrom.mask.bed", "chrom.meta.csv"
chunk-size: 1e6 # target size in base pairs for each singer run
max-missing: 0.975 # ignore chunks with more than this proportion of missing bases
mutation-rate: 1e-8 # per base per generation mutation rate
recombination-rate: 1e-8 # per base per generation recombination rate, ignored if hapmap is present
polarised: True # are variants polarised so that the reference state is ancestral
mcmc-samples: 10 # number of MCMC samples (each sample is a tree sequence)
mcmc-thin: 10 # thinning interval between MCMC samples
mcmc-burnin: 0.2 # proportion of initial samples discarded when computing plots of statistics
mcmc-resumes: 1000 # maximum number of times to try to resume MCMC on error at a given iteration
coalrate-intervals: 25 # number of time intervals to calculate coalescence rates within
stratify-by: "population" # stratify cross coalescence rates by this column in the metadata, or None
random-seed: 1 # random seed
singer-binary: "resources/singer-0.1.8-beta-linux-x86_64/singer" # TODO: automatically fetch from SINGER repo; this version is needed for -resume flag
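To illustrate how chunk-size, max-missing, and mutation-rate interact, here is a rough sketch of the chunking arithmetic. It is not the pipeline's actual code: the function names are hypothetical, the chunk boundaries are simply spaced evenly, and masked bed intervals are assumed to be non-overlapping and the only source of missing bases. The adjusted rate follows the proportion_accessible_bases * mutation_rate rule used for the adjusted mutation rate output.

```python
def make_chunks(seq_length, chunk_size):
    """Split [0, seq_length) into evenly sized chunks close to the target size."""
    n = max(1, round(seq_length / chunk_size))
    edges = [round(i * seq_length / n) for i in range(n + 1)]
    return list(zip(edges[:-1], edges[1:]))


def masked_bases(start, end, mask):
    """Count bases in [start, end) covered by non-overlapping mask intervals."""
    return sum(max(0, min(end, b) - max(start, a)) for a, b in mask)


def adjusted_rates(seq_length, mask, chunk_size, mutation_rate, max_missing):
    """Return (start, end, adjusted mutation rate) for every retained chunk."""
    retained = []
    for start, end in make_chunks(seq_length, chunk_size):
        missing = masked_bases(start, end, mask) / (end - start)
        if missing > max_missing:
            continue  # too much missing sequence, chunk is ignored
        retained.append((start, end, (1 - missing) * mutation_rate))
    return retained


# Hypothetical 3 Mb chromosome with two masked intervals
mask = [(0, 990_000), (1_500_000, 2_000_000)]
chunks = adjusted_rates(3_000_000, mask, chunk_size=1e6,
                        mutation_rate=1e-8, max_missing=0.975)
for start, end, mu in chunks:
    print(start, end, mu)
```

In this toy case the first chunk is 99% masked and is dropped, the second is half masked so its mutation rate is halved, and the third is unchanged.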
The output files for each chromosome will be generated in results/<chromosome_name>:
- <chromosome_name>.adjusted_mu.p : msprime.RateMap containing adjusted mutation rates (proportion_accessible_bases * mutation_rate) in each chunk
- <chromosome_name>.vcf.stats.p : "observed values" for summary statistics (e.g. calculated with scikit-allel)
- <chromosome_name>.vcf : filtered VCF used as input to SINGER
- chunks/* the raw SINGER output and logs
- plots/pair-coalescence-rates.png : pair coalescence rates (e.g. inverse of haploid Ne) within equally-spaced quantiles of the empirical distribution of pair coalescence times for all samples, with a thin line for each MCMC replicate and a thick line for the posterior mean
- plots/cross-coalescence-rates.png : pair coalescence rates within and between strata (if supplied) within equally-spaced quantiles of the empirical distribution of pair coalescence times
- plots/diversity-trace.png, plots/tajima-d-trace.png : MCMC trace for fitted nucleotide diversity and Tajima's D
- plots/diversity-scatter.png, plots/tajima-d-scatter.png : observed vs fitted nucleotide diversity and Tajima's D, across chunks
- plots/diversity-skyline.png, plots/tajima-d-skyline.png : observed and fitted nucleotide diversity and Tajima's D, across genome position
- plots/folded-afs.png, plots/unfolded-afs.png : observed vs fitted site frequency spectra
- plots/site-density.png : sanity check showing the proportion of missing data, the proportion of variant bases (out of accessible bases), and the recombination rate across genome position
- stats/<chromosome_name>.<replicate>.stats.p : "fitted values" for summary statistics (e.g. branch-mode statistics calculated with tskit) in each chunk
- stats/<chromosome_name>.<replicate>.coalrate.p : pair coalescence rates (e.g. inverse of haploid Ne) within equally-spaced quantiles of the empirical distribution of pair coalescence times, using all samples
- stats/<chromosome_name>.<replicate>.crossrate.p : cross coalescence rates within equally-spaced quantiles of the empirical distribution of pair coalescence times, between and within strata (e.g. populations) according to the stratify-by option in the config file
- trees/<chromosome_name>.<replicate>.trees : a tree sequence MCMC replicate generated by SINGER
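For downstream analysis it can be handy to collect the tree sequences that remain after burn-in. The sketch below builds the expected paths from the layout listed above. It rests on assumptions that should be checked against an actual run: the helper name is hypothetical, replicates are assumed to be numbered from 0, and the number of discarded replicates is assumed to be the burn-in proportion times mcmc-samples, rounded down.

```python
from pathlib import Path


def post_burnin_trees(results_dir, chrom, mcmc_samples, mcmc_burnin):
    """List tree sequence paths for the replicates kept after burn-in."""
    first_kept = int(mcmc_burnin * mcmc_samples)  # assumption: round down
    trees_dir = Path(results_dir) / chrom / "trees"
    return [
        trees_dir / f"{chrom}.{rep}.trees"  # assumption: 0-based replicates
        for rep in range(first_kept, mcmc_samples)
    ]


# Hypothetical chromosome name, with the example config's MCMC settings
paths = post_burnin_trees("results", "chr1", mcmc_samples=10, mcmc_burnin=0.2)
print(len(paths), paths[0].as_posix())

# Each file could then be opened with tskit (not run here), e.g.
#   import tskit
#   ts = tskit.load(str(paths[0]))
#   print(ts.diversity(mode="branch"))
```

The .p files are pickles and can be read with pickle.load. The type of the stored object is only stated above for adjusted_mu.p (an msprime.RateMap).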