+ Statement of Need
+ Understanding plant biology benefits from ecosystem-scale analysis
+ of genetic variation, and increasingly demands the characterisation of
+ not only plant genomes but also the genomes of their associated
+ microbes. Such analyses are often data intensive, particularly at the
+ scale required for quantitative analyses, i.e. hundreds to thousands
+ of samples (Regalado et al., 2020). They demand computationally efficient pipelines
+ that perform both host genotyping and host-associated microbiome
+ characterisation in a consistent, flexible, and reproducible
+ fashion.
+ Currently, no such unified pipelines exist. Previous pipelines
+ perform only a subset of these tasks (e.g. Snakemake’s variant calling
+ pipeline; Köster et al., 2021).
+ In addition, most host-aware microbiome analysis pipelines do not
+ allow for genotyping and/or assume an animal host (e.g. Taxprofiler;
+ Yates et al., 2023).
+ Acanthophis has attracted many users, and has been used in
+ peer-reviewed journal articles and preprints (e.g. Murray et al., 2019;
+ Ahrens et al., 2021).
+
+
+ Components and Features
+ Acanthophis is a pipeline for the analysis of plant population
+ resequencing data. It expects short-read shotgun whole (meta-)genome
+ sequencing data, typically of plants collected in the field. Nothing
+ fundamentally prevents Acanthophis from operating on long-read data,
+ but additional tools would need to be incorporated, which we will do
+ given sufficient user demand. A typical dataset might be
+ tens to thousands of samples from one or multiple closely related species,
+ sequenced with 2x150 bp paired-end short-read sequencing. In a
+ plant-microbe interaction genomics study, these plants and therefore
+ sequencing libraries can contain microbial DNA (a “hologenome”), but
+ datasets focusing only on host genome variation are also possible.
+ Acanthophis can be configured to do any of the following analyses:
+ mapping reads to a reference, calling variants, annotating variant
+ effects, estimating genetic distances directly from sequence reads
+ (de novo), and profiling and/or assembling
+ metagenomes. While we developed Acanthophis to handle plant data,
+ there is no reason why it cannot be applied to other taxa, although
+ some parameters may need adjustment (see below). Philosophically,
+ Acanthophis aims for maximum efficiency and flexibility, and therefore
+ does not bake any particular biological question into its outputs. As
+ such, each user should, for example, filter the resulting variant files
+ as appropriate for their biological question(s), and likewise apply
+ other post-processing as needed.
+ Across the entire pipeline, Acanthophis operates on ‘sample sets’,
+ named groups of one or more samples, and each sample can be in any
+ number of sample sets. The pipeline is configured via a global
+ config.yaml file, in which settings can be specified per sample set.
+ This way, one can choose the analyses
+ to be run (most of the analysis stages below can be skipped if not
+ needed), as well as tool-specific settings or thresholds. We provide a
+ documented template as well as a reproducible workflow to simulate
+ test data, which can be used as a basis for customisation. While
+ Acanthophis is cross-platform, most of the underlying tools are only
+ packaged for and/or only operate on GNU/Linux operating systems.
+ Therefore, Acanthophis is only actively supported for users on Linux
+ systems.
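+ The sample-set model described above can be illustrated with a minimal
+ Python sketch. The set names, sample names, and the samples_to_sets
+ helper are hypothetical and are not part of Acanthophis or its
+ configuration schema; the sketch only shows that a sample may belong to
+ any number of named sets.

```python
from collections import defaultdict

# Hypothetical sample sets: named groups of one or more samples.
SAMPLE_SETS = {
    "all_samples": ["S1", "S2", "S3"],
    "site_a": ["S1", "S2"],
    "outgroup": ["S3"],
}


def samples_to_sets(sample_sets):
    """Invert a {set name: [samples]} mapping to {sample: sorted set names}."""
    membership = defaultdict(list)
    for set_name, samples in sample_sets.items():
        for sample in samples:
            membership[sample].append(set_name)
    return {sample: sorted(names) for sample, names in membership.items()}


MEMBERSHIP = samples_to_sets(SAMPLE_SETS)
```

+ Here, sample S1 would be processed under the settings of both
+ all_samples and site_a, since it is a member of both sets.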
+
+ Stage 1: Raw reads to per-sample reads
+ Input data consists of FASTQ files per run of each
+ library corresponding to a sample. For
+ each run of each library, Acanthophis uses
+ AdapterRemoval (Schubert et al., 2016) to remove low-quality and adapter sequences,
+ and optionally to merge overlapping read pairs. It then uses
+ FastQC to summarise sequence QC before and
+ after AdapterRemoval.
+
+
+ Stage 2: Alignment to reference(s)
+ To align reads to reference genomes, Acanthophis can use any of
+ BWA MEM (Li, 2013), NGM (Sedlazeck et al., 2013), and minimap2 (Li, 2018, 2021).
+ Then, Acanthophis merges the BAMs of each run of each library
+ (per-runlib BAMs) into per-sample BAMs, and
+ uses samtools markdup (Danecek et al., 2021; Li et al., 2009) to mark
+ duplicate reads. Input reference genomes should be uncompressed FASTA
+ files indexed with samtools faidx.
+
+
+ Stage 3: Variant Calling
+ Acanthophis uses bcftools mpileup and/or
+ freebayes to call raw variants, using priors
+ and thresholds configurable for each sample set. It then normalises
+ variants with bcftools norm, splits
+ multi-allelic variants, filters each allele with per-sample-set
+ filters, and combines filter-passing biallelic sites back into single
+ multi-allelic sites. Finally, it merges region-level VCFs, then indexes
+ and calculates statistics on these final VCF files. Acanthophis provides
+ two alternative approaches to parallelise variant calling: either a
+ static list of non-overlapping genome windows (supplied in a BED
+ file), or genome bins with approximately equal amounts of data,
+ which are automatically generated using mosdepth (Pedersen & Quinlan, 2018).
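+ The idea of genome bins holding approximately equal amounts of data can
+ be sketched as follows. This is a simplified illustration, not the
+ procedure used by Acanthophis or mosdepth: it assumes that a data
+ amount (e.g. aligned bases) is already known for each small window of
+ one contig, and greedily merges consecutive windows until a target
+ amount is reached. The function name, inputs, and target are
+ hypothetical.

```python
def equal_data_bins(windows, target):
    """Greedily merge consecutive windows into bins of roughly `target` data.

    `windows` is a list of (start, end, amount) tuples on one contig, sorted
    by start and non-overlapping; `amount` is any measure of data volume
    (e.g. aligned bases). Returns a list of (start, end, amount) bins.
    """
    bins = []
    bin_start, bin_amount = None, 0
    last_end = None
    for start, end, amount in windows:
        if bin_start is None:
            bin_start = start
        bin_amount += amount
        last_end = end
        if bin_amount >= target:
            # Close the current bin once it holds enough data.
            bins.append((bin_start, end, bin_amount))
            bin_start, bin_amount = None, 0
    if bin_start is not None:
        # Flush the final, possibly under-filled bin.
        bins.append((bin_start, last_end, bin_amount))
    return bins


# Hypothetical per-window data amounts for a 500 bp contig.
WINDOWS = [(0, 100, 10), (100, 200, 90), (200, 300, 50),
           (300, 400, 40), (400, 500, 20)]
BINS = equal_data_bins(WINDOWS, target=100)
```

+ With these inputs, a sparsely covered stretch yields one long bin and a
+ densely covered stretch yields shorter bins, so each parallel variant
+ calling job receives a similar amount of data.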
+
+
+ Stage 4: Taxon profiling
+ Acanthophis can create taxonomic profiles of each sample with
+ reference to either public sequence databases (e.g. NCBI’s
+ nt or refseq), or
+ user-supplied databases. Acanthophis can utilise any of Kraken 2
+ (Wood et al., 2019), Bracken (Lu et al., 2017), Kaiju (Menzel et al., 2016),
+ Centrifuge (Kim et al., 2016), and Diamond (Buchfink et al., 2021)
+ to create taxonomic profiles for each sample
+ against any number of taxon identification databases; most tools
+ supply pre-computed indices for public databases. Acanthophis can
+ then optionally use taxpasta (Beber et al., 2023) to merge multiple
+ profiles into a single combined table for easy downstream use.
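+ Merging per-sample profiles into one combined table can be pictured
+ with the sketch below. It does not use or reproduce the taxpasta API;
+ the merge_profiles function, the taxon names, and the counts are
+ hypothetical, and profiles are assumed to be simple mappings from taxon
+ to read count.

```python
def merge_profiles(profiles):
    """Combine {sample: {taxon: count}} into a taxon-by-sample table.

    Returns (samples, rows), where each row is (taxon, [count per sample]).
    Taxa missing from a sample's profile are given a count of 0.
    """
    samples = sorted(profiles)
    taxa = sorted({taxon for profile in profiles.values() for taxon in profile})
    rows = [
        (taxon, [profiles[sample].get(taxon, 0) for sample in samples])
        for taxon in taxa
    ]
    return samples, rows


# Hypothetical read counts per taxon for two samples.
PROFILES = {
    "S1": {"Pseudomonas": 120, "Sphingomonas": 30},
    "S2": {"Pseudomonas": 80, "Xanthomonas": 15},
}
SAMPLES, ROWS = merge_profiles(PROFILES)
```

+ Each row then holds one taxon and its counts across all samples in a
+ fixed column order, which is the shape most downstream statistics
+ expect.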
+
+
+ Stage 5: De novo Estimates of Genetic Dissimilarity
+ Acanthophis can use either kWIP (Murray et al., 2017) or Mash
+ (Ondov et al., 2016) to estimate genetic distances between samples
+ without alignment to a reference genome. These tools first summarise
+ the k-mers in each sample's reads as sketches, and then calculate
+ pairwise distances among samples.
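+ The sketch-then-compare approach can be illustrated with a toy
+ MinHash example in the spirit of Mash. This is a simplified sketch, not
+ the Mash or kWIP implementation: it ignores reverse complements, read
+ errors, and k-mer abundance, and the function names and hash choice are
+ assumptions made for illustration. The distance formula is the Mash
+ distance, D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard
+ similarity.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seqs, k, size):
    """Bottom-`size` MinHash sketch: the `size` smallest k-mer hashes."""
    hashes = set()
    for seq in seqs:
        for kmer in kmers(seq, k):
            digest = hashlib.sha1(kmer.encode()).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-`size` sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k):
    """Mash distance: -1/k * ln(2j / (1 + j)); 1.0 if nothing is shared."""
    if jaccard <= 0:
        return 1.0
    return -math.log(2 * jaccard / (1 + jaccard)) / k
```

+ For two samples, one would build a sketch from each sample's reads with
+ the same k and sketch size, estimate j from the two sketches, and
+ convert it to a distance; repeating this for all pairs gives a distance
+ matrix.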
+
+
+ Stage 6: Reporting and Statistics
+ Throughout all pipeline stages, various tools output summaries of
+ their actions and/or outputs. We optionally combine these into
+ unified reports by pipeline stage and sample set using MultiQC
+ (Ewels et al., 2016), allowing plotting of raw sequence QC
+ statistics, alignment QC statistics, variant QC statistics, and
+ summarisation of taxonomic identification analyses.
+
+
+