TL;DR genotype genetic variants using genome graphs.
gramtools
builds a genome graph from a reference genome and a set of variants (VCF files or MSAs files).
Given sequence data from an individual, the graph is annotated with coverage and genotyped, producing a VCF file, and a jVCF file, of all the variation in the graph.
A personalised reference genome for the sample is also inferred and new variation can be discovered against it. You can then build a new graph from the initial and the new variants, and genotype this augmented graph.
Check out usage for details, and limitations for limitations and recommendations.
The easiest way to run gramtools
is via a container (hosted on quay.io).
To run with Docker or Singularity:
tag="latest" # or, a specific released version
# Run with docker
docker run "quay.io/iqballab/gramtools:${tag}"
# Or run with singularity
URI="docker://quay.io/iqballab/gramtools:${tag}"
singularity exec "$URI" gramtools
We recommend installing inside a virtual environment:
python -m venv venv_gramtools && source venv_gramtools/bin/activate
VERSION="1.10.0"
wget -O - "https://github.com/iqbal-lab-org/gramtools/releases/download/v${VERSION}/gramtools-${VERSION}.tar.gz" | tar xfz -
pip install ./gramtools-"${VERSION}"
gramtools contains a compiled backend component. The latest release includes a
precompiled backend for Linux. Run gramtools
to see if it is compatible with your
machine. If gramtools tells you it is not, run the following to compile the backend on
your machine:
pip install conan
bash ./gramtools-"${VERSION}"/build.sh
Please report any errors via the issue tracker.
This will always compile the binary.
git clone https://github.com/iqbal-lab-org/gramtools
pip install conan
bash ./gramtools/build.sh
pip install ./gramtools
Python >= 3.6
and pip >= 20.0.2
.
If the backend needs to be compiled, you also need CMake >= 3.1.2
and a C++17
compatible compiler: g++ >=8
(tested) or clang >=7
(untested).
For gramtools discover
to work, you additionally need to install py-cortex-api
(it
is installable via PyPi/pip) and have R
and Perl
on your $PATH.
$ gramtools -h
Usage:
gramtools [-h] [--debug] [--force] subcommand
Subcommands:
gramtools build -o GRAM_DIR --ref REFERENCE
(--vcf VCF [VCF ...] | --prgs_bed PRGS_BED | --prg PRG)
[--kmer_size KMER_SIZE] [--max_threads MAX_THREADS]
gramtools genotype -i GRAM_DIR -o GENO_DIR
--reads READS [READS ...] --sample_id SAMPLE_ID
[--ploidy {haploid,diploid}]
[--max_threads MAX_THREADS] [--seed SEED]
gramtools discover -i GENO_DIR -o DISCO_DIR
[--reads READS [READS ...]]
gramtools simulate --prg PRG
[--max_num_paths MAX_NUM_PATHS]
[--sample_id SAMPLE_ID] [--output_dir OUTPUT_DIR]
-
build - given a reference genome and VCF files or MSA files, builds an indexed graph.
For multiple-sequence alignment (MSA) files, graphs are built by our tool
make_prg
. You can provide the MSA files, or prg files you made usingmake_prg
, using option--prgs_bed
. To genotype complex regions (e.g. SNPs + SVs, or variants on multiple references), you must use this option.--prgs_bed
: builds a genome graphs of a set of regions with the rest of the linear reference genome in-between. The bed file specifies each region to build (columns 1-3) and the name of a MSA or prg file correspoding to the region in column 4. (see bed format specification) Important: currently the reference genome sequence must be the first entry in each MSA (also when usingmake_prg
yourself)--kmer_size
: used for indexing the graph in preparation forgenotype
. higherk
<=> fastergenotype
, butbuild
output will consume more disk space.--max_threads
: use this to build multiple regions in parallel when using--prgs_bed
-
genotype - map reads to a graph generated in
build
and genotype the graph. Produces genotype calls (VCF) and a personalised reference genome (fasta).--reads
: 1+ reads files in (fasta/fastq/sam/bam/cram) format--sample_id
: displayed in VCF & personalised reference outputs
-
discover - discovers new variation against the personalised reference genome from
genotype
using one or more variant callers (currently: cortex). -
simulate- samples paths through a prg, producing a fasta of the paths and a genotyped JSON of the variant bubbles the path went through.
--prg
: a prg file as output bybuild
- gramtools is primarily a genotyper, not a variant caller. Variant discovery
with the
discover
command relies on existing tools - gramtools is designed to use short, low-error rate reads only (e.g. Illumina). We recommend trimming adapters off reads (e.g. using trimmomatic) before genotyping with gramtools.
- gramtools currently performs exact matching of reads only, so relatively high read coverage (e.g. >20X) is recommended
- gramtools does not currently genotype copy-number variants (CNVs) or samples with mixed ploidy
Examples, documentation, and planned future enhancements can be found in the wiki.
For the C++ source code, doxygen formatted documentation can be generated by running
doxygen doc/Doxyfile.in
from inside the gramtools directory.
The documentation gets generated in doc/html/index.html and provides a useful reference for all files, classes, functions and data structures in gramtools.
Contributions are always welcome! Please refer to CONTRIBUTING.md. For developer tips, head to the wiki.
MIT
If you use gramtools in your work, cite as:
Letcher, B., Hunt, M. & Iqbal, Z. Gramtools enables multiscale variation analysis with genome graphs. Genome Biology 22, 259 (2021). https://doi.org/10.1186/s13059-021-02474-0
To cite the vBWT data structure:
Maciuca, S., del Ojo Elias, C., McVean, G., Iqbal, Z. A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference. WABI2016: Algorithms in Bioinformatics. Lecture Notes in Computer Science, vol 9838. Springer, Cham. https://doi.org/10.1007/978-3-319-43681-4_18
To cite make_prg
graph construction:
Colquhoun, R.M., Hall, M.B., Lima, L. et al. Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs. Genome Biology 22, 267 (2021). https://doi.org/10.1186/s13059-021-02473-1