TL;DR: Here is a script for running the full-board WoL/Woltka workflow, including genome (OGU) and gene profilings, taxonomic classification (using NCBI taxonomy), and functional classification (using UniRef, GO, MetaCyc and optionally KEGG). You need to download the WoL data release, perform sequence alignment, customize the "Parameters" section of the script, then run it.
- wolsop.sh (click to download)
Meanwhile, a basic WoL/Woltka workflow is available from our web-based microbiome study platform: Qiita (https://qiita.ucsd.edu/) (see details).
The "Web of Life" (WoL) project aims to reconstruct an accurate reference phylogeny for microbial genomes, and to build resources that can benefit microbiome researchers. In phase I (Zhu et al., 2019), we built a reference tree of 10,575 bacterial and archaeal genomes using 381 marker genes. The basic WoL data release, which contains all necessary files for performing microbiome data analysis, is available for download from this FTP site:
wget -r -np -nH --cut-dirs=2 ftp://ftp.microbio.me/pub/wol-20April2021/
Alternatively, you may download the database from the following Dropbox site. Specifically, wolr1_seqs
is the genome sequences with which you can build a database for sequence alignment, which will happen prior to running Woltka; wolr1_maps
contains all mapping files which you will need for doing taxonomic, functional and phylogenetic analyses using Woltka.
Meanwhile, the full WoL data release, which also contains raw and processed sequence data, metadata, alternative trees, and pre-built databases for multiple metagenomics tools, are available from our Globus endpoint: WebOfLife (see instruction). The project is detailed at our website: https://biocore.github.io/wol/, including documentation, code, protocols, a gallery and a visualizer.
The following tutorial assumes that you have downloaded the basic or full WoL data release directory. The paths mentioned below are relative to this directory. Specifically, the following directories and files are relevant:
databases/bowtie2/
proteins/coords.txt.xz
taxonomy/
function/
The full WoL data release contains a pre-built Bowtie2 index which can be directly used (and you can skip this section). The basic release only contains the original genome sequences (concat.fna.xz
). You will need to build a Bowtie2 index by yourself (see here for details):
mkdir -p databases/bowtie2
xz -d -k concat.fna.xz
bowtie2-build --threads 8 concat.fna databases/bowtie2/WoLr1
rm concat.fna
With an indexed database, you can align your sequencing data (e.g., paired-end sequences R1.fq
and R2.fq
) against reference genomes using Bowtie2 (see here for details):
bowtie2 -p 8 -x db -1 R1.fq -2 R2.fq -S output.sam --very-sensitive --no-head --no-unal
This step will generate a SAM format alignment file.
Note: You can also run Bowtie2 manually using your choice of parameters, or using other aligners and corresponding databases. Woltka is designed for flexibility.
OGU (operational genomic unit) (Zhu et al., 2022) is a notion we proposed to define the minimum unit of microbiome composition allowed by shotgun metagenomic data. OGUs are reference genomes to which any input sequences have matches. This maximizes the resolution of microbiome composition, and allows for phylogeny-aware analyses using the WoL reference phylogeny. See here for details.
In Woltka, an OGU analysis is simply a classification process without a classification system. In such case, the output features will just be the subjects in the alignment file, namely reference genomes.
woltka classify -i input.sam -o ogu.biom
Woltka accepts a multiplexed alignment file, or a directory of multiple per-sample files, or a mapping of sample IDs to filepaths as input. See here for details. Woltka automatically parses compressed alignment files (such as input.sam.gz
) so you may compress them to save disk space.
The resulting file, ogu.biom
is a feature table describing the composition of the microbiome using OGUs (reference genome IDs, such as G000123456
). This file can be analyzed using standard microbiome data analysis packages such as QIIME 2, in combination with the WoL tree. Here is an example.
With a tree-structured classification system (e.g., taxonomy), Woltka will assign each query sequence to the lowest classification unit that can describe all its subjects, which can be one exact match, or multiple matches that are equally or similarily well aligned. The resulting features are not restricted to a specific taxonomic rank. We refer to it as "free-rank" classification.
Note: In the resulting feature table, features may be nested (e.g., both Escherichia and Eschierichia coli are present in the same table), but their frequencies are exclusive. Therefore, the compositionality of the table is retained. Please note this distinction from outputs of other programs such as MetaPhlAn and Kraken.
The following command performs classification on the original NCBI taxonomy (taxdump):
woltka classify \
--input input.sam \
--map taxonomy/taxid.map \
--nodes taxonomy/nodes.dmp \
--names taxonomy/names.dmp \
--output free.biom
- The
--names
parameter can be omitted if you don't need names or there are no names available. Same below.
Alternatively, one may use lineage strings extracted from NCBI (will lose some resolution, but results are more structured, especially for users familiar with QIIME 2):
woltka classify \
--input input.sam \
--lineage taxonomy/lineages.txt \
--output free.biom
We also provide original and curated NCBI and GTDB taxonomy for choice.
One may specific one or multiple classification ranks in the command, and Woltka will generate one feature table for each rank, in which all features are classification units at this rank. This method is inline with the traditional taxonomic profiling.
woltka classify \
--input input.sam \
--lineage taxonomy/lineages.txt \
--rank phylum,genus,species \
--output .
You will get three output files: phylum.biom
, genus.biom
and species.biom
.
Note: There three files are automatically generated in Qiita's Woltka workflow (see above).
Woltka implements an efficient algorithm which assigns query sequences to open reading frames (ORFs) of the reference genomes based on coordinates matching. See details. This new feature combines taxonomic and functional analyses of the microbiome in one alignment step, ensuring consistency and also improving accuracy (manuscript in preparation).
woltka classify \
--input input.sam \
--coords proteins/coords.txt.xz \
--output gene.biom
The file coords.txt.xz
defines the coordinates of ORFs on reference genomes (again, compressed files are supported). In the output feature table, each feature is an ORF ID (like G000123456_789
, meaning the 789th ORF on genome G000123456).
While informing the genetic composition of the microbiomes, this table itself may not be suitable for community analyses (because it is too sparse). But it serves as the basis for stacking up higher-level functional categories, as explained below.
Woltka provides flexible supports for various classification systems, including ones that inform the functional capacities of organisms. There are two methods for functional classification using Woltka.
Combine gene profiling and one or multiple higher-level functional profilings in one classify
command.
woltka classify \
--input input.sam \
--coords proteins/coords.txt.xz \
--map function/uniref/uniref.map.xz \
--names function/uniref/uniref.name.xz \
--map function/kegg/ko.map.xz \
--names function/kegg/ko.name \
--rank uniref,ko \
--to-tsv \
--output .
In this command, uniref.map
is a mapping of ORFs to UniRef entries. ko.map
is a mapping of UniRef entries to KEGG Ontologies (KOs). The two .name
files provide descriptive names of the UniRef and KO IDs, which will be appended to the tables.
The output files will be uniref.tsv
and ko.tsv
. The optional flag --to-tsv
instructs the program to generate tab-delimited plain text tables instead BIOM format. This applies to taxonomic profiling too. See details.
- It is useful when you (human) want to read the names of the entries. But even if the default BIOM format is used, the names will still be appended as a metadata column (not currently supported by QIIME 2 though) See details.
Starting from the gene profile (gene.biom
), sequentially stack up functional hierarchies, one at a time, using the collapse
command (see details).
The following commands generate profiles describing the metabolic capacities of the microbiome at multiple levels defined in MetaCyc (see details).
Make an abbreviation:
mcdir=function/metacyc
Collapse ORFs into MetaCyc proteins:
woltka tools collapse -i gene.biom -m $mc/protein.map.xz -n $mc/protein_name.txt -o protein.biom
- Here
-i
,-o
,-m
and-n
are abbreviations for--input
,--output
,--map
and--names
.
Collapse proteins into enzymatic reactions:
woltka tools collapse -i protein.biom -m $mc/protein-to-enzrxn.txt -n $mc/enzrxn_name.txt -o enzrxn.biom
Collapse enzymatic reactions into reactions:
woltka tools collapse -i enzrxn.biom -m $mc/enzrxn-to-reaction.txt -n $mc/reaction_name.txt -o reaction.biom
Collapse reactions to metabolic pathways:
woltka tools collapse -i reaction.biom -m $mc/reaction-to-pathway.txt -n $mc/pathway_name.txt -o pathway.biom
So on so forth. See here for a graph of all available collapsing directions.
Important: The differences between method 1 (classify
) and method 2 (collapse
) are:
classify
only supports a tree structure, in which one child unit has exactly one parent unit. This is typical in taxonomic classification. If multiple parents are present, all but the first parent will be discarded. In contrast, collapse
supports one-to-multiple mappings, therefore it is more suitable when this is the norm instead of exception, especially in functional classification (where one gene can be involved in multiple metabolic pathways).
classify
always ensures the compositionality of the feature table, in which the frequencies match the numbers of aligned sequences. collapse
however does not by default: In a one-to-multiple mapping, all parents will be counted once. But one can add -d
to the collapse
command to divide the counts by the number of parents so that the compositionality is retained. See here for a detailed discussion.
However, if the mappings are unique (one-to-one), the two methods produce mutually identical results (minor differences may arise during number rounding), and the concern of compositionality is no longer relevant.
Here is a list of currently released primary and secondary mappings and their uniqueness:
Source | Target | Unique |
---|---|---|
WoL protein | UniRef | Yes |
WoL protein | MetaCyc protein | Yes |
UniRef | KEGG ontology (KO) | No |
UniRef | Gene Ontology (GO) | No |
UniRef | OrthoDB | Yes |
UniRef | EggNOG | No |
UniRef | RefSeq | No |
Meanwhile, all name files (such as gene_name.txt
under MetaCyc) are unique.
Woltka allows combining taxonomic and functional classifications to generate stratified results. For example, you want to get the composition of KEGG ontologies of each genus:
First, run taxonomic classification at the genus level, and export read-to-genus maps:
woltka classify \
--input input.sam \
--map taxonomy/taxid.map \
--nodes taxonomy/nodes.dmp \
--names taxonomy/names.dmp \
--rank genus \
--name-as-id \
--output genus.biom \
--outmap mapdir
This command generates mappings of sequences to genera, in addition to a genus-level feature table. This information will guide the subsequent stratification.
Second, run functional annotation, with the read-to-genus maps incorporated:
woltka classify \
--input input.sam \
--coords proteins/coords.txt.xz \
--map function/uniref/uniref.map.xz \
--map function/kegg/ko.map.xz \
--rank ko \
--stratify mapdir \
--output ko_by_genus.biom
In the output table, features will be like Escherichia|K00133
, Salmonella|K00604
, etc. See here for mode details about stratification.
With a stratified taxonomic/functional profile, one may still perform further collapsing using one of the two systems. For example:
woltka tools collapse -f 2 -i ko_by_genus.biom -m ko/ko-to-go.txt -o go_by_genus.biom
The parameter -f 2
instructs the program to collapse by the second field of each feature (e.g., K00133
of Escherichia|K00133
). See here for details.