"Coord-match" functional profiling

Woltka combines the two fundamental analyses in metagenomics: Structural profiling (mapping reads to genomes) and functional profiling (mapping reads to functional genes) into one run. This saves compute, ensures consistency, and allows for stratification which is essential for understanding functional diversity across the microbial community.

This is achieved by an efficient algorithm implemented in Woltka, which matches read alignments and annotated genes based on their coordinates on the host genome. The algorithm matches reads and genes in an ordinal scale, with computational efficiency close to linear (O(n+m), in which n and m are numbers of reads and genes), therefore suitable for handling large microbiome datasets and databases.

Overview

Running "coord-match" functional profiling is nearly identical to that of structural profiling, plus one additional parameter: --coords, which points to the coordinates of genes on host genomes, and enables the program to match reads with genes. In addition to this step, one can provide a hierarchical classification system (details). It won't be taxonomy or phylogeny, but it can be a functional catalog such as UniRef, GO, KEGG or MetaCyc.

A typical command is like (see more examples):

woltka classify \
  --input  indir \
  --coords coords.txt \
  --map    gene2func.map \
  --rank   func \
  --output func.biom

Gene coordinates

The gene coordinates (ID, start and end) are provided in a file, in either of the following two formats:

1. Native NCBI annotations and accessions:

## GCF_000005825.2
# NC_013791.2
WP_012957018.1  816 2168
WP_012957019.1  2348    3490
WP_012957020.1  3744    3959
WP_012957021.1  3971    5086
...

2. WoL reannotations (available for download), in which nucleotide sequences (chromosomes, scaffolds, or contigs) of each genome are concatenated, and a numeric index is used to identify each gene. Woltka will automatically prefix these numbers with the genome ID, such as G000006745_1.

>G000006745
1       806     372
2       2177    816
3       3896    2271
4       4446    4123
5       4629    4492

The three columns are IDs, starting and ending coordinates. The coordinates are 1-based and inclusive. For example, ID <tab> 1 <tab> 100 represents the first 100 bp of a genome. The order of start/end coordinates (i.e., strand) is arbitrary. Compressed files are supported.

With this file (and without any annotation), one can perform read-gene matching by this simple command:

woltka classify -i indir -c coords.txt -o gene.biom

Matching threshold

Woltka considers a read-to-gene match if their overlapping region reaches a percentage threshold of the entire read. This is controlled by the parameter --overlap. The default value is 80 (%).

For example, to increase discovery rate, one can do:

woltka classify -i indir -c coords.txt --overlap 50 -o gene.biom

Note however, this doesn't necessarily mean higher sensitivity and lower precision. This is because reads were aligned to the genome, NOT the genes. The alignment quality was already controlled during the alignment step. This number only stands for how much of a read happen sits inside a gene vs. the intergenic region.

Unit conversion

In a Woltka generated profile, the cell values are frequencies by default. This is a direct measurement of how much "DNA mass" are assigned to each feature (gene), and this is usually relevant in modeling microbial communities. However, other research fields may have different standards in reporting the numbers. For example, in a functional profile generated by HUMAnN, the cell values are in the unit of RPK (reads per kilobase). This unit measures how many copies of the gene are present in a sample.

To have Woltka report RPK instead of raw counts, do the following. This will divide frequencies by gene length, multiply by 1000, and keep 3 digits after the decimal point, which makes it RPK.

woltka classify -i indir -c coords.txt --sizes . --scale 1k --digits 3 -o rpk.biom

One can further convert RPK into TPM (transcripts per kilobase million) by the following command, which divides individual RPK values by the total RPK per sample, and multiplies them by 1 million.

woltka normalize -i rpk.biom --scale 1M -o tpm.biom

Note: This TPM is equivalent to HUMAnN's CPM (copies per million; not to be confused with counts per million, see HUMAnN's documentation for details). However see an important note here.

See here for more details about data normalization.

Functional catalogs

The classification system used for functional profiling can be provided in the same format(s) as the ones for structural profiling (detailed here). One thing to note is that the subject IDs should be gene IDs instead of genome ID. For examples, WP_012957018.1 or G000006745_1.

In a typical structural analysis, the classification system usually stands as a whole, hierarchical system. Examples are taxonomy and phylogeny. However, in a functional analysis, the classification system is usually a set of mapping files that "stack" up the functional units level by level.

For example, if you use the MetaCyc functional catalog, the order of "stacking" is:

ORF -> protein (family) -> enzrxn (enzymatic reaction) -> reaction -> pathway -> super pathway

You will need to provide one mapping file per level to translate lower-level units into higher-level ones. In this example, an all-in-one command is like:

woltka classify \
  --input  indir \
  --coords coords.txt \
  --map    gene-to-protein.map \
  --map    protein-to-enzrxn.map \
  --map    enzrxn-to-reaction.map \
  --map    reaction-to-pathway.map \
  --map    pathway-to-super.map \
  --rank   gene,protein,enzrxn,reaction,pathway,super \
  --output outdir

This will generate profiles at different levels all at once.

An alternative way is to only generate the gene-level profile using the classify workflow. Later, one can collapse it using different functional catalogs as needed, using Woltka's collapse command. For example:

woltka classify -i indir -c coords.txt -o gene.biom
woltka collapse -i gene.biom     -m gene-to-protein.map     -o protein.biom
woltka collapse -i protein.biom  -m protein-to-enzrxn.map   -o enzrxn.biom
woltka collapse -i enzrxn.biom   -m enzrxn-to-reaction.map  -o reaction.biom
woltka collapse -i reaction.biom -m reaction-to-pathway.map -o pathway.biom
woltka collapse -i pathway.biom  -m pathway-to-super.map    -o super.biom

In the WoL data release, there are pre-built mappings to UniRef, GO, MetaCyc, KEGG and more.

Note: For some databases, such as MetaCyc, you might encounter an error regarding AssertionError: Conflicting values found for .... This is likely because some classification units were simultaneously assigned to multiple ranks in the database, which causes conflicts in the Woltka workflow. In this case we recommend generating the gene-level profile with woltka classify, and then collapsing to individual levels one at a time with woltka collapse.

Pathway coverage

It is a frequent goal to assess the abundances of individual metabolic pathways in the microbiome. However, a pathway only makes sense when all (or an essential subset) of its member enzymes are present. Woltka can calculate the percent coverage of pathways based on the presence/absence of member genes as follows:

woltka coverage -i gene.biom -m pathway-to-genes.txt -o pathway_coverage.biom

See here for details of coverage calculation.

Stratification

It is an attractive idea to figure out which genes are present in which organisms. This can be achieved using an stratification analysis that combines both structural and function analyses. In Woltka, this can be done using two steps, like:

woltka classify -i indir -m genome-to-species.map -o species.biom --outmap species_map
woltka classify -i indir -c coords.txt -m gene-to-pathway.map --stratify species_map -o pathway_by_species.biom

In the output profile, the features will be like species_A|pathway_X. This helps with the investigation of the functional capacities of individual microbial groups.

Alternatively, you can start with just the gene table, in which the gene IDs indicate their source genomes (like "G000123456_789"), and perform collapsing on the genome AND the gene respectively.

woltka classify -i indir -c coords.txt -o gene.biom
woltka collapse -i gene.biom -f 1 -e -m genome-to-species.map -o gene_by_species.biom
woltka collapse -i gene_by_species.biom -f 2 -m gene-to-pathway.map -o pathway_by_species.biom

In this way, you only need to run the classification workflow once, and do not need to save the bulky outmap files to the disk.

See here for details of stratification.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ordinal.md

ordinal.md

"Coord-match" functional profiling

Contents

Overview

Gene coordinates

Matching threshold

Unit conversion

Functional catalogs

Pathway coverage

Stratification

Files

ordinal.md

Latest commit

History

ordinal.md

File metadata and controls

"Coord-match" functional profiling

Contents

Overview

Gene coordinates

Matching threshold

Unit conversion

Functional catalogs

Pathway coverage

Stratification