PICRUSt2‐MPGA database

Robyn Wright edited this page Jan 30, 2025 · 11 revisions

For a while now, I have been working on updating the default database used with PICRUSt2. PICRUSt2-v2.6.0 includes this new database (PICRUSt2-MPGA). It is currently available only as a branch, so it can only be installed directly from GitHub. We will be merging it into the main branch soon. When this happens, the previous database, which uses IMG genomes, will still be present within PICRUSt2 and can still be used for functional predictions. You can find full details on the new database in our preprint here.

This page describes how to install and run this new database, the improvements it brings, and how it was constructed. Aside from containing different genomes, one major change is the tree structure: the previous default PICRUSt2 database contained a single phylogenetic tree covering both bacteria and archaea, whereas the updated database contains a separate tree for each of bacteria and archaea. This means that some of the steps need to be run more than once and the outputs of these separate runs combined. All steps can be run together with the picrust2_pipeline.py script. If you want to run the previous database, PICRUSt2-oldIMG, you can do this using the picrust2_pipeline_oldIMG.py script.

The PICRUSt2-MPGA database

This database uses Genome Taxonomy Database (GTDB) r214 genomes. GTDB r214 contained 402,709 genomes in 85,205 species clusters. We annotated a genome for each of the 85,205 species clusters using eggNOG v2.1.2, and 27,870 of these (26,868 bacteria and 1,002 archaea) met the quality criteria for inclusion. This is almost a 1.4-fold increase in the number of genomes over the previous PICRUSt2 database (19,493 bacteria and 406 archaea), with the number of archaeal genomes more than doubling.

Information on all of the included genomes can be found in the *_metadata.csv.gz files within default_files/bacteria and default_files/archaea. We use the trees that are released with GTDB for sequence insertion. The database now contains BiGG reaction, CAZy, EC number, gene name, GO, KO and Pfam annotations. This gives ~1.3-fold more KOs and EC numbers than the previous database.

We verified the performance of the new PICRUSt2-MPGA database using simulated samples constructed from genomes that were not present in the updated database. The median weighted Nearest Sequenced Taxon Index (NSTI) was lower for all datasets with the PICRUSt2-MPGA database than with the PICRUSt2-oldIMG database (average 0.069 vs 0.099), with the largest improvements seen in the Blueberry soil, Cameroon and Primate datasets. The median Spearman's correlation coefficients are higher (0.802 vs 0.757) and Bray-Curtis dissimilarity indices are lower (0.291 vs 0.341) for the PICRUSt2-MPGA vs the PICRUSt2-oldIMG database. More details are given in our preprint.
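To make the two comparison metrics above concrete, here is a minimal Python sketch of Bray-Curtis dissimilarity and Spearman's correlation between an expected and a predicted abundance profile. It is not the evaluation code used for the database. The function names and the example abundances are made up for illustration, and only the standard library is used.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both profiles are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical KO abundances for one sample (illustrative values only).
expected = [10, 0, 5, 20, 3]
predicted = [8, 1, 5, 25, 0]

print("Bray-Curtis:", round(bray_curtis(expected, predicted), 3))
print("Spearman:", round(spearman(expected, predicted), 3))
```

A lower Bray-Curtis value and a higher Spearman value both indicate that the predicted profile is closer to the expected one, which is the direction of improvement reported for PICRUSt2-MPGA.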

All commands used for constructing the database are here.

Figure 1. Comparison of the PICRUSt2-oldIMG and updated PICRUSt2-MPGA databases showing: (a) the steps in the construction of the PICRUSt2-MPGA database; (b) the number of functions annotated within different frameworks for the PICRUSt2-oldIMG and PICRUSt2-MPGA databases (note that not all frameworks were included in both databases); (c) the number of taxa included for each step of the database construction (top) and for each phylogenetic rank (bottom) for bacteria and archaea; (d) composition at the class level for the simulated samples (the mean relative abundance is shown for each dataset); and (e) the performance of the default and updated PICRUSt2 databases on the simulated samples from each dataset and overall (bottom). Individual points are shown for each sample, with points coloured pink for the default database and yellow for the updated database. Spearman’s correlation coefficients and Bray-Curtis dissimilarity indices shown are for KOs. Boxplots represent the median and the upper and lower quartiles, whiskers show the range of the data (1.5 times the interquartile range), and values in boxes are medians.

Installing the new PICRUSt2-MPGA database

The latest version of PICRUSt2 now includes the PICRUSt2-MPGA database and can be installed following the instructions on the installation page.

The only additional dependency needed by PICRUSt2 now is ete3.
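If you want to confirm that this dependency is present in your environment before running anything, a quick check along these lines may help. This is only a sketch using the Python standard library. The helper name has_module is made up for illustration and is not part of PICRUSt2.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if has_module("ete3"):
    print("ete3 is available")
else:
    print("ete3 is missing; install it before running PICRUSt2-MPGA")
```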

Using the new PICRUSt2-MPGA database

The full pipeline can be run using the picrust2_pipeline.py script like so:

picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1

More details on that script are here, and details on running each of the individual steps separately are on the Workflow page.

Using the previous PICRUSt2-oldIMG database

The full pipeline with the previous PICRUSt2-oldIMG database can still be run using the picrust2_pipeline_oldIMG.py script like this:

picrust2_pipeline_oldIMG.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1 -db oldIMG

More details are here and details on running each of the steps individually are on the Workflow page.
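If you call the pipeline from a Python script, the two commands above can be assembled as argument lists. The sketch below only reuses the script names and flags exactly as they appear in the commands on this page. The helper build_pipeline_command and its parameter names are hypothetical, and nothing is executed unless you pass the list to subprocess yourself.

```python
import shlex


def build_pipeline_command(seqs, table, out_dir, threads=1, old_img=False):
    """Build the argument list for the MPGA or oldIMG pipeline command."""
    script = "picrust2_pipeline_oldIMG.py" if old_img else "picrust2_pipeline.py"
    cmd = [script, "-s", seqs, "-i", table, "-o", out_dir, "-p", str(threads)]
    if old_img:
        # Mirrors the oldIMG command shown above, which also passes "-db oldIMG".
        cmd += ["-db", "oldIMG"]
    return cmd


for old in (False, True):
    cmd = build_pipeline_command(
        "study_seqs.fna", "study_seqs.biom", "picrust2_out_pipeline", old_img=old
    )
    # Print the command; to run it, use subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```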

Changes to the steps run by PICRUSt2

Running PICRUSt2 step by step now requires a few extra steps. The steps for running the previous default database are (the links will take you to a page detailing each of the steps):

  1. Place sequences into reference tree (details)
  2. Run hidden-state prediction for 16S copy numbers, KOs and EC numbers (details)
  3. Predict KOs and EC number abundances in metagenome (details)
  4. Predict pathway abundances and coverage (details)

Because there are now two phylogenetic trees and two sets of functional trait tables, the steps for PICRUSt2 are now:

  1. Place sequences into bacterial and archaeal reference trees
    1. Place sequences into reference bacterial tree
    2. Place sequences into reference archaeal tree
  2. Run hidden-state prediction for 16S copy numbers, KOs and EC numbers
    1. Run hidden-state prediction for 16S copy numbers for bacteria
    2. Run hidden-state prediction for 16S copy numbers for archaea
    3. Determine the best domain for each sequence (lowest NSTI) (details)
    4. Run hidden-state prediction for KOs and EC numbers separately for bacteria and archaea, with only the sequences that fit best with each domain
    5. Combine bacterial and archaeal predictions for each of KOs and EC numbers (details)
  3. Predict KOs and EC number abundances in metagenome (details)
  4. Predict pathway abundances and coverage
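
The logic behind steps 2.3 and 2.5 above (choosing the best domain for each sequence by lowest NSTI, then combining the bacterial and archaeal predictions) can be illustrated with a small Python sketch. This is not the PICRUSt2 implementation. The function names, the tie-breaking rule, and all values are assumptions made for illustration.

```python
def assign_best_domain(nsti_bac, nsti_arc):
    """Assign each sequence to the domain where its NSTI is lowest."""
    best = {}
    for seq_id in sorted(set(nsti_bac) | set(nsti_arc)):
        # A sequence missing from one domain is treated as infinitely distant.
        bac = nsti_bac.get(seq_id, float("inf"))
        arc = nsti_arc.get(seq_id, float("inf"))
        # Ties go to bacteria here; this tie-break is an arbitrary assumption.
        best[seq_id] = "bacteria" if bac <= arc else "archaea"
    return best


def combine_predictions(pred_bac, pred_arc, best):
    """Keep each sequence's predictions from its assigned domain only."""
    combined = {}
    for seq_id, domain in best.items():
        source = pred_bac if domain == "bacteria" else pred_arc
        combined[seq_id] = dict(source.get(seq_id, {}))
    return combined


# Hypothetical NSTI values from placing each sequence in both trees.
nsti_bac = {"ASV1": 0.02, "ASV2": 0.90, "ASV3": 0.10}
nsti_arc = {"ASV1": 0.80, "ASV2": 0.05, "ASV3": 0.70}

# Hypothetical KO copy-number predictions from each domain's run.
pred_bac = {"ASV1": {"K00001": 1, "K00002": 2}, "ASV3": {"K00001": 3}}
pred_arc = {"ASV2": {"K00003": 1}}

best = assign_best_domain(nsti_bac, nsti_arc)
combined = combine_predictions(pred_bac, pred_arc, best)
print(best)
print(combined)
```

In this toy example, each sequence ends up with exactly one set of predictions, taken from the tree it fits best.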