Major Bug Reports and Announcements

Gavin Douglas edited this page Apr 29, 2022 · 7 revisions

Below are major bug reports and announcements cross-posted from the Google group in the hope that users won't miss them.

Bug report - norm_taxon_function_contrib column in stratified output malformed

Original post
Date: 2022-03-23

I recently realized that there was a major bug affecting the content of the "norm_taxon_function_contrib" column in the long-format (default) stratified output tables. This was due to an annoying pandas mistake that I made, which was not picked up by the small test tables in the unit tests. Although this bug will not affect how the vast majority of users use the PICRUSt2 output, it could definitely have caused issues or confusion for more advanced users.

The bug is actually quite obvious when you look at the column by eye, for instance as described here: https://github.com/picrust/picrust2/issues/206. Essentially, although the column is supposed to contain proportions, many values were > 1 or missing entirely. I apologize for not catching this, and for not appreciating that it was a bug, earlier.

It is now fixed in v2.5.0; the changelog is available here: https://github.com/picrust/picrust2/releases/tag/v2.5.0
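If you want to check whether a table you generated earlier shows the symptom described here, a minimal sketch along the following lines may help. It assumes the stratified table is tab-delimited with a header row; the function name and the tiny example table are made up for illustration, and only the "norm_taxon_function_contrib" column name comes from the report itself.

```python
import csv
import io


def check_contrib_column(handle, column="norm_taxon_function_contrib"):
    """Count rows whose value in `column` is missing or outside [0, 1]."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {"rows": 0, "missing": 0, "out_of_range": 0}
    for row in reader:
        counts["rows"] += 1
        raw = (row.get(column) or "").strip()
        # Treat empty cells and common NA spellings as missing.
        if raw == "" or raw.lower() in ("nan", "na"):
            counts["missing"] += 1
            continue
        try:
            value = float(raw)
        except ValueError:
            # Anything that is not a number is also counted as missing.
            counts["missing"] += 1
            continue
        # A proportion should never fall outside the range 0 to 1.
        if value < 0.0 or value > 1.0:
            counts["out_of_range"] += 1
    return counts


# Tiny made-up table for illustration only (not real PICRUSt2 output).
example = io.StringIO(
    "sample\tfunction\ttaxon\tnorm_taxon_function_contrib\n"
    "s1\tK00001\tasv1\t0.25\n"
    "s1\tK00001\tasv2\t1.75\n"
    "s1\tK00002\tasv1\t\n"
)
print(check_contrib_column(example))
```

For a gzip-compressed table, the same function should work on a handle opened with `gzip.open(path, "rt")`. Any non-zero "missing" or "out_of_range" count suggests the table was produced by an affected version and is worth regenerating.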

My apologies to anyone whose work may have been affected by this problem.




Updated bioRxiv pre-print and major caveats highlighted

Original post
Date: 2020-03-23

An update to the PICRUSt2 pre-print on bioRxiv is now live.

The major change in this revised version is that we now include a comparison of differential abundance results based on metagenome predictions against those based on shotgun metagenomics data. This is a complementary approach to the standard correlation analyses we have previously used to evaluate metagenome prediction performance. We became aware of this approach through the most recent Piphillin paper.

This new validation approach helps clarify how much interpretations based on predicted metagenomes can differ from those based on actual shotgun metagenomics data. It turns out that the interpretations can often differ substantially. For instance, the precision of calling KEGG orthologs as differentially abundant was ~0.5 across four test datasets, which means that only ~50% of KEGG orthologs called as significant in the predicted metagenome data (including by PICRUSt2) were also called as significantly different based on shotgun metagenomics data. This result highlights that standard analyses of metagenome prediction data should be interpreted very cautiously and that little weight should be given to predictions for any single function (especially without any other corroborating data).
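To make the precision figure concrete, here is a minimal sketch of how such a value can be computed from two sets of significant KEGG orthologs. The function name and the ortholog IDs are hypothetical and chosen only for illustration; this is not the code used in the pre-print.

```python
def precision(predicted_significant, reference_significant):
    """Fraction of predicted hits that are also hits in the reference."""
    predicted = set(predicted_significant)
    if not predicted:
        # Precision is undefined when nothing was called significant.
        return float("nan")
    true_positives = predicted & set(reference_significant)
    return len(true_positives) / len(predicted)


# Hypothetical KEGG ortholog IDs, for illustration only.
predicted_hits = {"K00001", "K00002", "K00003", "K00004"}
shotgun_hits = {"K00001", "K00002", "K00010"}

# Two of the four predicted hits are confirmed by the shotgun data.
print(precision(predicted_hits, shotgun_hits))
```

In this toy case the precision is 0.5: half of the orthologs called significant in the predicted data are not supported by the shotgun data, which is the situation described for the four test datasets.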

However, it is also important to clarify that a different shotgun metagenomics workflow (mapping directly to the KEGG database rather than to UniRef) yielded only slightly better concordance with the default metagenomics output in differential abundance testing. In addition, different differential abundance statistical tests resulted in substantially different sets of significant KEGG orthologs. These results highlight the difficulty in identifying reproducible functional biomarkers with actual shotgun metagenomics data as well, which is another important challenge to acknowledge.

Lastly, we ran similar validations based on the metagenome-wide MetaCyc pathway predictions and the concordance with shotgun metagenomics data appears to be lower than at the gene family level. This analysis is more difficult to interpret and we think this finding might be related to the difficulty of reliably computing metagenome-wide pathways (i.e. pathways assuming that there is universal cross-feeding) in general. No matter the reason for the apparent poor performance based on differential abundance for the pathway predictions, it is important to acknowledge this limitation and to keep this in mind whenever this datatype is analyzed.

For both the gene family and pathway-level predictions it is important to realize that identifying significant differences in metagenome predictions between sample groupings is not evidence that the predictions are working well. If there are any differences in the relative abundances of clustered 16S rRNA gene sequences (i.e. ASVs/OTUs) then some degree of significant difference at the functional level is also expected, even if the genome predictions are randomly assigned to each clustered 16S rRNA gene sequence.
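This last point can be illustrated with a toy sketch. Everything in it is hypothetical: two made-up ASVs, gene copy numbers drawn at random, and one "sample" per group. It does not run a statistical test; it only shows that predicted function abundances inherit group differences in ASV abundance even when the assigned genomes are meaningless.

```python
import random


def predicted_function_abundance(asv_abundance, gene_copies):
    """Sum ASV abundance x gene copy number for every function."""
    n_functions = len(next(iter(gene_copies.values())))
    totals = [0] * n_functions
    for asv, abundance in asv_abundance.items():
        for i, copies in enumerate(gene_copies[asv]):
            totals[i] += abundance * copies
    return totals


def count_differing_functions(seed=0, n_functions=20):
    """Count functions whose predicted abundance differs between groups."""
    rng = random.Random(seed)
    # Randomly assigned "genomes": copy numbers have no biological meaning.
    gene_copies = {
        asv: [rng.randint(0, 3) for _ in range(n_functions)]
        for asv in ("asv_a", "asv_b")
    }
    # The two groups differ only in which ASV dominates.
    group_1 = predicted_function_abundance({"asv_a": 90, "asv_b": 10}, gene_copies)
    group_2 = predicted_function_abundance({"asv_a": 10, "asv_b": 90}, gene_copies)
    return sum(1 for x, y in zip(group_1, group_2) if x != y)


print(count_differing_functions(), "of 20 functions differ between groups")
```

A function differs between the two groups whenever the two random genomes happen to carry different copy numbers for it, so differences at the functional level appear without the predictions containing any real information.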


Bug report: Very poorly aligning input sequences not treated correctly prior to v2.3.0-b

Original post
Date: 2020-02-07

It's come to my attention that input sequences for PICRUSt2 that do not align at all to the reference sequences can nonetheless be placed very close to the tips of the reference tree (and so would have low nearest-sequenced taxon index [NSTI] values). This was resolved in v2.3.0-b, but I just realized that if anyone had input a dataset of sequences that all matched poorly to the reference sequences then the pipeline would have finished running anyway.

This appears to only happen for sequences that don't match the 16S rRNA gene at all (e.g. 18S and ITS input sequences are apparently similar enough that this doesn't occur, and they have high NSTI values as expected). However, if for whatever reason your input sequences do not match the reference sequences at all then unfortunately predictions would still be output because these sequences will be placed with low NSTI values. I'm not sure why these sequences are being placed close to the tips during the placement steps, but either way they are now screened out in v2.3.0-b based on a preliminary similarity search (the "--min_align" option for the place_seqs.py script).
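The general idea of screening on alignment quality can be sketched as follows. This is only an illustration and not the PICRUSt2 implementation: the function name, the input dictionaries, and the 0.8 threshold used here are all assumptions made for the example.

```python
def screen_by_aligned_fraction(aligned_lengths, sequence_lengths, min_align=0.8):
    """Split sequence IDs into (kept, dropped) by aligned fraction."""
    kept, dropped = [], []
    for seq_id, seq_len in sequence_lengths.items():
        # Sequences with no recorded alignment count as zero aligned bases.
        aligned = aligned_lengths.get(seq_id, 0)
        fraction = aligned / seq_len if seq_len else 0.0
        if fraction >= min_align:
            kept.append(seq_id)
        else:
            dropped.append(seq_id)
    return kept, dropped


# Hypothetical sequence lengths and aligned lengths, for illustration only.
lengths = {"asv1": 250, "asv2": 250, "junk": 250}
aligned = {"asv1": 248, "asv2": 210}  # "junk" did not align at all

kept, dropped = screen_by_aligned_fraction(aligned, lengths)
print("kept:", kept)
print("dropped:", dropped)
```

Here "junk" is dropped before placement, so it can never receive a misleadingly low NSTI value later on.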

This bug shouldn't affect most datasets, but if you ran PICRUSt2 prior to v2.3.0-b with input sequences from a non-standard pipeline (or with sequencing data corresponding to the negative strand of the 16S rRNA gene) then you should re-run the place_seqs.py step to make sure this bug didn't affect your results.

Sorry for the inconvenience,

Gavin
