Qadabra is a Snakemake workflow for running and comparing several differential abundance (DA) methods (tools) on the same microbiome dataset.
In this tutorial, we will run Qadabra on a small dataset to guide you through its usage. The dataset consists of skin microbiome samples swabbed from individuals with acne and sequenced by shotgun metagenomics. In the underlying clinical trial, a topical skin cream containing a strain of Staphylococcus capitis that produces a potent antimicrobial peptide against Cutibacterium acnes (the bacterium most associated with acne) was applied to research volunteers to see whether their acne improved. This tutorial uses Qadabra to investigate how the participants' skin microbiomes changed, by running a differential abundance analysis of microbes pre- and post-treatment.
If you are interested in learning more about the skin microbiome in dermatological diseases, check out this review paper.
We recommend installing mamba to manage your Qadabra environment. (Note that mamba can interfere with existing conda environments; check this StackOverflow thread if planning to run both mamba and conda side by side.) Once mamba is installed, create and activate your Qadabra environment:
mamba create -n qadabra_env python=3.9
mamba activate qadabra_env
mamba install snakemake numpy cython
Install Qadabra and its additional dependencies using pip:
pip install qadabra click biom-format pandas iow
qadabra create-workflow --workflow-dest my_qadabra
cd my_qadabra
You have now created a directory called my_qadabra with two subdirectories inside, config and workflow.
Create a data directory for your input files.
mkdir data
Navigate to the tutorial_data directory in the Qadabra GitHub repo: qadabra/qadabra/test_data/.
Download qadabra_tutorial_table.biom and qadabra_tutorial_metadata.tsv and move these files to your newly created data directory.
From the directory that contains my_qadabra (one level up from where you created data), run the add-dataset command:
qadabra add-dataset \
--workflow-dest my_qadabra \
--table my_qadabra/data/qadabra_tutorial_table.biom \
--metadata my_qadabra/data/qadabra_tutorial_metadata.tsv \
--name skin_microbiome \
--factor-name group \
--target-level Day_90 \
--reference-level Baseline \
--verbose
You can check that your dataset was added by opening config/datasets.tsv in your my_qadabra directory.
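It can also help to confirm that the metadata column passed as --factor-name really contains both the --target-level and --reference-level values. The following is a minimal sketch using only the Python standard library. The helper name check_factor_levels is hypothetical (it is not part of Qadabra), and the toy table with its sample IDs is invented for illustration.

```python
import csv
import io


def check_factor_levels(metadata_text, factor_name, target_level, reference_level):
    """Count samples per level of `factor_name`; raise if a requested level is missing."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter="\t")
    if factor_name not in (reader.fieldnames or []):
        raise ValueError(f"column {factor_name!r} not found in metadata")
    counts = {}
    for row in reader:
        level = row[factor_name]
        counts[level] = counts.get(level, 0) + 1
    for level in (target_level, reference_level):
        if level not in counts:
            raise ValueError(f"level {level!r} not found in column {factor_name!r}")
    return counts


# Toy stand-in for qadabra_tutorial_metadata.tsv; sample IDs are invented.
toy_metadata = (
    "sample_name\tgroup\n"
    "S1\tBaseline\n"
    "S2\tDay_90\n"
    "S3\tBaseline\n"
    "S4\tDay_90\n"
)

level_counts = check_factor_levels(toy_metadata, "group", "Day_90", "Baseline")
print(level_counts)
```

To run the same check on the real file, read its contents (for example with open(path).read()) and pass the text in place of toy_metadata.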
From the command line, execute the following from inside your my_qadabra directory to start the workflow:
snakemake --use-conda --cores 4
(This took approximately 8 minutes to run on a MacBook Pro with a 2.6 GHz quad-core Intel Core i7 and 16 GB of RAM.)
After Qadabra has finished running, you can generate a Snakemake report of the workflow with the following command:
snakemake --report report.zip
This will create a zipped directory containing the report. Unzip this file and open the report.html file to view the report containing results and visualizations in your browser.
Qadabra generates many result files and intermediate files that can be explored further.
The differential abundance results from Qadabra are reported as FDR-corrected p-values and feature ranks.
These results can be found in the results/<dataset_name>/ directory. Let's walk through the Qadabra results files:
- concatenated_differentials.tsv: TSV table containing the differentials from each method.
- concatenated_pvalues.tsv: TSV table containing the FDR-corrected p-values from each method.
- differentials_table.html: HTML table displaying concatenated_differentials.tsv.
- pvalues_table.html: HTML table displaying concatenated_pvalues.tsv.
- qadabra_all_result.tsv: TSV table containing differentials, FDR-corrected p-values, and the number of methods passing the significance threshold of 0.05 for each feature. (This table is used as the metadata for EMPress if a phylogenetic tree input is present.)
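The "number of methods passing the significance threshold" column can be reproduced by hand from a table of per-method p-values. The sketch below is illustrative only: it assumes a layout with one row per feature and one column per method, which may not match the exact layout of concatenated_pvalues.tsv, and the feature and method names are invented. It also assumes a strict p < 0.05 comparison.

```python
import csv
import io

ALPHA = 0.05  # significance threshold mentioned in the tutorial


def count_significant(pvalues_text, alpha=ALPHA):
    """Return {feature: number of methods with p-value below alpha}."""
    reader = csv.DictReader(io.StringIO(pvalues_text), delimiter="\t")
    id_column = reader.fieldnames[0]  # assumed: first column holds the feature ID
    result = {}
    for row in reader:
        n_passing = 0
        for column, value in row.items():
            # Skip the ID column and missing values.
            if column == id_column or value in ("", "NA", "nan"):
                continue
            if float(value) < alpha:
                n_passing += 1
        result[row[id_column]] = n_passing
    return result


# Toy p-value table; feature and method names are hypothetical.
toy_pvalues = (
    "feature_id\tmethod_a\tmethod_b\tmethod_c\n"
    "taxon_1\t0.001\t0.03\t0.2\n"
    "taxon_2\t0.5\t0.6\tNA\n"
    "taxon_3\t0.01\t0.04\t0.049\n"
)

n_significant = count_significant(toy_pvalues)
print(n_significant)
```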
Each method's individual outputs are stored in its own subdirectory, results/<dataset_name>/methods/<method>.
- differentials.tsv: This file contains the differential abundance results as output by each individual method.
- differentials.processed.tsv: This file extracts just the differentials column from differentials.tsv.
- pvalues.processed.tsv: This file extracts just the p-value column from differentials.tsv.
- results.rds: For the R methods (all except Songbird), an RDS object with the method's R data is saved.
A Qurro visualization of all the method ranks is generated at results/<dataset_name>/qurro/index.html.
For each method, the ranked features are used for machine learning models.
The results/<dataset_name>/ml subdirectory of each method contains the features used, sample log-ratios, and compressed model objects.
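One common way to turn a ranked feature list into a single value per sample is a log-ratio of the summed counts of the top-ranked features over the summed counts of the bottom-ranked features. The sketch below illustrates that idea under stated assumptions. The function name, the choice of k, and the pseudocount are all hypothetical, and whether Qadabra computes its sample log-ratios in exactly this way is not confirmed here.

```python
import math


def top_bottom_log_ratio(counts, differentials, k, pseudocount=1.0):
    """Per-sample natural log of (top-k summed counts / bottom-k summed counts).

    counts:        {sample: {feature: count}}
    differentials: {feature: differential}, used only to rank features
    """
    ranked = sorted(differentials, key=differentials.get, reverse=True)
    top, bottom = ranked[:k], ranked[-k:]
    log_ratios = {}
    for sample, feature_counts in counts.items():
        numerator = sum(feature_counts.get(f, 0) for f in top) + pseudocount
        denominator = sum(feature_counts.get(f, 0) for f in bottom) + pseudocount
        log_ratios[sample] = math.log(numerator / denominator)
    return log_ratios


# Toy data; feature names, differentials, and counts are invented.
toy_differentials = {"f1": 2.0, "f2": 0.5, "f3": -0.3, "f4": -1.5}
toy_counts = {
    "S1": {"f1": 9, "f2": 5, "f3": 5, "f4": 4},
    "S2": {"f1": 0, "f2": 7, "f3": 2, "f4": 0},
}

log_ratios = top_bottom_log_ratio(toy_counts, toy_differentials, k=1)
print(log_ratios)
```

The pseudocount keeps the ratio defined when a sample has zero counts for all selected features; other zero-handling strategies (such as dropping those samples) are equally plausible.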
Results from the PCA can be found under results/<dataset_name>/pca.
The generated Snakemake report contains the following folder structure:
- Differentials
  - Comparison
    - differentials_table.html: HTML table displaying concatenated_differentials.tsv.
  - UpSet plots: UpSet plots comparing the features from each method for the top and bottom 20%, 15%, 10%, and 5% of features.
  - Rank plots: Differential rank plots of each method. Features with a positive log ratio are more associated with target-level. Features with a negative log ratio are more associated with reference-level.
- P-values
  - Comparison
    - pvalues_table.html: HTML table displaying concatenated_pvalues.tsv.
    - kendall_pvalue_heatmap.svg: Heatmap showing the degree of concordance of p-values between methods.
    - pvalue_pw_comparisons.html: Interactive plot displaying pairwise correlations of p-values between any two p-value producing methods.
  - Volcano plots: Volcano plots for each p-value producing method.
  - EMPress plot: An EMPress.html file to interactively explore differential abundance results with respect to phylogenetic relationships (if a tree was provided).
- Summary
  - summary_figure_top.svg: Summary plot of average coefficients from each method and number of methods producing a significant p-value, for top features.
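The UpSet plots compare which features land in the top or bottom fraction of each method's ranking. The underlying set logic can be sketched as follows. This is an illustration with invented method names, feature names, and differentials; how Qadabra rounds the number of features in each fraction is an assumption here (round, with a minimum of one feature).

```python
def extreme_features(differentials, fraction, bottom=False):
    """Return the set of features in the top (or bottom) `fraction` by differential."""
    n_selected = max(1, round(len(differentials) * fraction))
    # Ascending order puts the most negative differentials first.
    ranked = sorted(differentials, key=differentials.get, reverse=not bottom)
    return set(ranked[:n_selected])


# Toy differentials for two hypothetical methods over ten features.
toy_methods = {
    "method_a": {"t1": 3.0, "t2": 2.5, "t3": 1.0, "t4": 0.5, "t5": 0.2,
                 "t6": -0.1, "t7": -0.4, "t8": -0.9, "t9": -2.0, "t10": -2.2},
    "method_b": {"t1": 1.2, "t2": 0.3, "t3": 1.5, "t4": 0.1, "t5": 0.0,
                 "t6": -0.2, "t7": -0.3, "t8": -0.5, "t9": -1.1, "t10": -0.9},
}

top_sets = {m: extreme_features(d, 0.20) for m, d in toy_methods.items()}
bottom_sets = {m: extreme_features(d, 0.20, bottom=True) for m, d in toy_methods.items()}

# Features that every method places in its top (or bottom) 20%.
shared_top = set.intersection(*top_sets.values())
shared_bottom = set.intersection(*bottom_sets.values())
print(sorted(shared_top), sorted(shared_bottom))
```

Features in shared_top have positive differentials in every method here, so under the sign convention described for the rank plots they would be associated with target-level; those in shared_bottom would be associated with reference-level.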
In this example dataset, all differential abundance methods agree, by both p-values and differentials, that Staphylococcus capitis is differentially abundant in the 90 Days Post-Treatment samples compared to the Baseline samples (not treated with S. capitis), as expected from the design of the clinical trial. Whether the application of the S. capitis strain successfully reduced Cutibacterium acnes populations is less clear from the differential abundance results. However, the tools that do report species as significantly decreased at Post-Treatment mostly report Cutibacterium species. The summary plot also agrees with these interpretations and provides a quick way to view the results from Qadabra.
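Agreement between methods, which the report summarizes in kendall_pvalue_heatmap.svg, can be quantified with Kendall's tau on the per-feature p-values of two methods. Below is a minimal pure-Python sketch of the tau-a variant (no tie correction). Which Kendall variant Qadabra uses is not stated in this tutorial, so treat the choice as an assumption; the p-values are invented.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, ignoring tie correction."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy p-values for the same four features from three hypothetical methods.
pvals_a = [0.001, 0.02, 0.3, 0.8]
pvals_b = [0.002, 0.01, 0.5, 0.9]   # same ordering as method a
pvals_c = [0.02, 0.001, 0.3, 0.8]   # first two features swapped

tau_ab = kendall_tau_a(pvals_a, pvals_b)
tau_ac = kendall_tau_a(pvals_a, pvals_c)
print(tau_ab, tau_ac)
```

A value near 1 means two methods order the features almost identically by p-value, while a value near -1 means the orderings are reversed.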
In some cases you may not want to run the full workflow and may only be interested in running certain methods.
If you navigate into your my_qadabra directory, you should see two folders: config and workflow. If you open the config/config.yaml file, you can see a number of options with which to run Qadabra. You can modify these as you like to skip certain parts of the workflow.
For example, if you want to run only DESeq2, ANCOM-BC, and Songbird, you can delete the other entries under the methods heading.
You can also specify additional confounders to incorporate into your DA model.
When adding a dataset, use --confounder <column name> to add a confounder into your model.
You can add multiple confounders by adding more --confounder <column name> arguments to add-dataset.
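When the list of confounders varies between runs, it can be convenient to assemble the add-dataset command programmatically. The sketch below only builds and prints the command using the flags shown in this tutorial. The helper name is hypothetical, and the confounder column names age and sex are invented placeholders that may not exist in the tutorial metadata.

```python
import shlex


def build_add_dataset_command(workflow_dest, table, metadata, name,
                              factor_name, target_level, reference_level,
                              confounders=()):
    """Assemble a `qadabra add-dataset` argument list, repeating --confounder per column."""
    command = [
        "qadabra", "add-dataset",
        "--workflow-dest", workflow_dest,
        "--table", table,
        "--metadata", metadata,
        "--name", name,
        "--factor-name", factor_name,
        "--target-level", target_level,
        "--reference-level", reference_level,
    ]
    for column in confounders:
        command += ["--confounder", column]
    return command


command = build_add_dataset_command(
    workflow_dest="my_qadabra",
    table="my_qadabra/data/qadabra_tutorial_table.biom",
    metadata="my_qadabra/data/qadabra_tutorial_metadata.tsv",
    name="skin_microbiome",
    factor_name="group",
    target_level="Day_90",
    reference_level="Baseline",
    confounders=["age", "sex"],  # hypothetical metadata columns
)
print(shlex.join(command))
```

Passing the resulting list to subprocess.run(command, check=True) would execute it; printing it first makes it easy to review before running.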
Qadabra allows users to visualize the differentials and p-values on an interactive phylogenetic tree using EMPress. With EMPress, you can annotate the tree with the differentials as barplots. This can be useful for determining phylogenetic signal in differential abundance. See the EMPress GitHub page for more information and a tutorial.
If you encounter any problems with Qadabra, please open a new issue in the GitHub Issues tab. Contributions are welcome and greatly appreciated. If you have any improvements or bug fixes, please follow these steps:
- Fork the repository.
- Clone the repository to your local computer:
git clone <link-to-forked-repo>
- Create a new branch:
git checkout -b feature/your-feature
- Make your changes and commit them:
git commit -m 'Add your feature'
- Push the branch to your forked repository:
git push origin feature/your-feature
- Open a pull request detailing your changes.
Please ensure that your code adheres to the existing code style and that you include appropriate tests.