Qadabra tutorial

Qadabra is a Snakemake workflow for running and comparing several differential abundance (DA) methods (tools) on the same microbiome dataset.

In this tutorial, we will run Qadabra on a small dataset to guide you through its usage. The dataset consists of skin microbiome samples swabbed from individuals with acne and sequenced by shotgun metagenomics. The samples come from a clinical trial in which a topical skin cream containing a strain of Staphylococcus capitis, which produces a potent antimicrobial peptide against Cutibacterium acnes (the bacterium most associated with acne), was applied to research volunteers to see whether their acne improved. Using Qadabra, we will investigate how the participants' skin microbiomes changed by running a differential abundance analysis of microbes pre- and post-treatment.

If you are interested in learning more about the skin microbiome in dermatological diseases, check out this review paper.

1. Installation

We recommend installing mamba to manage your Qadabra environment. (Note that mamba can interfere with existing conda environments; check this StackOverflow thread if planning to run both mamba and conda side by side.) Once mamba is installed, create and activate your Qadabra environment:

mamba create -n qadabra_env python=3.9
mamba activate qadabra_env
mamba install snakemake numpy cython

Install Qadabra and its additional dependencies using pip:

pip install qadabra click biom-format pandas iow

2. Create the workflow directory

qadabra create-workflow --workflow-dest my_qadabra
cd my_qadabra

You have now created a directory called my_qadabra with two subdirectories inside, config and workflow.

3. Create the dataset directory and download tutorial files

Create a data directory to hold your input files.

mkdir data

Navigate to the tutorial_data directory in the Qadabra GitHub repo: qadabra/qadabra/test_data/. Download qadabra_tutorial_table.biom and qadabra_tutorial_metadata.tsv and move these files to your newly created data directory.

4. Add your dataset to the Qadabra workflow

From the directory that contains my_qadabra (run cd .. first if you are still inside my_qadabra), run the add-dataset command:

qadabra add-dataset \
    --workflow-dest my_qadabra \
    --table my_qadabra/data/qadabra_tutorial_table.biom \
    --metadata my_qadabra/data/qadabra_tutorial_metadata.tsv \
    --name skin_microbiome \
    --factor-name group \
    --target-level Day_90 \
    --reference-level Baseline \
    --verbose

You can check that your dataset was added by opening config/datasets.tsv inside the my_qadabra directory.
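If you prefer to inspect the file from a script, a short Python sketch like the one below prints each row. It assumes only that datasets.tsv is tab-separated with a header row; it makes no assumption about the column names. The helper name read_tsv is ours and is not part of Qadabra.

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Return the rows of a tab-separated file as a list of dicts keyed by header."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Assumed location: relative to the directory from which add-dataset was run.
datasets_path = Path("my_qadabra/config/datasets.tsv")
if datasets_path.exists():
    for row in read_tsv(datasets_path):
        print(row)
```

The same helper can be pointed at any of the TSV result files described later in this tutorial.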

5. Running the workflow

From inside the my_qadabra directory, execute the following on the command line to start the workflow:

snakemake --use-conda --cores 4

(This took approximately 8 minutes to run on a MacBook Pro with a 2.6 GHz quad-core Intel Core i7 and 16 GB of RAM.)

6. Generating a report

After Qadabra has finished running, you can generate a Snakemake report of the workflow with the following command:

snakemake --report report.zip

This will create a zipped directory containing the report. Unzip this file and open the report.html file to view the report containing results and visualizations in your browser.

Exploring Qadabra outputs

Qadabra generates many results files and intermediate files that can be explored further.

Results files

The differential abundance results from Qadabra are reported as FDR-corrected p-values and feature ranks. These results can be found in the results/<dataset_name>/ directory. Let's walk through the Qadabra results files:

  • concatenated_differentials.tsv: TSV table containing the differentials from each method.
  • concatenated_pvalues.tsv: TSV table containing the FDR-corrected p-values from each method.
  • differentials_table.html: HTML table displaying concatenated_differentials.tsv.
  • pvalues_table.html: HTML table displaying concatenated_pvalues.tsv.
  • qadabra_all_result.tsv: TSV table containing differentials, FDR-corrected p-values, and the number of methods passing the significance threshold of 0.05 for each feature. (This table is used as the metadata for EMPress if a phylogenetic tree input is present.)
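The "number of methods passing the significance threshold" can be illustrated with a small sketch. The function below counts, for each feature, how many methods report an FDR-corrected p-value below 0.05. The input layout (one dict per feature, keyed by method name), the placeholder method names, and the values are illustrative assumptions, not Qadabra's own code or real results. A method with no p-value for a feature is represented as None and skipped, and treating "passing" as strictly less than the threshold is also an assumption.

```python
def count_significant(pvalues_by_feature, alpha=0.05):
    """Count, per feature, the methods whose p-value is below alpha."""
    counts = {}
    for feature, method_pvalues in pvalues_by_feature.items():
        counts[feature] = sum(
            1 for p in method_pvalues.values() if p is not None and p < alpha
        )
    return counts


# Made-up values for illustration only.
example = {
    "feature_A": {"method_1": 0.001, "method_2": 0.03, "method_3": None},
    "feature_B": {"method_1": 0.20, "method_2": 0.04, "method_3": 0.60},
}
print(count_significant(example))  # {'feature_A': 2, 'feature_B': 1}
```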

Each method's individual outputs are stored in its own subdirectory, results/<dataset_name>/methods/<method>/, which contains:

  • differentials.tsv: This file contains the differential abundance results as output by each individual method.
  • differentials.processed.tsv: This file extracts just the differentials column from differentials.tsv.
  • pvalues.processed.tsv: This file extracts just the p-value column from differentials.tsv.
  • results.rds: For the R methods (all except Songbird), an RDS object with the method's R data is saved.

A Qurro visualization of all the method ranks is generated at results/<dataset_name>/qurro/index.html.

For each method, the ranked features are used for machine learning models. The results/<dataset_name>/ml subdirectory of each method contains the features used, sample log-ratios, and compressed model objects.
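To give a sense of what a sample log-ratio is, here is a minimal sketch. It assumes the log-ratio compares the summed counts of the highest-ranked features against the summed counts of the lowest-ranked features within one sample. The pseudocount, the number of features taken from each end, and all names and values are illustrative assumptions; Qadabra's own computation may differ.

```python
import math


def sample_log_ratio(counts, ranked_features, n=1, pseudocount=1.0):
    """Log-ratio of top-n versus bottom-n ranked features for one sample.

    counts: dict mapping feature ID to its count in the sample.
    ranked_features: feature IDs ordered from lowest to highest differential.
    """
    if n < 1 or 2 * n > len(ranked_features):
        raise ValueError("n must be at least 1 and at most half the feature count")
    bottom = ranked_features[:n]
    top = ranked_features[-n:]
    numerator = sum(counts.get(f, 0) for f in top) + pseudocount
    denominator = sum(counts.get(f, 0) for f in bottom) + pseudocount
    return math.log(numerator / denominator)


# Made-up counts for a single sample.
ranked = ["f1", "f2", "f3", "f4"]
sample_counts = {"f1": 9, "f2": 5, "f3": 0, "f4": 99}
print(sample_log_ratio(sample_counts, ranked, n=1))  # log(100 / 10)
```

A positive value means the top-ranked features dominate in that sample; a negative value means the bottom-ranked features do.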

Results from the PCA analysis can be found under results/<dataset_name>/pca.

Figures

The generated Snakemake report contains the following folder structure:

  • Differentials

    • Comparison

      • differentials_table.html: HTML table displaying concatenated_differentials.tsv.
      • kendall_diff_heatmap.svg: Heatmap showing degree of concordance of differentials between methods.
      • differential_pw_comparisons.html: Interactive plot displaying pairwise correlations of differentials between any two DA methods.
      • pca.svg: PCA plot showing method-specific effects on the ranking of features.
      • qurro/index.html: Interactively explore feature ranks with Qurro.
    • UpSet plots: UpSet plots comparing the features from each method for top and bottom 20%, 15%, 10%, and 5% of features.
    • Rank plots: Differential rank plots of each method. Features with a positive log ratio are more associated with the target level (here, Day_90). Features with a negative log ratio are more associated with the reference level (here, Baseline).
  • P-values

    • Comparison

      • pvalues_table.html: HTML table displaying concatenated_pvalues.tsv.
      • kendall_pvalue_heatmap.svg: Heatmap showing degree of concordance of p-values between methods.
      • pvalue_pw_comparisons.html: Interactive plot displaying pairwise correlations of p-values between any two p-value producing methods.
    • Volcano plots: Volcano plots for each p-value producing method.

    • EMPress plot: An EMPress.html file to interactively explore differential abundance results with respect to phylogenetic relationships (if a tree was provided).

  • Summary

    • summary_figure_top.svg: Summary plot of average coefficients from each method and number of methods producing a significant p-value, for top features.
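The Kendall heatmaps listed above summarize how similarly two methods order the features. As a rough illustration, the sketch below computes Kendall's tau-a between two lists of differentials for the same features, using only the standard library. This is a simplified variant without tie correction; the workflow's own computation may use a different variant, and the numbers are made up.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Pairs tied in either list count toward neither total.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical differentials for the same four features from two methods.
method_a = [2.1, 0.5, -0.3, -1.8]
method_b = [1.7, -0.2, 0.4, -2.5]
print(round(kendall_tau_a(method_a, method_b), 3))  # 0.667
```

A value near 1 means the two methods rank features in nearly the same order, a value near -1 means the orders are reversed, and a value near 0 means little agreement.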

Interpretations

In this example dataset, all differential abundance methods agree, by both p-values and differentials, that Staphylococcus capitis is differentially abundant in the 90 days post-treatment samples compared to the Baseline samples (not treated with S. capitis), as expected from the design of the clinical trial. Whether the application of the S. capitis strain successfully reduced Cutibacterium acnes populations remains open to debate based on the differential abundance results. However, among the tools that do report species significantly decreased post-treatment, most of those species are Cutibacterium. The summary plot agrees with these interpretations and provides a quick way to view the results from Qadabra.

Additional workflow options

Workflow subsetting

In some cases you may not want to run the full workflow and may want to run only certain methods.

The config/config.yaml file inside your my_qadabra directory contains a number of options that control how Qadabra runs. You can modify these to skip parts of the workflow. For example, to run only DESeq2, ANCOM-BC, and Songbird, delete the other entries under the methods heading.

Incorporating confounders

You can also specify additional confounders to incorporate into your DA model. When adding a dataset, use --confounder <column name> to add a confounder into your model. You can add multiple confounders by adding more --confounder <column name> arguments to add-dataset.
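As a sketch of how repeated --confounder arguments fit into the add-dataset call from step 4, the helper below assembles the command as an argument list and prints it. The confounder column names age and sex are hypothetical placeholders; substitute columns that actually exist in your metadata file. The command is only printed here, not executed.

```python
import shlex


def build_add_dataset_command(confounders):
    """Assemble the tutorial's add-dataset command with one --confounder per column."""
    command = [
        "qadabra", "add-dataset",
        "--workflow-dest", "my_qadabra",
        "--table", "my_qadabra/data/qadabra_tutorial_table.biom",
        "--metadata", "my_qadabra/data/qadabra_tutorial_metadata.tsv",
        "--name", "skin_microbiome",
        "--factor-name", "group",
        "--target-level", "Day_90",
        "--reference-level", "Baseline",
    ]
    for column in confounders:
        command += ["--confounder", column]
    return command


# "age" and "sex" are hypothetical metadata columns used for illustration.
command = build_add_dataset_command(["age", "sex"])
print(shlex.join(command))
```

To run the assembled command from Python rather than pasting it into a shell, the list can be passed to subprocess.run(command, check=True).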

Phylogenetic visualization

Qadabra allows users to visualize the differentials and p-values on an interactive phylogenetic tree using EMPress. With EMPress, you can annotate the tree with the differentials as barplots. This can be useful for determining phylogenetic signal in differential abundance. See the EMPress GitHub page for more information and a tutorial.

Issues and contributing

If you encounter any problems with Qadabra, please open a new issue in the GitHub Issues tab. Contributions are welcome and greatly appreciated. If you have any improvements or bug fixes, please follow these steps:

  1. Fork the repository.
  2. Clone the repository to your local computer: git clone <link-to-forked-repo>
  3. Create a new branch: git checkout -b feature/your-feature
  4. Make your changes and commit them: git commit -m 'Add your feature'
  5. Push the branch to your forked repository: git push origin feature/your-feature
  6. Open a pull request detailing your changes.

Please ensure that your code adheres to the existing code style and that you include appropriate tests.