Refactor to v0.3.0a1 (#30)
* Updates to workflow for multiple datasets

* Fix #27

* Update GH Action

* Move workflow structure

* updated gh actions

* update installation

* Update Makefile

* update songbird env

* add wflow dest to qadabra cli

* Change correlation to kendall

* add stratification to repeated kfold

* update README

* new test data

* Pin r-detectseparation & r-base

statdivlab/corncob#141 (comment)
gibsramen authored Sep 12, 2022
1 parent c5ed705 commit 7d11468
Showing 64 changed files with 9,129 additions and 470 deletions.
61 changes: 30 additions & 31 deletions .github/workflows/main.yml
@@ -16,36 +16,35 @@ on:
- "README.md"

jobs:
Linting:
build:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2
- name: Lint workflow
uses: snakemake/[email protected]
with:
directory: .
snakefile: workflow/Snakefile
stagein: "mamba install -y -n snakemake --channel conda-forge --channel bioconda"
args: "--lint"

Testing:
runs-on: ubuntu-latest
needs:
- Linting
steps:
- uses: actions/checkout@v2

- name: Test workflow
uses: snakemake/snakemake-github-action@v1
with:
directory: .
snakefile: workflow/Snakefile
args: "--use-conda --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
stagein: "conda config --get channel_priority --json"

- name: Test report
uses: snakemake/snakemake-github-action@v1
with:
directory: .
snakefile: workflow/Snakefile
args: "--report report.zip"
- uses: actions/checkout@v2
with:
persist-credentials: false
fetch-depth: 0

- uses: conda-incubator/setup-miniconda@v2
with:
activate-environment: qadabra
mamba-version: "*"
channels: conda-forge,defaults,bioconda
channel-priority: true
python-version: "3.8"

- name: Install conda packages
shell: bash -l {0}
run: mamba install snakemake click biom-format pandas numpy cython

- name: Install pip packages
shell: bash -l {0}
run: pip install iow

- name: Install qadabra
shell: bash -l {0}
run: pip install -e .

- name: Run Snakemake
shell: bash -l {0}
run: make snaketest
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,3 +2,5 @@
*.swp
*.snakemake
*__pycache__
*egg-info/
config/datasets.tsv
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,2 @@
graft workflow
graft config
13 changes: 11 additions & 2 deletions Makefile
@@ -1,2 +1,11 @@
create_rulegraph:
snakemake -f --rulegraph | dot -Tpng > imgs/rule_graph.png
TMPDIR := $(shell mktemp -d)
TABLE_FILE := $(shell realpath qadabra/test_data/table.biom)
MD_FILE := $(shell realpath qadabra/test_data/metadata.tsv)

snaketest:
@set -e; \
echo $(TMPDIR); \
qadabra create-workflow --workflow-dest $(TMPDIR); \
qadabra add-dataset --workflow-dest $(TMPDIR) --table $(TABLE_FILE) --metadata $(MD_FILE) --name "ampharos" --factor-name comparison --target-level CD --reference-level control --verbose; \
cd $(TMPDIR); \
snakemake --use-conda --cores 4 --retries 2
111 changes: 86 additions & 25 deletions README.md
@@ -1,44 +1,88 @@
![Main CI](https://github.com/gibsramen/qadabra/actions/workflows/main.yml/badge.svg)

# qadabra
# Qadabra

**Q**uantitative **A**nalysis of **D**ifferential **Ab**undance **Ra**nks

Qadabra is a Snakemake workflow for comparing the results of differential abundance tools.
Importantly, Qadabra focuses on feature *ranks* rather than FDR corrected p-values.

## Installation

```
pip install qadabra
```

Qadabra requires the following dependencies:

* snakemake
* click
* biom-format
* pandas
* numpy
* cython
* iow

## Usage

Qadabra requires both [Snakemake](https://snakemake.readthedocs.io/en/stable/) and [Snakedeploy](https://snakedeploy.readthedocs.io/en/latest/) to be installed.
### Creating the workflow structure

Qadabra can be used on multiple datasets at once.
First, we want to create the workflow structure to perform differential abundance with all tools.

```
snakedeploy deploy-workflow https://github.com/gibsramen/qadabra qadabra_dir --tag v0.2.1
qadabra create-workflow --workflow-dest my_qadabra
```

This will download the workflow to your local machine.
Enter this directory, open the `config/config.yaml` file, and replace the `table` and `metadata` entries to point to your feature table and sample metadata files.
You should also change the model covariate, target, and reference category.
This command will initialize the workflow but we still need to point to our dataset(s) of interest.

### Adding a dataset

We can add datasets one-by-one with the `add-dataset` command.

```
qadabra add-dataset \
--workflow-dest my_qadabra \
--table data/table.biom \
--metadata data/metadata.tsv \
--name my_dataset_1 \
--factor-name case_control \
--target-level case \
--reference-level control \
--verbose
```

* `covariate`: Name of the categorical metadata column to perform differential abundance
* `target`: Level of `covariate` on which you are interested in performing differential abundance
* `reference`: Reference category for log-fold change calculation
Let's walk through the arguments provided here:

If you have other confounders, you can include them under the `confounders` heading.
Delete these if you are not including any additional confounders as follows:
* `workflow-dest`: The location of the workflow that we created earlier
* `table`: Feature table (features by samples) in BIOM format
* `metadata`: Sample metadata in TSV format
* `name`: Name to give this dataset
* `factor-name`: Metadata column to use for differential abundance
* `target-level`: The value in the chosen factor to use as the target
* `reference-level`: The reference level to which we want to compare our target
* `verbose`: Flag to show all preprocessing performed by Qadabra

`confounders: `
You can use `qadabra add-dataset --help` for more details.
To add another dataset, just run this command again with the new dataset information.
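
Before running `add-dataset`, it can help to confirm that the metadata file actually contains the factor column and both levels you plan to pass. The following is a minimal sketch of such a pre-flight check, not part of Qadabra itself: the function name `check_factor_levels` is hypothetical, and the in-memory TSV stands in for a real `data/metadata.tsv`.

```python
import csv
import io


def check_factor_levels(metadata_tsv, factor_name, target_level, reference_level):
    """Return a list of problems; an empty list means the arguments look usable.

    `metadata_tsv` is any open text file object containing tab-separated metadata.
    This is a hypothetical helper, not Qadabra's own preprocessing.
    """
    reader = csv.DictReader(metadata_tsv, delimiter="\t")
    if factor_name not in (reader.fieldnames or []):
        return [f"column '{factor_name}' not found in metadata"]

    # Collect every distinct value observed in the chosen factor column.
    levels = {row[factor_name] for row in reader}

    problems = []
    for label, level in (("target", target_level), ("reference", reference_level)):
        if level not in levels:
            problems.append(
                f"{label} level '{level}' not found in column '{factor_name}'"
            )
    return problems


# Tiny in-memory stand-in for data/metadata.tsv (illustrative values only).
example = io.StringIO(
    "sample_id\tcase_control\n"
    "S1\tcase\n"
    "S2\tcontrol\n"
    "S3\tcase\n"
)
print(check_factor_levels(example, "case_control", "case", "control"))  # prints []
```

With a real file, the same function could be called as `check_factor_levels(open("data/metadata.tsv"), "case_control", "case", "control")`.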

Qadabra can also output an [EMPress plot](https://journals.asm.org/doi/10.1128/mSystems.01216-20) of a phylogenetic tree annotated with each tool's differentials.
Change the `tree` entry to point to this file.
If you do not have a phylogenetic tree, delete this entry and EMPress will not be run.
### Running the workflow

`tree: `
The previous commands will create a subdirectory, `my_qadabra`, in which the workflow structure is contained.
Navigate into this directory; you should see two folders: `config` and `workflow`.
If you open the `config/config.yaml` file, you can see a number of options with which to run Qadabra.
You can modify these as you like.
For example, if you want to only run DESeq2, ANCOM-BC, and Songbird, you can delete the other entries in the `tools` heading.

Run `snakemake --use-conda <other options>` to start the workflow.
From the command line, execute `snakemake --use-conda <other options>` to start the workflow.
Please read the [Snakemake documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html) for how to run Snakemake best on your system.

When this process is completed, you should have directories `figures`, `results`, and `log`.
You can also generate a report of the workflow with the following command
Each of these directories will have a separate folder for each dataset you added.

### Generating a report

You can also generate a report of the workflow with the following command:

```
snakemake --report report.zip
@@ -47,9 +91,26 @@ snakemake --report report.zip
This will create a zipped directory containing the report.
Unzip this file and open the `report.html` file to view the report in your browser.

## Workflow Overview
## Additional workflow options

### Workflow subset

In some cases you may not want to run the full workflow and may only be interested in just running the different tools.
You can use `snakemake all_differentials --use-conda <other options>` to eschew the machine learning and visualization parts of the workflow.

![rulegraph](imgs/rule_graph.png)
### Phylogenetic visualization

Qadabra allows users to visualize the differentials on a phylogenetic tree using [EMPress](https://journals.asm.org/doi/10.1128/mSystems.01216-20).
With EMPress, you can annotate the tree with the differentials as barplots.
This can be useful for determining phylogenetic signal in differential abundance.

### Incorporating confounders

You can also specify additional confounders to incorporate into your DA model.
When adding a dataset, use `--confounder <column name>` to add a confounder into your model.
You can add multiple confounders by adding more `--confounder <column name>` arguments to `add-dataset`.
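
When adding many datasets from a script, the repeated `--confounder` arguments can be assembled programmatically. The sketch below only builds the argument list using the flags described in this README; the helper name `build_add_dataset_command` is hypothetical, and the confounder names `sex` and `collection_cutoff` are example column names.

```python
import shlex


def build_add_dataset_command(workflow_dest, table, metadata, name,
                              factor_name, target_level, reference_level,
                              confounders=()):
    """Assemble a `qadabra add-dataset` argument list (hypothetical helper)."""
    cmd = [
        "qadabra", "add-dataset",
        "--workflow-dest", workflow_dest,
        "--table", table,
        "--metadata", metadata,
        "--name", name,
        "--factor-name", factor_name,
        "--target-level", target_level,
        "--reference-level", reference_level,
    ]
    # One --confounder flag per metadata column, as described above.
    for column in confounders:
        cmd += ["--confounder", column]
    return cmd


cmd = build_add_dataset_command(
    "my_qadabra", "data/table.biom", "data/metadata.tsv", "my_dataset_1",
    "case_control", "case", "control",
    confounders=["sex", "collection_cutoff"],
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if you want to execute it rather than print it.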

## Workflow Overview

Qadabra runs several differential abundance tools on the same dataset.
The features are ranked according to their association with the given metadata covariate.
@@ -65,8 +126,8 @@ Qadabra generates many results files including many intermediate files that can
Each tool's output is stored in a separate subdirectory.
For the R tools, an RDS object with the tool's R data is saved.
The raw outputs are processed and concatenated into a file called `concatenated_differentials.tsv`.
A Qurro visualization of all the tool ranks is generated at `results/qurro/index.html`.
An interactive table with all the tool outputs is at `results/differentials_table.html`.
A Qurro visualization of all the tool ranks is generated at `results/<dataset>/qurro/index.html`.
An interactive table with all the tool outputs is at `results/<dataset>/differentials_table.html`.

For each tool, the ranked features are used for machine learning models.
The `config.yaml` file enumerates the percentile of feats to use for log-ratios.
@@ -77,9 +138,9 @@ The `ml` subdirectory of each tool contains the features used, sample log-ratios
#### Figures

The differential rank plots of each tool are plotted as `<tool_name>_differentials.svg`.
A heatmap of the pairwise Spearman correlation among all pairs of tools is available as well.
We also generated interactive plots to help compare the ranks of different microbes from the tools.
`figures/pca.svg` generates a PCA plot of all the microbes, showing the concordance and discordance of results as well as the contribution of the tools.
A heatmap of the pairwise Kendall rank correlation among all pairs of tools is available as well.
We also generated interactive plots to help compare the ranks of different features from the tools.
`figures/pca.svg` generates a PCA plot of all the features, showing the concordance and discordance of results as well as the contribution of the tools.
You can use the `figures/rank_comparisons.html` webpage to dynamically explore the relationship between pairs of tools.
The `upset` subdirectory contains [UpSet](https://doi.org/10.1109%2FTVCG.2014.2346248) plots comparing the features from each tool.
Finally, the `roc` and `pr` subdirectories contain ROC and PR (respectively) plots of all tools at each percentile of features.
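
To give a sense of what the Kendall rank correlation in the heatmap measures, here is a minimal pure-Python sketch. It computes the simple tau-a variant, which assumes no tied ranks; this is an illustration rather than the workflow's own implementation. Library routines such as `scipy.stats.kendalltau` compute tau-b by default, which also handles ties. The rank vectors `tool_a` and `tool_b` are made-up values.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both tools order features i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of five features from two tools.
tool_a = [1, 2, 3, 4, 5]
tool_b = [2, 1, 3, 5, 4]
print(kendall_tau_a(tool_a, tool_b))  # prints 0.6
```

Because the statistic depends only on the relative ordering of features, it fits Qadabra's focus on feature ranks rather than p-values.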
1 change: 0 additions & 1 deletion data/tree.nwk

This file was deleted.

Binary file removed imgs/rule_graph.png
Binary file not shown.
1 change: 1 addition & 0 deletions qadabra/__init__.py
@@ -0,0 +1 @@
__version__ = "0.3.0a1"
12 changes: 1 addition & 11 deletions config/config.yaml → qadabra/config/config.yaml
@@ -1,14 +1,4 @@
table: "data/table.biom"
metadata: "data/metadata.tsv"
tree: "data/tree.nwk"
stylesheet: "config/qadabra.mplstyle"
model:
covariate: anemia
target: anemic
reference: normal
confounders:
- sex
- collection_cutoff
tools:
- deseq2
- ancombc
@@ -27,5 +17,5 @@ log_ratio_feat_pcts:
- 15
- 20
ml_params:
n_splits: 10
n_splits: 5
n_repeats: 5
File renamed without changes.