Refactor to v0.3.0a1 (#30)

* Updates to workflow for multiple datasets
* Fix #27
* Update GH Action
* Move workflow structure
* updated gh actions
* update installation
* Update Makefile
* update songbird env
* add wflow dest to qadabra cli
* Change correlation to kendall
* add stratification to repeated kfold
* update README
* new test data
* Pin r-detectseparation & r-base statdivlab/corncob#141 (comment)
biocore · Sep 12, 2022 · 7d11468
1 parent c5ed705
commit 7d11468

Showing 64 changed files with 9,129 additions and 470 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -16,36 +16,35 @@ on:
       - "README.md"
 
 jobs:
-  Linting:
+  build:
     runs-on: ubuntu-latest
+
     steps:
-    - uses: actions/checkout@v2
-    - name: Lint workflow
-      uses: snakemake/[email protected]
-      with:
-        directory: .
-        snakefile: workflow/Snakefile
-        stagein: "mamba install -y -n snakemake --channel conda-forge --channel bioconda"
-        args: "--lint"
-
-  Testing:
-    runs-on: ubuntu-latest
-    needs:
-      - Linting
-    steps:
-    - uses: actions/checkout@v2
-
-    - name: Test workflow
-      uses: snakemake/snakemake-github-action@v1
-      with:
-        directory: .
-        snakefile: workflow/Snakefile
-        args: "--use-conda --show-failed-logs -j 10 --conda-cleanup-pkgs cache --conda-frontend mamba"
-        stagein: "conda config --get channel_priority --json"
-
-    - name: Test report
-      uses: snakemake/snakemake-github-action@v1
-      with:
-        directory: .
-        snakefile: workflow/Snakefile
-        args: "--report report.zip"
+      - uses: actions/checkout@v2
+        with:
+          persist-credentials: false
+          fetch-depth: 0
+
+      - uses: conda-incubator/setup-miniconda@v2
+        with:
+          activate-environment: qadabra
+          mamba-version: "*"
+          channels: conda-forge,defaults,bioconda
+          channel-priority: true
+          python-version: "3.8"
+
+      - name: Install conda packages
+        shell: bash -l {0}
+        run: mamba install snakemake click biom-format pandas numpy cython
+
+      - name: Install pip packages
+        shell: bash -l {0}
+        run: pip install iow
+
+      - name: Install qadabra
+        shell: bash -l {0}
+        run: pip install -e .
+
+      - name: Run Snakemake
+        shell: bash -l {0}
+        run: make snaketest
diff --git a/.gitignore b/.gitignore
@@ -2,3 +2,5 @@
 *.swp
 *.snakemake
 *__pycache__
+*egg-info/
+config/datasets.tsv
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,2 @@
+graft workflow
+graft config
diff --git a/Makefile b/Makefile
@@ -1,2 +1,11 @@
-create_rulegraph:
-	snakemake -f --rulegraph | dot -Tpng > imgs/rule_graph.png
+TMPDIR := $(shell mktemp -d)
+TABLE_FILE := $(shell realpath qadabra/test_data/table.biom)
+MD_FILE := $(shell realpath qadabra/test_data/metadata.tsv)
+
+snaketest:
+	@set -e; \
+	echo $(TMPDIR); \
+	qadabra create-workflow --workflow-dest $(TMPDIR); \
+	qadabra add-dataset --workflow-dest $(TMPDIR) --table $(TABLE_FILE) --metadata $(MD_FILE) --name "ampharos" --factor-name comparison --target-level CD --reference-level control --verbose; \
+	cd $(TMPDIR); \
+	snakemake --use-conda --cores 4 --retries 2
diff --git a/README.md b/README.md
@@ -1,44 +1,88 @@
 ![Main CI](https://github.com/gibsramen/qadabra/actions/workflows/main.yml/badge.svg)
 
-# qadabra
+# Qadabra
 
 **Q**uantitative **A**nalysis of **D**ifferential **Ab**undance **Ra**nks
 
 Qadabra is a Snakemake workflow for comparing the results of differential abundance tools.
 Importantly, Qadabra focuses on feature *ranks* rather than FDR corrected p-values.
 
+## Installation
+
+```
+pip install qadabra
+```
+
+Qadabra requires the following dependencies:
+
+* snakemake
+* click
+* biom-format
+* pandas
+* numpy
+* cython
+* iow
+
 ## Usage
 
-Qadabra requires both [Snakemake](https://snakemake.readthedocs.io/en/stable/) and [Snakedeploy](https://snakedeploy.readthedocs.io/en/latest/) to be installed.
+### Creating the workflow structure
+
+Qadabra can be used on multiple datasets at once.
+First, we want to create the workflow structure to perform differential abundance with all tools.
 
 ```
-snakedeploy deploy-workflow https://github.com/gibsramen/qadabra qadabra_dir --tag v0.2.1
+qadabra create-workflow --workflow-dest my_qadabra
 ```
 
-This will download the workflow to your local machine.
-Enter this directory, open the `config/config.yaml` file, and replace the `table` and `metadata` entries to point to your feature table and sample metadata files.
-You should also change the model covariate, target, and reference category.
+This command will initialize the workflow, but we still need to point to our dataset(s) of interest.
+
+### Adding a dataset
+
+We can add datasets one-by-one with the `add-dataset` command.
+
+```
+qadabra add-dataset \
+    --workflow-dest my_qadabra \
+    --table data/table.biom \
+    --metadata data/metadata.tsv \
+    --name my_dataset_1 \
+    --factor-name case_control \
+    --target-level case \
+    --reference-level control \
+    --verbose
+```
 
-* `covariate`: Name of the categorical metadata column to perform differential abundance
-* `target`: Level of `covariate` on which you are interested in performing differential abundance
-* `reference`: Reference category for log-fold change calculation
+Let's walk through the arguments provided here:
 
-If you have other confounders, you can include them under the `confounders` heading.
-Delete these if you are not including any additional confounders as follows:
+* `workflow-dest`: The location of the workflow that we created earlier
+* `table`: Feature table (features by samples) in BIOM format
+* `metadata`: Sample metadata in TSV format
+* `name`: Name to give this dataset
+* `factor-name`: Metadata column to use for differential abundance
+* `target-level`: The value in the chosen factor to use as the target
+* `reference-level`: The reference level to which we want to compare our target
+* `verbose`: Flag to show all preprocessing performed by Qadabra
 
-`confounders: `
+You can use `qadabra add-dataset --help` for more details.
+To add another dataset, just run this command again with the new dataset information.
 
-Qadabra can also output an [EMPress plot](https://journals.asm.org/doi/10.1128/mSystems.01216-20) of a phylogenetic tree annotated with each tool's differentials.
-Change the `tree` entry to point to this file.
-If you do not have a phylogenetic tree, delete this entry and EMPress will not be run.
+### Running the workflow
 
-`tree: `
+The previous commands will create a subdirectory, `my_qadabra`, in which the workflow structure is contained.
+Navigate into this directory; you should see two folders: `config` and `workflow`.
+If you open the `config/config.yaml` file, you can see a number of options with which to run Qadabra.
+You can modify these as you like.
+For example, if you want to only run DESeq2, ANCOM-BC, and Songbird, you can delete the other entries in the `tools` heading.
 
-Run `snakemake --use-conda <other options>` to start the workflow.
+From the command line, execute `snakemake --use-conda <other options>` to start the workflow.
 Please read the [Snakemake documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html) for how to run Snakemake best on your system.
 
 When this process is completed, you should have directories `figures`, `results`, and `log`.
-You can also generate a report of the workflow with the following command
+Each of these directories will have a separate folder for each dataset you added.
+
+### Generating a report
+
+You can also generate a report of the workflow with the following command:
 
 ```
 snakemake --report report.zip
@@ -47,9 +91,26 @@ snakemake --report report.zip
 This will create a zipped directory containing the report.
 Unzip this file and open the `report.html` file to view the report in your browser.
 
-## Workflow Overview
+## Additional workflow options
+
+### Workflow subset
+
+In some cases you may not want to run the full workflow and may only be interested in just running the different tools.
+You can use `snakemake all_differentials --use-conda <other options>` to eschew the machine learning and visualization parts of the workflow.
 
-![rulegraph](imgs/rule_graph.png)
+### Phylogenetic visualization
+
+Qadabra allows users to visualize the differentials on a phylogenetic tree using [EMPress](https://journals.asm.org/doi/10.1128/mSystems.01216-20).
+With EMPress, you can annotate the tree with the differentials as barplots.
+This can be useful for determining phylogenetic signal in differential abundance.
+
+### Incorporating confounders
+
+You can also specify additional confounders to incorporate into your DA model.
+When adding a dataset, use `--confounder <column name>` to add a confounder into your model.
+You can add multiple confounders by adding more `--confounder <column name>` arguments to `add-dataset`.
+
+## Workflow Overview
 
 Qadabra runs several differential abundance tools on the same dataset.
 The features are ranked according to their association with the given metadata covariate.
@@ -65,8 +126,8 @@ Qadabra generates many results files including many intermediate files that can
 Each tool's output is stored in a separate subdirectory.
 For the R tools, an RDS object with the tool's R data is saved.
 The raw outputs are processed and concatenated into a file called `concatenated_differentials.tsv`.
-A Qurro visualization of all the tool ranks is generated at `results/qurro/index.html`.
-An interactive table with all the tool outputs is at `results/differentials_table.html`.
+A Qurro visualization of all the tool ranks is generated at `results/<dataset>/qurro/index.html`.
+An interactive table with all the tool outputs is at `results/<dataset>/differentials_table.html`.
 
 For each tool, the ranked features are used for machine learning models.
 The `config.yaml` file enumerates the percentile of feats to use for log-ratios.
@@ -77,9 +138,9 @@ The `ml` subdirectory of each tool contains the features used, sample log-ratios
 #### Figures
 
 The differential rank plots of each tool are plotted as `<tool_name>_differentials.svg`.
-A heatmap of the pairwise Spearman correlation among all pairs of tools is available as well.
-We also generated interactive plots to help compare the ranks of different microbes from the tools.
-`figures/pca.svg` generates a PCA plot of all the microbes, showing the concordance and discordance of results as well as the contribution of the tools.
+A heatmap of the pairwise Kendall rank correlation among all pairs of tools is available as well.
+We also generated interactive plots to help compare the ranks of different features from the tools.
+`figures/pca.svg` generates a PCA plot of all the features, showing the concordance and discordance of results as well as the contribution of the tools.
 You can use the `figures/rank_comparisons.html` webpage to dynamically explore the relationship between pairs of tools.
 The `upset` subdirectory contains [UpSet](https://doi.org/10.1109%2FTVCG.2014.2346248) plots comparing the features from each tool.
 Finally, the `roc` and `pr` subdirectories contain ROC and PR (respectively) plots of all tools at each percentile of features.
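
The README change above swaps the heatmap statistic from Spearman to Kendall rank correlation. As a rough illustration of what that statistic measures, here is a minimal, dependency-free sketch of Kendall's tau-b computed for every pair of tools. It is not taken from the Qadabra source: the function names `kendall_tau_b` and `pairwise_kendall` and the toy differentials are hypothetical, and the workflow itself may well call a library routine instead.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences of scores."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    denom = sqrt(untied_in_x * untied_in_y)
    if denom == 0:
        return float("nan")  # undefined when one sequence is constant
    return (concordant - discordant) / denom


def pairwise_kendall(differentials):
    """Map each pair of tool names to the tau-b of their differentials."""
    tools = sorted(differentials)
    return {
        (a, b): kendall_tau_b(differentials[a], differentials[b])
        for a, b in combinations(tools, 2)
    }


if __name__ == "__main__":
    # Made-up differentials for five features; values are illustrative only.
    toy = {
        "deseq2": [2.1, 0.4, -1.3, 0.9, -0.2],
        "ancombc": [1.8, 0.1, -0.9, 1.2, -0.5],
        "songbird": [0.7, 0.6, -2.0, 0.2, 0.3],
    }
    for pair, tau in pairwise_kendall(toy).items():
        print(pair, round(tau, 3))
```

A tau close to 1 means two tools order the features almost identically, a value near 0 means their orderings are unrelated, and a value close to -1 means the orderings are reversed. Only the ordering matters, which fits the README's emphasis on feature ranks over p-values.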
diff --git a/data/tree.nwk b/data/tree.nwk
diff --git a/imgs/rule_graph.png b/imgs/rule_graph.png
diff --git a/qadabra/__init__.py b/qadabra/__init__.py
@@ -0,0 +1 @@
+__version__ = "0.3.0a1"
diff --git a/config/config.yaml → qadabra/config/config.yaml b/config/config.yaml → qadabra/config/config.yaml
@@ -1,14 +1,4 @@
-table: "data/table.biom"
-metadata: "data/metadata.tsv"
-tree: "data/tree.nwk"
 stylesheet: "config/qadabra.mplstyle"
-model:
-    covariate: anemia
-    target: anemic
-    reference: normal
-    confounders:
-        - sex
-        - collection_cutoff
 tools:
     - deseq2
     - ancombc
@@ -27,5 +17,5 @@ log_ratio_feat_pcts:
     - 15
     - 20
 ml_params:
-    n_splits: 10
+    n_splits: 5
     n_repeats: 5
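
The commit message mentions adding stratification to the repeated k-fold, and the hunk above sets `n_splits: 5` alongside `n_repeats: 5`. The sketch below shows, under stated assumptions, what those two parameters imply: 25 train/test splits in total, each test fold keeping roughly the class balance of the full dataset. It is a standalone illustration, not the workflow's code; `repeated_stratified_kfold` and the `seed` argument are hypothetical, and the workflow may instead use a library class such as scikit-learn's `RepeatedStratifiedKFold`.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs, n_splits * n_repeats in total."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)  # a fresh shuffle per repeat gives different folds
            # Deal indices round-robin so every fold gets a near-equal share of the class.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train, test


if __name__ == "__main__":
    # Toy labels reusing the levels from the Makefile test target.
    labels = ["CD"] * 10 + ["control"] * 10
    splits = list(repeated_stratified_kfold(labels))
    print(len(splits), "splits")
    train, test = splits[0]
    print("first test fold:", [labels[i] for i in test])
```

Because each class is dealt into the folds separately, a small or imbalanced target group is still represented in every test fold, which keeps per-fold ROC and PR estimates from being computed on a single class.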
diff --git a/config/qadabra.mplstyle → qadabra/config/qadabra.mplstyle b/config/qadabra.mplstyle → qadabra/config/qadabra.mplstyle