Running microbiome diversity analyses has become easy thanks to tools such as qiime2. Yet, chaining the many command lines needed for basic steps such as importing files in the right format, removing rare species, subsetting samples, collapsing features based on taxonomic rank or names, and running all of this on a High-Performance Computer (HPC), can be cumbersome. This wrapper does the same things as what is available (and not yet available) in qiita-analysis, and what could be done more efficiently using snakemake. Here is a version tailored for users who have access to an HPC running either the Torque or Slurm scheduler, and where a qiime2 conda environment is installed.
It allows the user to pass a folder of .tsv or .biom files corresponding to feature-by-sample microbiome tables, along with another folder containing metadata for these samples, and to run a series of analyses automatically:
- features filtering,
- fetching of the Web of Life phylogeny sub-tree (for features generated using the OGUs pipeline),
- alpha diversity analysis
- beta diversity analysis
- pcoa
- tsne
- umap
- robust-pca
- permanova
- procrustes
- mantel
- R's adonis
- nestedness (incl. NODF calculation)
- dissimilarity-overlap curves
- Differential abundance analysis using [songbird](https://github.com/biocore/songbird)
- Co-occurrence estimation between datasets using [mmvec](https://github.com/biocore/mmvec)
- sourcetracking
- ... (more to come)
pip install --upgrade git+https://github.com/FranckLejzerowicz/microbiome_analyzer.git
*Note that python and pip should be the python3 versions*
- Please install (preferentially in a conda env):
conda install -c conda-forge scikit-bio==0.6.2
pip install numpy==2.1.2
pip install biom-format==2.1.16
pip install scikit-learn==1.5.2
- Xhpc: allows automatic preparation of HPC scripts from the basic qiime2 bash scripts written here. For Xhpc to work, the user must edit the config.txt file of this tool (simply adding the email address used for job completion notifications, as explained here).
The input is strict:
- path to a folder containing at least two sub-folders named data and metadata (option -i):
  - the data folder must contain one subfolder per dataset, named after the dataset as it will appear in all outputs and configs:
    - inside each subfolder, a table named data.tsv must be tab-separated, with samples as columns and features as rows.
  - the metadata folder must also contain one subfolder per dataset (named as above):
    - inside each subfolder, a table named metadata.tsv: samples are rows; variables are columns; the first column is "sample_name".
- the dataset name(s) that appear in these folders must be given to option -d in order to match the features/metadata tables, e.g.:

workfolder
├── metadata
│   ├── dataset_name_1
│   │   └── metadata.tsv
│   └── dataset_name_2
│       └── metadata.tsv
└── data
    ├── dataset_name_1
    │   └── data.tsv
    └── dataset_name_2
        └── data.tsv

In this case, the matching internal names are dataset_name_1 and dataset_name_2. Any data.tsv file that does not have a matching metadata.tsv path will be ignored.

The analysis is performed as follows:
- If both datasets are to be processed:
microbiome_analyzer run \
-i /path/to/workfolder \
-d dataset_name_1 -d dataset_name_2 \
-n jobs_name -e qiime2-2024.5

- If only the dataset_name_2 dataset is to be processed:

microbiome_analyzer run \
-i /path/to/workfolder \
-d dataset_name_2 \
-n jobs_name -e qiime2-2019.10
In fact, the tool simply generates script files that need to be started manually, and whose outputs should be scrutinized manually (highly recommended). This is just a way to help you obtain the standard qiime2 command lines pre-written for Torque/Slurm and ready to be run on an HPC!
After running this command (you can try):
microbiome_analyzer run \
-i /path/to/workfolder \
-d dataset_name_1 \
-n jobs_name -e qiime2-2024.5
Inside the workfolder, you will obtain files in jobs subfolders (scripts to check and run), and, for each analysis, folders named after the <analysis_name> (these obviously are the locations for outputs).
.
├── metadata
│   ├── dataset_name_1
│   │   └── metadata.tsv
│   └── dataset_name_2
│       └── metadata.tsv
├── data
│    ├── dataset_name_1
│    │   └── data.tsv
│    └── dataset_name_2
│        └── data.tsv
├── alpha
│ └── dataset_name_1
├── alpha_correlations
│ └── dataset_name_1
├── beta
│ └── dataset_name_1
├── emperor
│ └── dataset_name_1
├── pcoa
│ └── dataset_name_1
├── ...
Here's the stdout for the simple command above:
# import
sh /path/to/workfolder/import/run.sh
# taxonomy
sh /path/to/workfolder/taxonomy/run.sh
These prints are jobs to run, i.e., these sh commands only need to be copy-pasted in the HPC terminal to actually run the .pbs (for Torque) or .slm (for Slurm, use option --slurm) scripts within. For example, the first .sh file contains:
$ cat /path/to/workfolder/<analysis_name>/run.sh
mkdir -p /path/to/workfolder/<analysis_name>/jobs
cd /path/to/workfolder/<analysis_name>/jobs
sbatch /path/to/workfolder/<analysis_name>/jobs/run_0.pbs
sbatch /path/to/workfolder/<analysis_name>/jobs/run_1.pbs
sbatch /path/to/workfolder/<analysis_name>/jobs/run_2.pbs
...
- Note: If you were to prepare scripts for 5 datasets:

microbiome_analyzer run \
-i ./microbiome_analyzer/tests/files \
-d dataset_name_1 \
-d dataset_name_2 \
-d dataset_name_3 \
-d dataset_name_4 \
-d dataset_name_5 \
-n jobs_name -e qiime2-2024.5

Then this first .sh file would contain:

$ cat /path/to/workfolder/import/run.sh
mkdir -p /path/to/workfolder/import/jobs
cd /path/to/workfolder/import/jobs
sbatch /path/to/workfolder/import/jobs/run_dataset_name_1.pbs
sbatch /path/to/workfolder/import/jobs/run_dataset_name_2.pbs
sbatch /path/to/workfolder/import/jobs/run_dataset_name_3.pbs
sbatch /path/to/workfolder/import/jobs/run_dataset_name_4.pbs
sbatch /path/to/workfolder/import/jobs/run_dataset_name_5.pbs
- Trick here: using the option -x <int> (or --chunks <int>) groups the commands into fewer jobs. This can be useful if you have, say, 50 datasets across whose distance matrices you plan on running mantel tests: this would probably send too many jobs to the scheduler. Let's see what it does for our 5 datasets:

microbiome_analyzer run \
-i /path/to/workfolder \
-d dataset_name_1 \
-d dataset_name_2 \
-d dataset_name_3 \
-d dataset_name_4 \
-d dataset_name_5 \
-n jobs_name -e qiime2-2021.11 -x 3

The .sh file would now contain only three jobs:

$ cat /path/to/workfolder/import/jbs_nm.sh
mkdir -p /path/to/workfolder/import/jobs
cd /path/to/workfolder/import/jobs
sbatch /path/to/workfolder/import/jobs/run_0.pbs
sbatch /path/to/workfolder/import/jobs/run_1.pbs
sbatch /path/to/workfolder/import/jobs/run_2.pbs
Only import and taxonomy are showing up because the former needs to be run first, before most other routine qiime2 analyses, whereas the latter can already be run by generating a fasta file from the features before running the taxonomic assignment.
Note that this feature-to-fasta encoding for taxonomy only works for features encoded as sequences or corresponding to Web of Life genomes ("G#########" format). If the features are taxa or OTU names, the taxonomy will simply be each taxon or OTU name: in this case, please edit the taxonomy file manually (column "Taxon").
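If you prefer to script that edit rather than do it by hand, a minimal sketch could look like this (Python/pandas; the file path and the lineage mapping are hypothetical, assuming the standard two-column "Feature ID"/"Taxon" taxonomy layout):

import pandas as pd

# hypothetical path to the taxonomy table produced by the taxonomy step
tax_file = "/path/to/workfolder/taxonomy/dataset_name_1/taxonomy.tsv"
tax = pd.read_csv(tax_file, sep="\t", dtype=str)

# hypothetical mapping from OTU names to proper lineage strings
lineage = {"OTU_1": "k__Bacteria; p__Firmicutes",
           "OTU_2": "k__Bacteria; p__Bacteroidetes"}
tax["Taxon"] = tax["Feature ID"].map(lineage).fillna(tax["Taxon"])
tax.to_csv(tax_file, sep="\t", index=False)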
After running these two jobs, if you re-run the same, simple command above, you get:
# alpha
sh /path/to/workfolder/alpha/run.sh
# tabulate
sh /path/to/workfolder/tabulate/run.sh
# barplot
sh /path/to/workfolder/barplot/run.sh
# beta
sh /path/to/workfolder/beta/run.sh
That's more work ready to start, because now the data table was imported to qiime2:
WARNING: It is strongly recommended to check the jobs scripts first!
Whatever analyses you will run, microbiome_analyzer can run them on different versions of each dataset, which are all defined using yaml files passed on the command line, including:

The yaml file tells, for each dataset, which samples to remove (names), and which thresholds to use to remove rare features (features) and poorly sequenced samples (samples), e.g.:
dataset_name_1: # dataset name
  names:
    - "samplename1"
    - "samplename2"
  samples: 100
  features: 0.0001
which is interpreted as a dictionary:
{"dataset_name_1": {"names": ["samplename1", "samplename2"],
  "features": 0.0001, "samples": 100}}
The filtering proceeds in this order:
- first, remove the samples listed under names
- second, remove the samples that do not have enough reads
- third, remove the features (from each sample) that do not have enough reads
For the thresholds, if the number is above 1, then the filtering is based on absolute read counts, but if it is between 0 and 1, then the filtering is based on percentages. In effect:
- if samples: 100: all samples must have 100 reads or more
- if samples: 0.25: all samples must have 25% of the average number of reads per sample
- if features: 1000: all features with less than 1000 reads in a sample will be removed from that sample
- if features: 0.001: all features with less than 0.1% of the reads of a sample will be removed from that sample
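To make the order and the threshold semantics concrete, here is a minimal sketch of that logic (Python/pandas; the table orientation follows the data.tsv convention of features as rows and samples as columns; the function is illustrative, not the tool's internal code):

import pandas as pd

def filter_table(table: pd.DataFrame, names, samples, features) -> pd.DataFrame:
    # table: features as rows, samples as columns (as in data.tsv)
    # 1) remove the explicitly listed samples
    table = table.drop(columns=[s for s in names if s in table.columns])
    # 2) remove samples that do not have enough reads
    reads = table.sum(axis=0)
    min_reads = samples if samples > 1 else samples * reads.mean()
    table = table.loc[:, reads >= min_reads]
    # 3) zero out, in each sample, the features that do not have enough reads there
    if features > 1:
        table = table.where(table >= features, 0)
    else:
        table = table.where(table.ge(table.sum(axis=0) * features, axis=1), 0)
    # drop features that are now absent from every sample
    return table.loc[table.sum(axis=1) > 0]

# e.g., for the yaml example above:
# filtered = filter_table(table, ["samplename1", "samplename2"], samples=100, features=0.0001)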
Note that --raref MUST be set for the rarefaction to happen. If no yaml file is given using option -r, then the rarefaction depth defaults to the second percentile of the distribution of the number of reads per sample.
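In other words, the default depth is computed from the per-sample read counts, roughly as in this sketch (Python/numpy; illustrative only, with made-up counts):

import numpy as np

# total read counts per sample (e.g., the column sums of data.tsv)
reads_per_sample = np.array([1200, 4500, 800, 15000, 2300])
default_depth = int(np.percentile(reads_per_sample, 2))  # second percentile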
If the following yaml file is given to option -r:
dataset_name_1:
- "100"
- "200"
dataset_name_2:
- min
There will be two rarefaction depths for dataset_name_1, while dataset_name_2 will be rarefied to the minimum of the distribution of the number of reads per sample.
The yaml file gives, for each dataset, the lists of names or regexes used to find the features for which to make different subsets.
dataset_name_1:
  OGUs_selection:
    - "G000146185"
    - "G000153885"
    - "G000153905"
  OGUs_regex:
    - "G000[12]0000[89]"
dataset_name_5:
  Milk_dataset:
    - "Milk"
This will create:
- two subsets for dataset_name_1:
  - subset OGUs_selection will contain the three features given exactly
  - subset OGUs_regex will contain at most the four features matching the regex (i.e., G000100008, G000100009, G000200008 or G000200009)
- one subset for dataset_name_5:
  - subset Milk_dataset will contain all features containing the word "Milk".
The yaml file gives, for each dataset, a name for each taxonomic level as key, and as value, either:
- the identifier in the taxonomy for this level (e.g. "p_Firmicutes" will be a phylum and hence "p").
- the index of the column to collapse to in the taxonomic table (1 would be for the very first, coarsest taxonomic level).
(A sketch of what collapsing to a column index amounts to follows the example below.)
dataset_name_1:
  phylum: "p"
  family: "f"
  genus: "g"
dataset_name_5:
  level1: 1
  level2: 2
  level4: 4
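As announced above, collapsing to a column index essentially groups features by their first n taxonomic ranks and sums them, roughly as in this sketch (Python/pandas; illustrative only, assuming semicolon-separated lineage strings):

import pandas as pd

def collapse(table: pd.DataFrame, taxonomy: pd.Series, level: int) -> pd.DataFrame:
    # table: features (rows) x samples (columns); taxonomy: feature ID -> "k__...;p__...;..." lineage
    ranks = taxonomy.str.split(";").str[:level].str.join(";")
    return table.groupby(table.index.map(ranks)).sum()

# e.g., collapse(table, taxonomy, 2) groups features by their first two taxonomic ranks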
The yaml config file must have the following format:
sex:
- - Male
- - Female
timepoint_months:
- - '9'
  - '24'
- - '24'
  - '36'
income:
- - '<15000'
- - '>15000'
which is interpreted as a dictionary that, for each metadata variable, lists one or more group(s) of factor(s), each defining a subset:
{'sex': [['Male'], ['Female']],
'timepoint_months': [['9', '24'], ['24', '36']],
'income': [['<15000'], ['>15000']]}
In this example, there will be one subset for:
- samples having Male in column sex
- samples having Female in column sex
- samples having 9 or 24 in column timepoint_months
- samples having 24 or 36 in column timepoint_months
- samples having a value below 15000 in column income
- samples having a value above 15000 in column income
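The per-variable selection is thus either a membership test or a numeric comparison when the factor starts with '<' or '>', roughly as sketched below (Python/pandas; illustrative only, not the tool's code):

import pandas as pd

def select_samples(meta: pd.DataFrame, variable: str, factors: list) -> pd.DataFrame:
    # keep the samples whose value for `variable` matches one group of factors
    first = str(factors[0])
    if len(factors) == 1 and first[:1] in ("<", ">"):
        values = pd.to_numeric(meta[variable], errors="coerce")
        keep = values < float(first[1:]) if first[0] == "<" else values > float(first[1:])
    else:
        keep = meta[variable].astype(str).isin([str(f) for f in factors])
    return meta[keep]

# e.g., select_samples(meta, "timepoint_months", ["9", "24"])  -> samples at 9 or 24 months
#       select_samples(meta, "income", ["<15000"])             -> low-income samples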
Note that each subset will be made on each combination of the above, i.e., for each filtering version, each feature subset and each taxonomic level.
A range of analyses are prepared systematically, including:
- Phylogenetic sequence placement using SEPP (for 16S data, which is detected)
- Alpha diversity: by default, the metrics defined in the resource file located in ./microbiome_analyzer/resources/alpha_metrics.txt are used, i.e.: pielou_e, shannon, faith_pd and observed_otus (or observed_features in the latest qiime2 versions).
- Alpha correlations
- Merging of all alpha metrics (i.e., all dataset versions) to the metadata
- Alpha rarefaction
- Barplot:
- Beta diversity:
- DEICODE is a robust PCA method.
- Distance matrices generation: by default, the metrics defined in the resource file located in ./microbiome_analyzer/resources/beta_metrics.txt are used, i.e.: jaccard, braycurtis, aitchison, unweighted_unifrac and weighted_unifrac.
- Dimensionality reduction:
- Visualization:
- Volatility: will output in ./qiime/volatility and must be triggered by using the option -l, which should be given a continuous (or numeric ordinal) metadata column (usually the time point).
- Procrustes: will output in ./qiime/procrustes under a folder defined for each pair of datasets
- Mantel test: will output in ./qiime/mantel under a folder defined for each pair of datasets
These two-dataset comparison methods use the same subsetting mechanism, and run on the samples in common between the dataset pairs defined in a yaml file, e.g.:
pairs:
  dataset_name_1_5: # invented name
    - "dataset_name_1"
    - "dataset_name_5"
  my_datasets_pair: # invented name
    - "dataset_name_3"
    - "dataset_name_4"
Note that as above, the tests are also computed for sample subsets applied to these pairs.
- PERMANOVA and PERMDISP: It is possible to run PERMANOVA for a series of user-defined subsets of the data and to test differences between the groups of each subset automatically.
For example, using -t sex -t age_cat will result in one test for each factor of the column sex, as well as one test for each factor of the column age_cat. As this example shows, -t can be used multiple times, once per group.
- PHATE is a dimensionality reduction method that captures both local and global structures, and will output in ./qiime/phate
- Adonis: will output in ./qiime/adonis

It is possible to run R's Adonis in Qiime2 for a series of user-defined formulas, to test differences as in PERMANOVA for multivariate data but with continuous and multiple metadata variables as regressors (see here). The passed models will be run on each of the subsets defined in the file passed to option -g, as above.
A config file must be provided in the following .yml format:
models:
  dataset_name_1:
    timepoint_months_PLUS_income: "timepoint_months+income"
    timepoint_months_INTER_income: "timepoint_months*income"
    timepoint_months_PLUS_income_INTER_sex: "timepoint_months+income*sex"
    sex: "sex"
    timepoint_months: "timepoint_months"
  dataset_name_5:
    timepoint_months: "timepoint_months"
    timepoint_months_PLUS_income: "timepoint_months+income"
strata:
  global: "sex"
  dataset_name_1:
    timepoint_months_PLUS_income:
      - "sex"
      - "group"
      - "notacolumn"
    timepoint_months_INTER_income:
      - "sex"
    timepoint_months_PLUS_income_INTER_sex:
      - "sex"
  vioscreen_foods_consumed_grams_per_day_1800s_noLiquids:
    age_PLUS_height_cm:
      - "sex"
In this example, there will be one model for each formula (and for each distance matrix), which in R, would correspond to these commands:
adonis(<DM> ~ timepoint_months + income, <metadata-file>)
adonis(<DM> ~ timepoint_months * income, <metadata-file>)
adonis(<DM> ~ timepoint_months + income * sex, <metadata-file>)
These two methods can "communicate" using several yaml files with different options, in order to:
- [option -mm] correlate the songbird differentials with the PCs of the mmvec embeddings.
- [option -hlg] highlight specific groups of features in interactive co-occurrence heatmaps using Xmmvec.
Note that both tools can use a custom training set defined based on a metadata column, using the option -tt. This option takes a yaml file (again!) which is fairly simple. It tells which column name to create for each dataset, based on randomly selecting a given proportion of each factor of one or more other columns, e.g.:
datasets:
  dataset_name_1: # a dataset name
    traintest_sex: # a column to create in the metadata of this dataset
      - "sex" # which column(s) to use for (balanced) random sampling
  dataset_name_5: # another dataset name
    traintest_sex:
      - "sex"
train: 0.7 # proportion to randomly sample for training
These columns (here traintest_sex) can now be used in the parameters of songbird and mmvec!
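For intuition, the created column boils down to a per-group random split, roughly as in this sketch (Python/pandas; the metadata path and random seed are assumptions, the 'Train'/'Test' labels follow the songbird convention mentioned below, and the code is illustrative, not the tool's implementation):

import numpy as np
import pandas as pd

# hypothetical metadata file for dataset_name_1
meta = pd.read_csv("metadata/dataset_name_1/metadata.tsv", sep="\t")

rng = np.random.default_rng(42)
meta["traintest_sex"] = "Test"
for _, group in meta.groupby("sex"):
    # randomly pick 70% of each 'sex' factor for training (balanced sampling)
    picked = rng.choice(group.index, size=int(round(0.7 * len(group))), replace=False)
    meta.loc[picked, "traintest_sex"] = "Train"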
This tool runs songbird, a QIIME2 plugin for differential abundance measurement. Songbird takes a number of parameters as input and is often used to run several models for several datasets. This tool eases the process by reading a single configuration file defining all the datasets to use, as well as all the filtering and sample subsets to make per dataset, and the parameters.
This config file must be provided in the following .yml format:
models:
  dataset_name_1:
    timeINTERsexPLUSincome: "sex+income*timepoint_months"
    sexPLUSincome: "sex+income"
  dataset_name_2:
    sexPLUSincome: "sex+income"
baselines:
  dataset_name_1:
    sexPLUSincome:
      sex: "sex"
      income: "income"
subsets:
  sex:
    - - Female
    - - Male
filtering:
  dataset_name_1:
    0.1-0.0001:
      - '0.1'
      - '0.0001'
params:
  batches:
    - 20
    - 40
  learns:
    - 1e-4
  epochs:
    - 1000
    - 2000
  thresh_feats:
    - 10
  thresh_samples:
    - 1000
  diff_priors:
    - 0.5
  train:
    - '0.7'
    - 'traintest_sex'
The allowed "sections" are models
, baselines
, filtering
, subsets
and
params
:
- models: for each dataset, one (or more) model name(s) and associated model formula (which can accommodate categorical variables in the formulation, see here).
  - In the above example, both dataset_name_1 and dataset_name_2 will test for the model named sexPLUSincome by the user, which will actually use the formula sex+income.
- baselines: for each dataset and for each model name defined in models, one (or more) model name(s) and associated model formula (as in models), but here to be run as baseline, i.e., for comparison. For example, with a config having the following:

baselines:
  dataset_name_2:
    sexPLUSincome:
      sex: "sex"
      income: "income"

the model sexPLUSincome (which formula was sex+income) will be compared for dataset_name_2 with both the results of model sex (which formula is simply sex) and income (which formula is simply income). Note that by default, the baseline is 1 and thus the "section" baselines can be missing. It is important to know that Pseudo Q2 values are only reliable to assess models compared against the same baseline (this tool will reuse the same baseline model result when it is assessed against multiple times, hence saving lots of computation time!).
- filtering: for each dataset, one (or more) filtering name(s), and two filtering values:
  - first: the sample prevalence threshold
  - second: the sample abundance threshold

In both cases, a value between 0 and 1 will be interpreted as a fraction, e.g. 0.4 would mean min 40% (of the samples, or of the reads per sample), while a value of 1 or more will be interpreted as an absolute number, e.g. 10 would mean min 10 (sample occurrences, or reads per sample). In the above example, only dataset_name_1 will be filtered (dataset_name_2 will be used raw, or filtered using the songbird params, see below), to keep only the features that have at least 0.01% of the reads of each sample (0.0001), in at least 10% of the samples (0.1):

dataset_name_1:
  0.1-0.0001:
    - '0.1'
    - '0.0001'

(For the name, I recommend using the filtering values linked by an underscore.)
- subsets: the subsets are applied to all datasets (i.e., no sub-header per dataset), e.g.:

subsets:
  sex:
    - - Female
    - - Male
  age_cat:
    - - 'baby'
      - 'teen'
    - - '30s'
      - '40s'
      - '50s'
  income:
    - - '<15000'
    - - '>15000'

which is interpreted as 6 different subsets:
  - Females only (value of sex is in ['Female'])
  - Males only (value of sex is in ['Male'])
  - Young people only (value of age_cat is in ['baby', 'teen'])
  - Older people only (value of age_cat is in ['30s', '40s', '50s'])
  - Poor people only (value of income is below 15,000)
  - Rich people only (value of income is above 15,000)

The outputs will have one folder per subset, which will be named sex_Female for the first subset, etc.
- params: just like for "section" subsets, the parameters are applied to all datasets (i.e., no sub-header per dataset), and the accepted parameters are:
  - train: can be an existing metadata variable to pass to --p-training-column (containing only Train and Test factors), or a number between 0 and 1 to specify the fraction of samples to randomly pick for training (default is 0.7, or 70%).
  - batches: --p-batch-size
  - learns: --p-learning-rate
  - epochs: --p-epochs
  - diff_priors: --differential-prior
  - thresh_feats: --min-feature-count
  - thresh_samples: --min-sample-count
  - summary_interval: --p-summary-interval
This "section" is where most of the combinatorial ability of this tool can be leveraged, ass you can pass more than one value per parameters: each combination of all parameters will be run, e.g.:
params:
  batches:
    - 20
    - 40
  train:
    - 0.6
    - 0.7
will run 4 combinations of parameters.
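A quick way to see which runs that implies is to enumerate the Cartesian product of the parameter lists (Python; purely illustrative):

from itertools import product

params = {"batches": [20, 40], "train": [0.6, 0.7]}
for combo in product(*params.values()):
    print(dict(zip(params, combo)))
# {'batches': 20, 'train': 0.6}
# {'batches': 20, 'train': 0.7}
# {'batches': 40, 'train': 0.6}
# {'batches': 40, 'train': 0.7}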
Note: if you re-run microbiome_analyzer with any config file, it will parse all the outputs from all configs and summarize all model performances (i.e., the Pseudo Q2 values) into one main table located in the qiime/songbird output folder. This table is called songbird_q2.tsv and contains the following columns:
- pair: currently only contains unpaired (paired datasets in dev...)
- dataset: dataset name
- filter: filtering name (not the filtering values, so be explicit!)
- subset: samples subset (e.g. sex_Female from the explanation above)
- model: model name (not the model formula, so be explicit!)
- songbird_filter: filtering in songbird (f#_s# for feature and sample)
- parameters: concatenation of batchsize, learnrate, epochs, diffprior, traintest and summary_interval (underscore-separated)
- baseline: baseline name of the model (not the model formula, so be explicit!)
- differentials: file path to the feature differentials (main output)
- Pseudo_Q_squared: performance value after cross-validation
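Since all runs end up in this one table, inspecting model performances can be as simple as sorting it, keeping in mind that Pseudo Q2 values are only comparable between models sharing the same baseline (Python/pandas; the relative path is an assumption based on the output folder described above):

import pandas as pd

q2 = pd.read_csv("qiime/songbird/songbird_q2.tsv", sep="\t")
best = q2.sort_values("Pseudo_Q_squared", ascending=False)
print(best[["dataset", "model", "parameters", "Pseudo_Q_squared"]].head(10))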
Note 2: The output folders will contain a readme.txt file explaining the folder name, which can be any of the "sections" setups defined in the config file.
It is possible to run Jamie Morton's MMVEC in Qiime2 for a series of user-defined thresholds used to filter the multiple omics datasets for which to predict co-occurrences (see the mmvec help page). It is also possible to map the previously computed Songbird differential ranks onto the mmvec results.
A config file must be provided in the following .yml format:
pairs:
  2_3:
    - dataset_name_2
    - dataset_name_3*
  2_4:
    - dataset_name_2
    - dataset_name_4
  3_4:
    - dataset_name_3*
    - dataset_name_4
filtering:
  prevalence:
    - 0
    - 10
  abundance:
    - - 0
      - 0
    - - 1
      - 3
params:
  train_column:
    - 'None'
  n_examples:
    - 10
  batches:
    - 2
  learns:
    - 1e-4
  epochs:
    - 5000
  priors:
    - 0.1
    - 1
  thresh_feats:
    - 0
  latent_dims:
    - 3
which is interpreted as a dictionary with the following "sections": pairs, filtering and params:
- pairs: for each named pair of datasets, the list of the two datasets.
- filtering: for both a prevalence and an abundance filter, the threshold values.
- params: parameters to mmvec (see doc)
Use a * character after the dataset name to indicate that it is a metabolomics dataset.
Usage: microbiome_analyzer [OPTIONS] COMMAND [ARGS]...
microbiome_analyzer commands
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
config Write template config files for different analyses.
run Write jobs for your pipeline configuration.
(analyzer)
Usage: microbiome_analyzer config [OPTIONS]
Write template config files for different analyses.
Options:
-i, --analysis-folder TEXT Path to the folder containing the data and
metadata sub-folders [required]
-o, --configs-folder TEXT Path to the folder to contain the config
files [required]
-d, --datasets TEXT Dataset(s) identifier(s). Multiple is
possible: e.g. -d dataset_name_1 and -d
dataset_name_2 for
'tab_dataset_name_1.tsv' and
tab_dataset_name_2.tsv' [required]
--filter / --no-filter Prepare a template configuration file for
`--filter`
--rarefy / --no-rarefy Prepare a template configuration file for
`--rarefy`
--feature-subsets / --no-feature-subsets
Prepare a template configuration file for
`--feature-subsets`
--sample-subsets / --no-sample-subsets
Prepare a template configuration file for
`--sample-subsets`
--adonis / --no-adonis Prepare a template configuration file for
`--adonis`
--nestedness / --no-nestedness Prepare a template configuration file for
`--nestedness`
--procrustes / --no-procrustes Prepare a template configuration file for
`--procrustes`
--mantel / --no-mantel Prepare a template configuration file for
`--mantel`
--dm-decay / --no-dm-decay Prepare a template configuration file for
`--dm-decay`
--geo-decay / --no-geo-decay Prepare a template configuration file for
`--geo-decay`
--collapse / --no-collapse Prepare a template configuration file for
`--collapse`
--train-test / --notrain-test- Prepare a template configuration file for
`--train-test`
--phate / --no-phate Prepare a template configuration file for
`--phate`
--sourcetracking / --no-sourcetracking
Prepare a template configuration file for
`--sourcetracking`
--doc / --no-doc Prepare a template configuration file for
`--doc`
--xmmvec / --no-xmmvec Prepare a template configuration file for
`--xmmvec`
--mmvec-highlights / --no-mmvec-highlights
Prepare a template configuration file for
`--mmvec-highlights`
--mmvec-pairs / --no-mmvec-pairs
Prepare a template configuration file for
`--mmvec-pairs`
--diff-models / --no-diff-models
Prepare a template configuration file for
`--diff-models`
--filt3d / --no-filt3d Prepare a template configuration file for
`--filt3d`
--version Show the version and exit.
--help Show this message and exit.
Usage: microbiome_analyzer run [OPTIONS]
Write jobs for your pipeline configuration.
Options:
-i, --analysis-folder TEXT Path to the folder containing the data and
metadata sub-folders [required]
-d, --datasets TEXT Dataset(s) identifier(s). Multiple is
possible: e.g. -d dataset_name_1 and -d
dataset_name_2 for
'tab_dataset_name_1.tsv' and
tab_dataset_name_2.tsv' [required]
-n, --project-name TEXT Nick name for your project [required]
-f, --filter TEXT Samples to remove, min. read abundance and
feature prevalence (>1 = based on absolute
reads, 0-1 = based on relative reads). (yaml
file) [default: False]
-r, --rarefactions TEXT rarefaction depth per dataset (yaml file)
-e, --qiime2-env TEXT name of your qiime2 conda environment (e.g.
qiime2-2021.11) [required]
-v, --sepp-tree TEXT Qiime2 SEPP reference database to use for
16S reads placement:
https://docs.qiime2.org/2019.10/data-
resources/#sepp-reference-databases (auto
detection of datasets' tables with sequences
as features)
-w, --wol-tree TEXT path to the tree containing the genome IDs
(will check if exist in features names) (On
barnacle, it is there: /projects/wol/profili
ng/dbs/wol/phylogeny/tree.nwk) [default:
resources/wol_tree.nwk]
-q, --qemistree TEXT Path to a folder containing Qemistree's
feature data (named 'feature-
data_<dataset_identifier>.qza'), and tree
for each metabolomics dataset (named
'qemistree_<dataset_identifier>.qza')
-z, --classifier TEXT Qiime2 reference taxonomic classifier
database to use for 16Sreads assignment:
https://docs.qiime2.org/2020.2/data-
resources/#taxonomy-classifiers-for-use-
with-q2-feature-classifier
-u, --run-params TEXT server run parameters
-k, --feature-subsets TEXT Regex to use for subsetting features (yml
file)
-g, --sample-subsets TEXT Subsets for DMs, PCoAs, PERMANOVAs, etc (yml
file) [default: False]
-t, --test TEXT Groups to tests between in each PERMANOVA
subset (multiple values possible, e.g. '-d
sex -d age_cat') [default: False]
-a, --adonis TEXT Formula for Adonis tests for each PERMANOVA
subset (yaml file) [default: False]
-nstd, --nestedness TEXT Nestedness analysis config (yml file)
[default: False]
-bt, --beta-type [permanova|anosim|permdisp]
Type of beta group significance
-prc, --procrustes TEXT Pairs and subsets for procrustes/protests
(yaml file) [default: False]
-mtl, --mantel TEXT Pairs and subsets for mantel test (yaml
file) [default: False]
-ddecay, --dm-decay TEXT Parameters for (not geographic) distance
decay analysis (yaml file) [default: False]
-gdecay, --geo-decay TEXT Parameters for geographic distance decay
analysis (yml file) [default: False]
-c, --collapse TEXT Nominative or rank-based taxonmic collapse
per dataset (yaml file) [default: False]
-tt, --train-test TEXT Train test split per dataset (yaml file)
[default: False]
-phate, --phate TEXT Filters, subsets, parameters and
stratifications for the PHATE latent space
analysis (yaml file) [default: False]
-st, --sourcetracking TEXT Filters, subsets, parameters and
sink/sources for sourcetracking (yaml file)
[default: False]
-doc, --doc TEXT Filters and subsets for the dissimilarity
overlap curves analyses (yaml file)
[default: False]
-s, --diff-models TEXT Formulas for multinomial regression-based
differential abundance ranking (songbird)
(yaml file) [default: False]
-m, --mmvec-pairs TEXT Pairs of datasets for which to compute co-
occurrences probabilities (mmvec) (yaml
file) [default: False]
-hlg, --mmvec-highlights TEXT Features to highlights on mmvec biplot (per
dataset) (yaml file) [default: False]
-mm, --xmmvec TEXT Config for Xmmvec (yaml file) [default:
False]
-lon, --longi-column TEXT If data is longitudinal; provide the time
metadata columnfor volatility analysis
[default: False]
-chmod, --chmod TEXT Change output files permission (default =
664 [= -rw-rw-r--])
-skip, --skip [taxonomy|barplot|wol|sepp|pies|collapse|feature_subsets|alpha|alpha_merge|alpha_rarefactions|alpha_correlations|alpha_group_significance|volatility|phate|beta|deicode|pcoa|umap|tsne|emperor|empress|biplot|emperor_biplot|empress_biplot|permanova|adonis|doc|procrustes|mantel|nestedness|dm_decay|geo_decay|sourcetracking|doc|mmvec|songbird|mmbird]
Steps to skip (e.g. if already done or not
necessary)
-As, --alphas TEXT Alpha diversity indices to use
-Bs, --betas TEXT Beta diversity metrics to use
-acc, --account TEXT Account name
--biplot / --no-biplot Whether to do the PCoA biplots or not
[default: False]
--force / --no-force Force the re-writing of scripts for all
commands(default is to not re-run if output
file exists) [default: False]
--gpu / --no-gpu Use GPUs instead of CPUs for MMVEC
[default: False]
--rarefy / --no-rarefy Whether to rarefy and only perform the
routine analyses on the rarefied dataset(s)
[default: False]
--filt-only / --no-filt-only Only process the filtered version (and not
also the raw) version of each dataset
[default: False]
-filt3d, --filt3d TEXT Levels for the exploration of filtering.
Must be a yaml file
--jobs / --no-jobs Whether to prepare Torque jobs from scripts
[default: True]
--torque / --no-torque Whether to prepare Torque and not Slurm jobs
[default: False]
-l, --localscratch INTEGER Use localscratch with the provided memory
amount (in GB)
--scratch / --no-scratch Use the scratch folder to move files and
compute [default: False]
--userscratch / --no-userscratch
Use the userscratch folder to move files and
compute [default: False]
--move-back / --no-move-back Do not move back from scratch (makes sense
only for --userscratch) [default: True]
-x, --chunks INTEGER Maximum number of jobs at which extra jobs
will be added in chunks
--version Show the version and exit.
--help Show this message and exit.
Once all the data files are imported, and using the --force option, you can run the following (using the files in the tests/files folder):
microbiome_analyzer run \
-i ./microbiome_analyzer/tests/files \
-e qiime2-2021.11 \
-n test \
-d dataset_name_1 \
-d dataset_name_5 \
-d vioscreen_foods_consumed_grams_per_day_1800s_noLiquids \
-f ./microbiome_analyzer/tests/files/filtering.yml \
-filt3d ./microbiome_analyzer/tests/files/filtering_3d.yml \
-r ./microbiome_analyzer/tests/files/rarefactions.yml \
-k ./microbiome_analyzer/tests/files/clades.yml \
-c ./microbiome_analyzer/tests/files/collapse.yml \
-g ./microbiome_analyzer/tests/files/subsets.yml \
-a ./microbiome_analyzer/tests/files/adonis.yml \
-mtl ./microbiome_analyzer/tests/files/procrustes.yml \
-prc ./microbiome_analyzer/tests/files/procrustes.yml \
-m ./microbiome_analyzer/tests/files/mmvec.yml \
-s ./microbiome_analyzer/tests/files/songbird.yml \
-ddecay ./microbiome_analyzer/tests/files/dm_decay.yml \
-phate ./microbiome_analyzer/tests/files/phate.yml \
-doc ./microbiome_analyzer/tests/files/doc.yml \
-t sex -t group \
-l timepoint_months \
--no-jobs \
--raref \
--biplot \
--force
The standard output shows you the scripts that have been written, i.e.,
pretty much all the analyses available here so far:
# import
sh /path/to/workfolder/import/run.sh
# rarefy
sh /path/to/workfolder/rarefy/run.sh
# taxonomy
sh /path/to/workfolder/taxonomy/run.sh
# import_trees
sh /path/to/workfolder/import_trees/run.sh
# subsets
sh /path/to/workfolder/subsets/run.sh
# alpha
sh /path/to/workfolder/alpha/run.sh
# alpha_correlations
sh /path/to/workfolder/alpha_correlations/run.sh
# tabulate
sh /path/to/workfolder/tabulate/run.sh
# alpha_rarefaction
sh /path/to/workfolder/alpha_rarefaction/run.sh
# volatility
sh /path/to/workfolder/volatility/run.sh
# mmvec_single_imports
sh /path/to/workfolder/mmvec_single_imports/run.sh
# mmvec_paired_imports
sh /path/to/workfolder/mmvec_paired_imports/run.sh
# phate
sh /path/to/workfolder/phate/run.sh
# barplot
sh /path/to/workfolder/barplot/run.sh
# beta
sh /path/to/workfolder/beta/run.sh
# deicode
sh /path/to/workfolder/deicode/run.sh
# pcoa
sh /path/to/workfolder/pcoa/run.sh
# tsne
sh /path/to/workfolder/tsne/run.sh
# umap
sh /path/to/workfolder/umap/run.sh
# permanova
sh /path/to/workfolder/permanova/run.sh
# adonis
sh /path/to/workfolder/adonis/run.sh
# dm_decay
sh /path/to/workfolder/dm_decay/run.sh
# dm_decay_plot
sh /path/to/workfolder/dm_decay_plot/run.sh
# doc_R
sh /path/to/workfolder/doc_R/run.sh
# songbird_imports
sh /path/to/workfolder/songbird_imports/run.sh
# songbird_filter
sh /path/to/workfolder/songbird_filter/run.sh
# songbird_baselines
sh /path/to/workfolder/songbird_baselines/run.sh
# songbird
sh /path/to/workfolder/songbird/run.sh
# qurro
sh /path/to/workfolder/qurro/run.sh
contact [email protected]