MintTea is a method for identifying multi-omic modules of features that are both associated with a disease state and strongly associated with one another across omics. It is based on sparse generalized canonical correlation analysis (sGCCA), where the disease label is encoded as an additional 'dummy' omic, as previously suggested by Gross & Tibshirani (2015) [1], Singh et al. (2019) [2] (see DIABLO), and others.
For further details see: Muller, Efrat, Itamar Shiryan, and Elhanan Borenstein. "Multi-omic integration of microbiome data for identifying disease-associated modules." Nature Communications 15.1 (2024): 2621.
MintTea can be installed directly from GitHub, by running the following:
install.packages("devtools")
library(devtools)
install_github("efratmuller/MintTea")
library(MintTea)
- Open an R script from which the MintTea function will be executed.
- Organize your input data in a single data.frame object, following these guidelines:
  - Rows represent samples and columns represent features.
  - The data frame should include two special columns: a column holding sample identifiers and a column holding study groups ("healthy" and "disease" labels).
  - Feature names from each omic should start with the omic prefix (for example: 'T__' for taxonomy, 'P__' for pathways, 'M__' for metabolites, etc. Note the two consecutive underscores).
  - Features in each view should be pre-processed in advance, according to common practices.
  - It is highly recommended to remove rare features and to cluster highly correlated features.
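The layout described above can be sketched in base R as follows. This is a minimal illustration on made-up data, not part of the MintTea package: the column names `sample_id` and `disease_state`, the feature names, the helper functions `filter_rare` and `add_prefix`, and the 20% prevalence cutoff are all assumptions chosen for the example (check `?MintTea` for the column names the function actually expects). Clustering of highly correlated features is not shown.

```r
# Toy example: 6 samples and three views. All names and values are illustrative.
set.seed(1)
n <- 6
taxa <- matrix(runif(n * 3), nrow = n,
               dimnames = list(NULL, c("Bacteroides", "Prevotella", "Rare_taxon")))
taxa[, "Rare_taxon"] <- c(0, 0, 0, 0, 0, 0.2)  # detected in a single sample
pathways <- matrix(runif(n * 2), nrow = n,
                   dimnames = list(NULL, c("glycolysis", "TCA_cycle")))
metabolites <- matrix(runif(n * 2), nrow = n,
                      dimnames = list(NULL, c("butyrate", "acetate")))

# Drop features detected (non-zero) in fewer than `min_prevalence` of samples.
# The 0.2 cutoff is an arbitrary choice for this sketch.
filter_rare <- function(mat, min_prevalence = 0.2) {
  prevalence <- colMeans(mat > 0)
  mat[, prevalence >= min_prevalence, drop = FALSE]
}

# Prepend the omic prefix (note the two underscores) to every feature name.
add_prefix <- function(mat, prefix) {
  colnames(mat) <- paste0(prefix, "__", colnames(mat))
  mat
}

taxa        <- add_prefix(filter_rare(taxa), "T")
pathways    <- add_prefix(filter_rare(pathways), "P")
metabolites <- add_prefix(filter_rare(metabolites), "M")

# Single data.frame: rows are samples, columns are the two special columns
# followed by all prefixed features.
input_data <- data.frame(
  sample_id     = paste0("S", seq_len(n)),
  disease_state = rep(c("healthy", "disease"), each = n / 2),
  taxa, pathways, metabolites,
  check.names = FALSE
)

str(input_data)
```

Keeping the prefixing in one helper makes it harder to end up with a single underscore by accident, which would prevent features from being assigned to their view.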
- Optionally, edit the default pipeline parameters. MintTea supports running the pipeline with multiple parameter combinations, which encourages sensitivity analysis and lets you check which settings generate the most informative modules. For the full list of MintTea parameters, see ?MintTea.
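When planning a sensitivity analysis, it can help to count how many pipeline settings a set of candidate values implies before running anything. The sketch below uses base R only and does not call MintTea. The parameter names are taken from this README, but the candidate values are made up, and it is an assumption that every combination of the supplied values is run as a separate setting (see `?MintTea` for the accepted formats).

```r
# Hypothetical candidate values for three of the pipeline parameters.
settings <- expand.grid(
  param_diablo_keepX    = c(5, 10),
  param_n_folds         = c(5, 10),
  param_edge_thresholds = c(0.7, 0.8, 0.9)
)

# Number of settings under the full-grid assumption: 2 * 2 * 3.
nrow(settings)
head(settings)
```

Because the number of settings grows multiplicatively, varying one or two parameters at a time keeps runtimes manageable.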
- Pipeline results are returned as a list of multi-view modules, given for each MintTea pipeline setting requested. For each module, the following properties are returned:

  | Module property | Details |
  | --- | --- |
  | module_size | The number of features in this module. |
  | features | 1st principal component (PC) of each module, for each pipeline setting. |
  | module_edges | Edge weights for every pair of features in this module that co-occurred in sGCCA components at least once. Edge weights are calculated as the number of times each pair co-occurred in the same sGCCA component, divided by param_n_repeats * param_n_folds. These weights are given in case the user wants to draw the module as a network. |
  | auroc | AUROC of each module by itself, describing the module's association with the disease. Computed using its first PC and evaluated over repeated cross-validation. Note: It is strongly advised to further evaluate module-disease associations using an independent test set. |
  | shuffled_auroc | As above, but using 99 randomly sampled modules of the same size and same proportions of views. |
  | inter_view_corr | Average correlation between features from different views. |
  | shuffled_inter_view_corr | As above, but using 99 randomly sampled modules of the same size and same proportions of views. |
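The edge-weight definition above (co-occurrence count divided by param_n_repeats * param_n_folds) can be illustrated with a small base R sketch. This is not MintTea's internal code: the feature names and component memberships are invented, and for simplicity it assumes a single sGCCA component per repeat-fold run.

```r
param_n_repeats <- 2
param_n_folds   <- 3
n_runs <- param_n_repeats * param_n_folds

# One character vector per run: the features selected together in a component.
# Names and memberships are made up for illustration.
components <- list(
  c("T__a", "P__x", "M__k"),
  c("T__a", "P__x"),
  c("T__a", "M__k"),
  c("T__a", "P__x", "M__k"),
  c("P__x", "M__k"),
  c("T__a", "P__x")
)
stopifnot(length(components) == n_runs)

# Enumerate every unordered feature pair within each component.
# Sorting first makes "A -- B" and "B -- A" map to the same key.
pairs <- do.call(rbind, lapply(components, function(f) t(combn(sort(f), 2))))
pair_keys <- paste(pairs[, 1], pairs[, 2], sep = " -- ")

# Count co-occurrences per pair and normalise by the number of runs.
counts <- table(pair_keys)
edge_weights <- setNames(as.vector(counts) / n_runs, names(counts))

edge_weights
```

A weight of 1 would mean the pair was selected together in every run; pairs that never co-occur do not appear at all, matching the "at least once" condition.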
- To evaluate the obtained results, we recommend starting by examining the following:
  - For each pipeline setting, how many modules were found, and what are the module sizes (i.e., number of features included)?
  - What was the AUC achieved by each module? (see auroc)
  - How does this AUC compare to the random-modules AUCs? (see shuffled_auroc)
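The comparison between a module's AUC and the AUCs of random modules can be sketched on simulated data as follows. This is a simplified stand-in for what the pipeline reports, under stated assumptions: the data, feature names, and helper functions `auroc` and `module_auroc` are invented; the AUROC is computed in-sample rather than over repeated cross-validation; and random modules are drawn from all features without matching view proportions.

```r
set.seed(42)
n <- 60
label <- rep(c(0, 1), each = n / 2)  # 0 = healthy, 1 = disease

# Toy feature table: 3 disease-shifted features plus 30 noise features.
signal <- sapply(1:3, function(i) rnorm(n, mean = 1.5 * label))
noise  <- matrix(rnorm(n * 30), nrow = n)
features <- cbind(signal, noise)
colnames(features) <- paste0("f", seq_len(ncol(features)))
module <- colnames(features)[1:3]

# AUROC from ranks (equivalent to the normalised Mann-Whitney U statistic).
auroc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Score a feature set by its first PC. The sign of a PC is arbitrary,
# so report the AUROC in whichever direction is at least 0.5.
module_auroc <- function(feature_names) {
  pc1 <- prcomp(features[, feature_names, drop = FALSE], scale. = TRUE)$x[, 1]
  a <- auroc(pc1, label)
  max(a, 1 - a)
}

observed <- module_auroc(module)
shuffled <- replicate(
  99,
  module_auroc(sample(colnames(features), length(module)))
)

# Empirical p-value: share of random modules scoring at least as well.
p_empirical <- (1 + sum(shuffled >= observed)) / (1 + length(shuffled))

summary(shuffled)
c(observed = observed, p_empirical = p_empirical)
```

A module whose AUC sits well inside the distribution of random-module AUCs offers little evidence of a disease association beyond what a random feature set of the same size would give.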
Tips:
- Optimal module sizes depend on the downstream analysis. For manual interpretation, for example, smaller modules may be preferable. If your modules came out too large, consider decreasing param_diablo_keepX, decreasing param_n_folds, or increasing param_edge_thresholds. Conversely, if your modules are too small, consider the opposite.
- If the overall AUC is low, and/or all individual module AUCs are low, consider decreasing param_diablo_design, which effectively assigns higher importance to associations with the disease than to associations between views.
library(MintTea)
data('test_data')
minttea_results <- MintTea(test_data, view_prefixes = c('T', 'P', 'M'))
For questions about the pipeline, please open an issue (https://github.com/efratmuller/MintTea/issues) or contact Prof. Elhanan Borenstein at [email protected].
Backlog:
* Support parallel running to shorten runtimes.
* Generalize to support continuous labels.
[1] Gross, Samuel M., and Robert Tibshirani. "Collaborative regression." Biostatistics 16.2 (2015): 326-338.
[2] Singh, Amrit, et al. "DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays." Bioinformatics 35.17 (2019): 3055-3062.