MLOmics: Machine Learning Cancer Multi-Omics Benchmark with Datasets, Tasks, and Baselines
Machine learning has shown great potential in cancer multi-omics studies, offering new opportunities for advancing precision medicine. However, dataset curation and task formulation pose significant hurdles, especially for researchers without a biomedical background. Here, we introduce MLOmics, the first large-scale cancer multi-omics benchmark that integrates the TCGA platform, making data resources accessible and usable for machine learning researchers without significant preparation or expertise. To date, MLOmics includes a collection of 20 cancer multi-omics datasets covering 32 cancers, accompanied by a systematic data processing pipeline. MLOmics provides well-processed dataset versions to support 20 meaningful tasks in four studies, with a collection of benchmarks. We also integrate MLOmics with two complementary resources and various biological tools to explore broader research avenues. All resources are openly accessible, with user-friendly and compatible integration scripts that enable non-experts to easily incorporate this complementary information for various tasks. We conduct extensive experiments on selected datasets to offer recommendations on suitable machine learning baselines for specific applications. Through MLOmics, we aim to facilitate algorithmic advances and hasten the development, validation, and clinical translation of machine learning models for personalized cancer treatments.
$ git clone https://github.com/chenzRG/Cancer-Multi-Omics-Benchmark
$ cd Cancer-Multi-Omics-Benchmark
# Set up the environment
conda create -n mlomics python=3.9
conda activate mlomics
pip install -r requirements.txt
$ ./download.sh
MLOmics/
├── Main_Dataset # Main Datasets
├── Baseline_and_Metric/ # Baseline & Metrics
│ └── Tasks/
│ ├── Baselines/
│ │ ├── R/ # Traditional ML models (.r files)
│ │ └── Python/ # Deep learning models (.py files)
│ └── Metrics/
│ └── task_metrics.py # Evaluation metrics for each task
├── Dwonstream_Analysis_Tools_and_resources/
│ ├── Knowledge_bases/ # Biological knowledge bases
│ │ ├── STRING_mapping.csv # STRING database mapping
│ │ └── KEGG_mapping.csv # KEGG pathway mapping
│ ├── Clinical_annotation/ # Patient clinical data
│ │ └── clinical_record.csv
│ └── Analysis_tools/ # Analysis scripts
│ └── Analysis_Tools_and_Resources.py
└── Scripts/ # Quick start scripts
├── Tasks/
└── Dwonstream_Analysis
Each dataset is available in three feature versions:
- Original: Full feature set
- Aligned: Intersection of features across cancer types
- Top: Most significant features selected via ANOVA
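As an illustration of the Aligned version, the sketch below intersects per-cancer feature lists. It is a minimal example and not the MLOmics pipeline itself: the function name `aligned_features` and the toy gene lists are hypothetical, and keeping the order of the first dataset is an assumption made for reproducibility.

```python
from typing import Dict, List


def aligned_features(feature_lists: Dict[str, List[str]]) -> List[str]:
    """Return the features shared by every dataset, in the order of the first one."""
    names = list(feature_lists)
    if not names:
        return []
    shared = set(feature_lists[names[0]])
    for name in names[1:]:
        shared &= set(feature_lists[name])
    # Keep the ordering of the first dataset so columns stay reproducible.
    return [f for f in feature_lists[names[0]] if f in shared]


# Toy gene lists (hypothetical, not taken from the MLOmics files).
toy = {
    "ACC": ["TP53", "BRCA1", "EGFR", "MYC"],
    "KIRP": ["EGFR", "TP53", "KRAS"],
    "LIHC": ["TP53", "EGFR", "MYC"],
}
aligned = aligned_features(toy)
print(aligned)  # ['TP53', 'EGFR']
```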
MLOmics provides a standardized interface to run all baseline models:
$ ./<baseline_model>.sh <dataset> <version> [options]
Where:
- <baseline_model>: Name of the model script (e.g., GRAPE.sh, Subtype-GAN.sh)
- <dataset>: Target dataset name (e.g., GS-BRCA, ACC)
- <version>: Feature version name (e.g., Original, Aligned, Top)
- [options]: Optional parameters like missing rate (e.g., 0.3)
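When launching many runs, it can help to assemble these arguments programmatically. The sketch below is a hypothetical Python helper (`build_command` is not part of MLOmics) that follows the argument order shown above. The check that the missing rate lies strictly between 0 and 1 is an assumption, not documented script behavior.

```python
import shlex
from typing import List, Optional

VERSIONS = ("Original", "Aligned", "Top")  # version names listed above


def build_command(model: str, dataset: str, version: str,
                  missing_rate: Optional[float] = None) -> List[str]:
    """Assemble the argument list for ./<baseline_model>.sh <dataset> <version> [options]."""
    if version not in VERSIONS:
        raise ValueError(f"unknown feature version: {version!r}")
    cmd = [f"./{model}.sh", dataset, version]
    if missing_rate is not None:
        # Assumption: the rate is a fraction such as 0.3, not a percentage.
        if not 0.0 < missing_rate < 1.0:
            raise ValueError("missing rate must lie strictly between 0 and 1")
        cmd.append(str(missing_rate))
    return cmd


cmd = build_command("GAIN", "GS-BRCA", "Top", missing_rate=0.3)
print(shlex.join(cmd))  # ./GAIN.sh GS-BRCA Top 0.3
# To actually launch it from the matching Scripts/ subdirectory:
#   subprocess.run(cmd, check=True)
```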
Classification Tasks:
# Run classification with DeepCC model on GS-BRCA Original data
$ cd Scripts/Classification
$ ./DeepCC.sh GS-BRCA Original
Clustering Tasks:
# Run clustering with Subtype-GAN model on ACC Top data
$ cd Scripts/Clustering
$ ./Subtype-GAN.sh ACC Top
Imputation Tasks:
# Run imputation with GAIN model on GS-BRCA Top data with a 30% missing rate
$ cd Scripts/Imputation
$ ./GAIN.sh GS-BRCA Top 0.3
MLOmics provides comprehensive tools for biological interpretation of machine learning results, primarily focused on differential expression analysis and pathway enrichment.
KEGG pathway analysis:
$ cd Scripts/Dwonstream_Analysis
$ ./pwanalysis.sh <clustering_log_path> [options]
--p_value_cutoff 0.05 # Significance threshold for genes
Generate volcano plot:
$ cd Scripts/Dwonstream_Analysis
$ ./volcano.sh <clustering_log_path> [options]
--p_value_threshold 0.05 # P-value significance threshold
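To make the role of the p-value threshold concrete, the sketch below labels genes the way a volcano plot typically colors them. It is an illustration only and does not reproduce `volcano.sh`: the function `volcano_labels`, the log2 fold-change cutoff of 1.0, and the toy values are all assumptions.

```python
import math
from typing import Dict, List, Tuple


def volcano_labels(genes: List[Tuple[str, float, float]],
                   p_value_threshold: float = 0.05,
                   log2fc_threshold: float = 1.0) -> Dict[str, Tuple[float, float, str]]:
    """Map each gene to (x, y, label), where label is 'up', 'down', or 'ns'."""
    out = {}
    for name, log2fc, p in genes:
        y = -math.log10(p)  # volcano plots show -log10(p) on the y-axis
        if p < p_value_threshold and log2fc >= log2fc_threshold:
            label = "up"
        elif p < p_value_threshold and log2fc <= -log2fc_threshold:
            label = "down"
        else:
            label = "ns"  # not significant, or the effect size is too small
        out[name] = (log2fc, y, label)
    return out


# Hypothetical (gene, log2 fold change, p-value) triples.
toy_genes = [("G1", 2.0, 0.001), ("G2", -1.5, 0.01),
             ("G3", 0.2, 0.0001), ("G4", 3.0, 0.2)]
labels = volcano_labels(toy_genes)
print({g: v[2] for g, v in labels.items()})
```

Genes G1 and G2 pass both cutoffs, G3 is significant but has a small effect, and G4 has a large effect but is not significant.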
MLOmics provides a collection of 20 multi-omics datasets including:
- One pan-cancer dataset involving patients with 32 cancer types.
- Nine unlabeled cancer subtype datasets including Adrenocortical Carcinoma (ACC), Kidney Renal Papillary Cell Carcinoma (KIRP), Kidney Renal Clear Cell Carcinoma (KIRC), Liver Hepatocellular Carcinoma (LIHC), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC), Prostate Adenocarcinoma (PRAD), Thyroid Carcinoma (THCA), and Thymoma (THYM).
- Five labeled, golden-standard subtype datasets corresponding to five cancers: Colon Adenocarcinoma (GS-COAD), Breast invasive carcinoma (GS-BRCA), Glioblastoma Multiforme (GS-GBM), Brain Lower Grade Glioma (GS-LGG), and Ovarian Serous Cystadenocarcinoma (GS-OV).
- Five TCGA data imputation datasets containing corrupted omics profiles from the above well-studied cancer types: Imp-COAD, Imp-BRCA, Imp-GBM, Imp-LGG, and Imp-OV.
- Two complementary data resources include a collected corpus from STRING and a collection of Electronic Health Records (EHR) data for cancer samples, accompanied by interactive scripts for integration.
Cancer multi-omics analysis often suffers from an imbalance between sample size and feature size. MLOmics therefore provides three feature scales, i.e., Original, Top, and Aligned, to support feasible analysis.
- Original features are extracted directly from each dataset and correspond to the complete set of features without filtering. Users can customize their datasets.
- Top features are identified through ANOVA statistical testing and ranked by p-value, selecting the most significant features among samples. This approach unifies the feature size and potentially reduces noisy features.
- Aligned features are determined by the intersection of features present across all sub-datasets, corresponding to the shared features among different sub-datasets.
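The Top selection can be sketched as a one-way ANOVA ranking. The example below is a minimal pure-Python illustration, not the MLOmics pipeline: `anova_f` and `top_features` are hypothetical names, the sample grouping used for the test is an assumption (the text above does not state it), and the data are toy values. It ranks by the F statistic, which gives the same order as ranking by p-value when every feature is measured on the same samples and groups.

```python
from typing import Dict, List, Sequence


def anova_f(values: Sequence[float], groups: Sequence[str]) -> float:
    """One-way ANOVA F statistic of one feature across sample groups."""
    by_group: Dict[str, List[float]] = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    n, k = len(values), len(by_group)  # requires k >= 2 and n > k
    grand_mean = sum(values) / n
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                     for vs in by_group.values())
    ss_within = sum((v - sum(vs) / len(vs)) ** 2
                    for vs in by_group.values() for v in vs)
    if ss_within == 0.0:
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_features(matrix: Dict[str, Sequence[float]],
                 groups: Sequence[str], top_k: int) -> List[str]:
    """Return the top_k feature names with the largest F statistic."""
    scores = {name: anova_f(vals, groups) for name, vals in matrix.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data: six samples in two hypothetical groups, three features.
groups = ["A", "A", "A", "B", "B", "B"]
matrix = {
    "g_noise": [1.0, 5.0, 3.0, 1.1, 4.9, 3.1],
    "g_mid": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0],
    "g_sep": [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],
}
selected = top_features(matrix, groups, top_k=2)
print(selected)  # ['g_sep', 'g_mid']
```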
MLOmics provides a collection of 20 multi-omics datasets covering 32 cancer types, that is, one Pan-cancer dataset, nine unlabeled cancer subtype datasets, five labeled, golden-standard subtype datasets, and five TCGA data imputation datasets.
Dataset | Feature Scale | mRNA | miRNA | Methy | CNV | Sample Size | #Baselines | Learning Task |
---|---|---|---|---|---|---|---|---|
ACC | Original | 18034 | 368 | 19045 | 19525 | 177 | 10 | Clustering |
KIRP | Original | 18465 | 769 | 18715 | 19551 | 273 | 10 | Clustering |
KIRC | Original | 18464 | 352 | 19045 | 19523 | 314 | 10 | Clustering |
LIHC | Original | 17946 | 846 | 18714 | 19551 | 364 | 10 | Clustering |
LUAD | Original | 18310 | 427 | 19052 | 19551 | 450 | 10 | Clustering |
LUSC | Original | 18206 | 423 | 19060 | 19551 | 363 | 10 | Clustering |
PRAD | Original | 17954 | 759 | 19049 | 19568 | 450 | 10 | Clustering |
THCA | Original | 17261 | 375 | 19052 | 19551 | 291 | 10 | Clustering |
THYM | Original | 18354 | 1018 | 18716 | 19551 | 119 | 10 | Clustering |
Pan-cancer | Aligned | 3217 | 383 | 3139 | 3105 | 8314 | 10 | Classification |
GS-COAD | Original | 17261 | 375 | 19052 | 19551 | 260 | 10 | Classification |
GS-BRCA | Original | 18206 | 368 | 19049 | 19568 | 671 | 10 | Classification |
GS-GBM | Original | 20684 | 335 | 19034 | 19545 | 243 | 10 | Classification |
GS-LGG | Original | 18345 | 345 | 19023 | 19534 | 246 | 10 | Classification |
GS-OV | Original | 17354 | 244 | 19034 | 19534 | 284 | 10 | Classification |
Imp-COAD | Top | 2000 | 200 | 2000 | 2000 | 260 | 7 | Imputation |
Imp-BRCA | Top | 2000 | 200 | 2000 | 2000 | 671 | 7 | Imputation |
Imp-GBM | Top | 2000 | 200 | 2000 | 2000 | 243 | 7 | Imputation |
Imp-LGG | Top | 2000 | 200 | 2000 | 2000 | 246 | 7 | Imputation |
Imp-OV | Top | 2000 | 200 | 2000 | 2000 | 284 | 7 | Imputation |
MLOmics currently provides 20 learning tasks in three studies, including pan-cancer classification, cancer subtype identification, and omics data imputation, each with a corresponding dataset version, baseline methods, and evaluation metrics.
Motivation:
This task aims to identify the specific cancer type for each patient, enhancing early diagnostic accuracy and potentially improving treatment outcomes.
Baseline Methods:
Several computational multi-omics data integration methods have been proposed for cancer identification, using both classical statistical machine learning and deep learning. Currently, we include the following widely used, open-source methods:
- Similarity Network Fusion (SNF) [1]: Integrates omics data by iteratively refining sample similarity networks and applying spectral clustering.
- Neighborhood-based Multi-Omics clustering (NEMO) [2]: Converts sample similarity networks to relative similarity for group comparability.
- Cancer Integration via Multi-kernel Learning (CIMLR) [3]: Combines various Gaussian kernels into a similarity matrix for clustering.
- iClusterBayes [4]: Projects input into a low-dimensional space using the Bayesian latent variable regression model for clustering.
- moCluster [5]: Uses multiple multivariate analyses to calculate latent variables for classification.
- Subtype-GAN [6]: Extracts features from each omics data by relatively independent GAN layers and integrates them.
- DCAP [7]: Integrates multi-omics data by the denoising autoencoder to obtain the representative features.
- MAUI [8]: Uses stacked VAE to extract many latent factors to identify patient groups.
- XOmiVAE [9]: Uses VAE for low-dimensional latent space extraction and classification.
- MCluster-VAEs [10]: Uses VAE with an attention mechanism to model multi-omics data.
Evaluation Metrics:
Following the related literature, we use precision (PREC), normalized mutual information (NMI), and adjusted Rand index (ARI) to evaluate the degree of agreement between the subtyping results obtained by different methods and the true labels.
Task | #Baselines | Metrics |
---|---|---|
Pan-cancer Classification | 10 | PREC, NMI, ARI |
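For readers unfamiliar with NMI and ARI, the sketch below computes both from two label vectors in pure Python. It is an illustrative implementation and not the code in `task_metrics.py`. The arithmetic-mean normalization for NMI is an assumption, since other normalizations exist. PREC is not covered here.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def _comb2(n: int) -> float:
    """Number of unordered pairs among n items."""
    return n * (n - 1) / 2.0


def adjusted_rand_index(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """ARI from the contingency table of two labelings (needs >= 2 samples)."""
    n = len(true)
    sum_ij = sum(_comb2(c) for c in Counter(zip(true, pred)).values())
    sum_a = sum(_comb2(c) for c in Counter(true).values())
    sum_b = sum(_comb2(c) for c in Counter(pred).values())
    expected = sum_a * sum_b / _comb2(n)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def normalized_mutual_info(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """NMI = I(true; pred) / mean(H(true), H(pred)), using natural logarithms."""
    n = len(true)
    pa, pb = Counter(true), Counter(pred)
    mi = sum(c / n * math.log(c * n / (pa[a] * pb[b]))
             for (a, b), c in Counter(zip(true, pred)).items())
    h_a = -sum(c / n * math.log(c / n) for c in pa.values())
    h_b = -sum(c / n * math.log(c / n) for c in pb.values())
    denom = 0.5 * (h_a + h_b)
    return 1.0 if denom == 0.0 else mi / denom


# Cluster names do not matter, only the grouping they induce.
truth = [0, 0, 1, 1, 2, 2]
found = ["b", "b", "a", "a", "c", "c"]
ari = adjusted_rand_index(truth, found)
nmi = normalized_mutual_info(truth, found)
print(round(ari, 6), round(nmi, 6))  # 1.0 1.0
```

Both scores reach 1 for a perfect match up to relabeling. ARI can become negative when agreement is worse than expected by chance.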
Motivation:
Each cancer type comprises multiple subtypes. Cancer subtyping aims to categorize patients into subgroups based on their multi-omics data, because subtypes that differ at the biochemical level often share the same morphological traits, such as physical structure and form. For most cancer types, however, the definition of subtypes is still an open question, so subtyping is typically a clustering task without ground-truth labels. For some of the most common cancer types, the research community has thoroughly analyzed the subtypes in a previous study. We therefore treat these subtypes as true labels and set up a classification task for them.
Baseline Methods:
Since most methods do not have a specific application for labeled or unlabeled datasets, they can serve as baselines across both types of tasks. We use the same baselines (i.e., SNF, NEMO, CIMLR, iClusterBayes, moCluster, Subtype-GAN, DCAP, MAUI, XOmiVAE, and MCluster-VAEs) as in pan-cancer classification tasks.
Evaluation Metrics:
For subtype clustering, we evaluate the baseline results using the silhouette coefficient (SIL) and log-rank test p-value on survival time (LPS). For the golden-standard subtype classification, we also use the metrics of PREC, NMI, and ARI.
Task | #Baselines | Metrics |
---|---|---|
Cancer Subtype Clustering | 10 | SIL, LPS |
Golden-standard Subtype Classification | 10 | PREC, NMI, ARI |
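The silhouette coefficient can be illustrated with a short pure-Python sketch. This is a minimal example assuming Euclidean distance on a hypothetical low-dimensional embedding. It is not the MLOmics evaluation code, and the log-rank test behind LPS is not covered.

```python
import math
from typing import List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 clusters)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores: List[float] = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Mean distance from point i to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q)
                     for j, (q, l) in enumerate(zip(points, labels))
                     if l == c and j != i]
            mean_dist[c] = sum(dists) / len(dists) if dists else None
        a = mean_dist[lab]  # cohesion: distance to the point's own cluster
        if a is None:       # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        # Separation: distance to the nearest other cluster.
        b = min(d for c, d in mean_dist.items() if c != lab and d is not None)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Two well-separated toy groups (hypothetical 2-D embeddings of four patients).
pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(pts, [0, 0, 1, 1])
bad = silhouette_score(pts, [0, 1, 0, 1])
print(good > bad)  # True
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that samples sit closer to another cluster than to their own.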
Motivation:
We also set up an essential learning task focused on the omics data themselves. The collected omics data typically contain missing values due to experimental limitations, technical errors, or inherent variability. Imputation is therefore crucial for ensuring the integrity and usability of TCGA omics data.
Baseline Methods:
There are several widely used methods for imputing missing values in datasets. Currently, we include seven of them:
- Mean imputation (Mean) [11]: Imputes missing values using the mean of all observed values for the same feature.
- K-Nearest Neighbors (KNN) [12]: Imputes missing values using the K-nearest neighbors with observed values in the same feature. The weights are based on the Euclidean distance to the sample.
- Multivariate imputation by chained equations (MICE) [13]: Runs multiple regressions where each missing value is modeled based on the observed non-missing values.
- Iterative SVD (SVD) [14]: Uses matrix completion with iterative low-rank SVD decomposition to impute missing values.
- Spectral regularization algorithm (Spectral) [15]: A matrix completion model that uses the nuclear norm as a regularizer and imputes missing values with iterative soft-thresholded SVD.
- Graph neural network for tabular data (GRAPE) [16]: Transforms rows and columns of tabular data into two types of nodes in the graph structure. Then, it uses a graph neural network to learn node representations and turns the imputation task into a missing edge prediction task on the graph.
- Generative Adversarial Imputation Nets (GAIN) [17]: Imputes missing data by leveraging the adversarial process to learn the underlying distribution.
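As a concrete reference point, the simplest baseline above (Mean) can be sketched in a few lines, together with a way to hide entries at a given missing rate. This is a hypothetical illustration: `corrupt` and `mean_impute` are not MLOmics functions, and hiding entries uniformly at random is an assumption about how the corrupted datasets are produced.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[Optional[float]]]


def corrupt(data: List[List[float]], missing_rate: float,
            seed: int = 0) -> Tuple[Matrix, List[Tuple[int, int]]]:
    """Hide a fraction of entries uniformly at random; return matrix and hidden cells."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    hidden = rng.sample(cells, round(missing_rate * len(cells)))
    corrupted: Matrix = [list(row) for row in data]
    for i, j in hidden:
        corrupted[i][j] = None  # None marks a missing value
    return corrupted, hidden


def mean_impute(matrix: Matrix) -> List[List[float]]:
    """Fill each missing entry with the mean of the observed values of its feature."""
    means = []
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Fall back to 0.0 if a feature has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


# Toy matrix: rows are samples, columns are features.
full = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]]
corrupted, hidden = corrupt(full, missing_rate=0.3)
imputed = mean_impute(corrupted)
print(len(hidden), "of 12 entries hidden and re-filled")
```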
Evaluation Metrics:
We use metrics including mean absolute error (MAE) and root mean squared error (RMSE), which are commonly used to assess imputation quality.
Task | #Baselines | Metrics |
---|---|---|
Omics Data Imputation | 7 | MAE, RMSE |
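The two metrics can be written down directly. The sketch below assumes that errors are scored only on the entries that were hidden before imputation, which is a common convention but is not stated above. The function name `imputation_errors` and the toy matrices are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def imputation_errors(truth: List[List[float]], imputed: List[List[float]],
                      hidden: Sequence[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (MAE, RMSE) computed only over the hidden positions."""
    if not hidden:
        raise ValueError("no hidden entries to score")
    diffs = [imputed[i][j] - truth[i][j] for i, j in hidden]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Toy example: two hidden cells with errors of +2 and -3.
truth = [[1.0, 2.0], [3.0, 4.0]]
imputed = [[1.0, 4.0], [0.0, 4.0]]
mae, rmse = imputation_errors(truth, imputed, hidden=[(0, 1), (1, 0)])
print(mae, round(rmse, 4))  # 2.5 2.5495
```

RMSE penalizes large individual errors more strongly than MAE, so it is never smaller than MAE on the same entries.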
We summarize the performance of ten baseline cancer patient classification methods and seven imputation methods across various datasets and missing rates.
We tested ten baseline cancer patient classification methods on four patient classification datasets. The results are reported as PREC, NMI, and ARI.
Method | Pan-cancer PREC | Pan-cancer NMI | Pan-cancer ARI | GS-BRCA PREC | GS-BRCA NMI | GS-BRCA ARI | GS-COAD PREC | GS-COAD NMI | GS-COAD ARI | GS-GBM PREC | GS-GBM NMI | GS-GBM ARI |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SNF | 0.643 | 0.543 | 0.475 | 0.644 | 0.523 | 0.426 | 0.625 | 0.534 | 0.432 | 0.625 | 0.544 | 0.470 |
NEMO | 0.656 | 0.464 | 0.356 | 0.542 | 0.444 | 0.333 | 0.644 | 0.454 | 0.333 | 0.634 | 0.406 | 0.316 |
CIMLR | 0.665 | 0.365 | 0.344 | 0.655 | 0.332 | 0.345 | 0.631 | 0.343 | 0.344 | 0.647 | 0.344 | 0.323 |
iClusterBayes | 0.747 | 0.534 | 0.433 | 0.646 | 0.524 | 0.428 | 0.637 | 0.582 | 0.434 | 0.662 | 0.506 | 0.432 |
moCluster | 0.725 | 0.553 | 0.557 | 0.636 | 0.630 | 0.655 | 0.749 | 0.546 | 0.652 | 0.755 | 0.734 | 0.564 |
Subtype-GAN | 0.844 | 0.774 | 0.748 | 0.873 | 0.734 | 0.643 | 0.851 | 0.685 | 0.648 | 0.837 | 0.625 | 0.640 |
DCAP | 0.845 | 0.745 | 0.636 | 0.852 | 0.743 | 0.733 | 0.852 | 0.667 | 0.655 | 0.825 | 0.642 | 0.522 |
MAUI | 0.859 | 0.758 | 0.625 | 0.844 | 0.792 | 0.742 | 0.882 | 0.635 | 0.696 | 0.874 | 0.741 | 0.691 |
XOmiVAE | 0.894 | 0.795 | 0.774 | 0.843 | 0.753 | 0.761 | 0.923 | 0.752 | 0.732 | 0.946 | 0.791 | 0.737 |
MCluster-VAEs | 0.883 | 0.776 | 0.763 | 0.852 | 0.784 | 0.766 | 0.895 | 0.743 | 0.727 | 0.913 | 0.783 | 0.718 |
We conducted missing value imputation experiments on five types of transcriptomics data with three different missing rates (70%, 50%, 30%). The results are reported as RMSE and MAE.
Data | Missing Rate | Mean RMSE | Mean MAE | KNN RMSE | KNN MAE | MICE RMSE | MICE MAE | SVD RMSE | SVD MAE | SPEC RMSE | SPEC MAE | GRAPE RMSE | GRAPE MAE | GAIN RMSE | GAIN MAE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BRCA | 70% | 0.119 | 0.092 | 0.109 | 0.081 | 0.106 | 0.079 | 0.099 | 0.076 | 0.104 | 0.076 | 0.127 | 0.099 | 0.117 | 0.089 |
BRCA | 50% | 0.119 | 0.092 | 0.103 | 0.075 | 0.090 | 0.066 | 0.086 | 0.063 | 0.090 | 0.063 | 0.131 | 0.101 | 0.114 | 0.087 |
BRCA | 30% | 0.119 | 0.092 | 0.099 | 0.075 | 0.084 | 0.062 | 0.080 | 0.058 | 0.088 | 0.058 | 0.131 | 0.102 | 0.112 | 0.085 |
COAD | 70% | 0.101 | 0.077 | 0.099 | 0.073 | 0.093 | 0.068 | 0.089 | 0.067 | 0.094 | 0.069 | 0.102 | 0.077 | 0.104 | 0.079 |
COAD | 50% | 0.101 | 0.077 | 0.091 | 0.066 | 0.079 | 0.058 | 0.077 | 0.057 | 0.076 | 0.055 | 0.110 | 0.075 | 0.103 | 0.079 |
COAD | 30% | 0.102 | 0.077 | 0.086 | 0.063 | 0.076 | 0.056 | 0.072 | 0.053 | 0.071 | 0.051 | 0.105 | 0.070 | 0.103 | 0.078 |
GBM | 70% | 0.122 | 0.096 | 0.106 | 0.080 | 0.097 | 0.073 | 0.096 | 0.074 | 0.110 | 0.084 | 0.125 | 0.117 | 0.122 | 0.095 |
GBM | 50% | 0.122 | 0.096 | 0.097 | 0.073 | 0.084 | 0.063 | 0.082 | 0.063 | 0.084 | 0.061 | 0.145 | 0.116 | 0.115 | 0.089 |
GBM | 30% | 0.122 | 0.096 | 0.093 | 0.070 | 0.080 | 0.060 | 0.078 | 0.062 | 0.083 | 0.058 | 0.146 | 0.117 | 0.114 | 0.088 |
LGG | 70% | 0.131 | 0.104 | 0.109 | 0.083 | 0.095 | 0.072 | 0.097 | 0.074 | 0.153 | 0.124 | 0.152 | 0.123 | 0.132 | 0.095 |
LGG | 50% | 0.131 | 0.103 | 0.098 | 0.074 | 0.082 | 0.061 | 0.081 | 0.061 | 0.082 | 0.062 | 0.151 | 0.123 | 0.129 | 0.102 |
LGG | 30% | 0.131 | 0.103 | 0.094 | 0.071 | 0.078 | 0.058 | 0.076 | 0.057 | 0.074 | 0.056 | 0.151 | 0.123 | 0.123 | 0.097 |
OV | 70% | 0.124 | 0.098 | 0.122 | 0.094 | 0.118 | 0.091 | 0.112 | 0.088 | 0.161 | 0.130 | 0.127 | 0.101 | 0.126 | 0.099 |
OV | 50% | 0.124 | 0.098 | 0.109 | 0.083 | 0.102 | 0.078 | 0.100 | 0.075 | 0.098 | 0.078 | 0.126 | 0.099 | 0.125 | 0.098 |
OV | 30% | 0.124 | 0.098 | 0.103 | 0.078 | 0.098 | 0.075 | 0.093 | 0.071 | 0.090 | 0.069 | 0.126 | 0.099 | 0.124 | 0.097 |