Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding how the choice of data types and their combinations affects the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping (iClusterBayes, LRAcluster, SNF, PFA, moCluster, NEMO, CIMLR, MultiNMF, PINS, and Subtype-GAN) in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most of the cancers in our study, which may be of particular interest to researchers in omics data analysis.
- We constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types.
- We conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency.
- Refuting the widely held intuition that incorporating more types of omics data always helps produce better results, our analyses showed that there are situations where integrating more types of omics data negatively impacts the performance of the integration methods.
- The influence of different omics data types varies in cancer subtyping.
- Our analyses suggested several effective combinations of omics data types for most of the cancers in our study (e.g. mRNA + miRNA + CNV and mRNA + DNA methylation) that can improve the accuracy of cancer subtyping.
The Cancer Genome Atlas Program (TCGA) has collected a large number of different types of omics data from more than 30 types of cancer. In this study, we chose four types of omics data: copy number variation (CNV) at the genome level, DNA methylation and miRNA expression at the epigenome level, and mRNA expression at the transcriptome level.
To decide which cancer types to use in this study, we focused on cancers that had a sufficient number of samples for all four types of omics data in the TCGA collection and that had been used in previous studies on cancer subtyping. As a result, nine common cancers were chosen: Adrenocortical Carcinoma (ACC), Breast Invasive Carcinoma (BRCA), Colon Adenocarcinoma (COAD), Kidney Renal Papillary Cell Carcinoma (KIRP), Kidney Renal Clear Cell Carcinoma (KIRC), Liver Hepatocellular Carcinoma (LIHC), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC), and Thymoma (THYM).
To make a comprehensive evaluation and comparison, we constructed three different classes of benchmarking datasets using the four types of omics data of the nine cancers, including Dataset #1: Nine-cancer datasets, Dataset #2: Noise datasets, and Dataset #3: Gold standards datasets.
Because previous studies used inconsistent data combinations and there is no common guideline, we enumerated all possible combinations of the four types of omics data. There are six two-omics combinations (m+mi, m+me, m+cnv, mi+me, mi+cnv, me+cnv), four three-omics combinations (m+mi+me, m+mi+cnv, m+me+cnv, mi+me+cnv), and one four-omics combination (m+mi+me+cnv), giving a total of 11 possible combinations. We used each of these combinations to construct the three classes of benchmarking datasets described above, and all of these datasets were used in our experiments.
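The enumeration above can be reproduced in a few lines of Python. This is only an illustrative sketch, not code from this repository: the function name `enumerate_combinations` is hypothetical, and the short labels are read here as m = mRNA expression, mi = miRNA expression, me = DNA methylation, and cnv = copy number variation.

```python
from itertools import combinations

# Short labels as used in this README; the mapping to data types is our
# reading of the text (m = mRNA, mi = miRNA, me = DNA methylation, cnv = CNV).
OMICS = ["m", "mi", "me", "cnv"]


def enumerate_combinations(omics=OMICS, min_size=2):
    """Return every combination of at least `min_size` omics types.

    Hypothetical helper for illustration: combinations are ordered by size
    (two-omics first), and labels within a combination keep the order of
    the `omics` list.
    """
    combos = []
    for k in range(min_size, len(omics) + 1):
        for combo in combinations(omics, k):
            combos.append("+".join(combo))
    return combos


all_combos = enumerate_combinations()
for name in all_combos:
    print(name)
print("total:", len(all_combos))
```

With four omics types this gives 6 + 4 + 1 = 11 combinations, matching the counts stated above. Single-omics inputs are excluded by `min_size=2` because the study concerns integration of multiple data types.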
For details of the data pre-processing and dataset construction, please see our paper.
Because the total size of the datasets exceeds GitHub's file size limit, we cannot host them in this repository. Instead, we have uploaded the datasets to figshare, Dropbox, and OneDrive. Click the links in the table below to download the datasets from the source of your choice.
Datasets | figshare | Dropbox | OneDrive |
---|---|---|---|
Dataset #1 BRCA Complete | Link | Link | Link |
Dataset #1 BRCA Significant | Link | Link | Link |
Dataset #1 COAD Complete | Link | Link | Link |
Dataset #1 COAD Significant | Link | Link | Link |
Dataset #1 KIRC Complete | Link | Link | Link |
Dataset #1 KIRC Significant | Link | Link | Link |
Dataset #1 LUAD Complete | Link | Link | Link |
Dataset #1 LUAD Significant | Link | Link | Link |
Dataset #1 LUSC Complete | Link | Link | Link |
Dataset #1 LUSC Significant | Link | Link | Link |
Dataset #1 ACC Complete | Link | Link | |
Dataset #1 ACC Significant | Link | Link | |
Dataset #1 KIRP Complete | Link | Link | |
Dataset #1 KIRP Significant | Link | Link | |
Dataset #1 LIHC Complete | Link | Link | |
Dataset #1 LIHC Significant | Link | Link | |
Dataset #1 THYM Complete | Link | Link | |
Dataset #1 THYM Significant | Link | Link | |
Dataset #2 BRCA 0.5 | Link | Link | Link |
Dataset #2 BRCA 1 | Link | Link | Link |
Dataset #2 BRCA 2 | Link | Link | Link |
Dataset #2 BRCA 3 | Link | Link | Link |
Dataset #2 BRCA 4 | Link | Link | Link |
Dataset #2 COAD 0.5 | Link | Link | Link |
Dataset #2 COAD 1 | Link | Link | Link |
Dataset #2 COAD 2 | Link | Link | Link |
Dataset #2 COAD 3 | Link | Link | Link |
Dataset #2 COAD 4 | Link | Link | Link |
Dataset #3 BRCA Complete | Link | Link | Link |
Dataset #3 BRCA Significant | Link | Link | Link |
Dataset #3 COAD Complete | Link | Link | Link |
Dataset #3 COAD Significant | Link | Link | Link |
Dataset #3 Pan-cancer | Link | Link | Link |
Here, we provide all the results of different methods in this work. Please click the hyperlinks below for downloading.
Method Results: Dataset #1 (Complete Datasets 1, Complete Datasets 2, Significant Datasets 1, Significant Datasets 2), Dataset #2 Noise Datasets (1, 2, 3), Dataset #3 (Gold Standard Datasets, Pan-cancer Datasets)
If you are interested in this work and have any questions or suggestions, please send an email to us (duanran9013(at)126.com). We will reply to you as soon as possible. Thanks for your email.