Experimental evaluation and comparison of multi-omics data integration methods for cancer subtyping

GaoLabXDU/MultiOmicsIntegrationStudy

Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating them is complicated by the lack of gold standards. Moreover, questions of practical importance remain open regarding how the choice of data types and their combinations affects the performance of integrative studies. Here, we constructed three classes of benchmarking datasets for nine cancers in TCGA by considering all eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping (iClusterBayes, LRAcluster, SNF, PFA, moCluster, NEMO, CIMLR, MultiNMF, PINS, and Subtype-GAN) in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers in our study, which may be of particular interest to researchers in omics data analysis.

Key points

  • We constructed three classes of benchmarking datasets for nine cancers in TCGA by considering all eleven combinations of four multi-omics data types.
  • We conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency.
  • Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more types of omics data negatively impacts the performance of the integration methods.
  • The influence of different omics data types on cancer subtyping varies.
  • Our analyses suggested several effective combinations of omics data types for most cancers in our study (e.g. mRNA + miRNA + CNV and mRNA + DNA methylation) that can improve the accuracy of cancer subtyping.

Datasets description

The Cancer Genome Atlas Program (TCGA) has collected a large number of different types of omics data from more than 30 types of cancer. In this study, four types of omics data were chosen: copy number variation (CNV) at the genome level, DNA methylation and miRNA expression at the epigenome level, and mRNA expression at the transcriptome level.

To decide which cancer types to use in this study, we focused on cancers that had a sufficient number of samples for all four types of omics data in the TCGA collection and had been used in previous studies on cancer subtyping. As a result, nine common cancers were chosen: Adrenocortical Carcinoma (ACC), Breast Invasive Carcinoma (BRCA), Colon Adenocarcinoma (COAD), Kidney Renal Papillary Cell Carcinoma (KIRP), Kidney Renal Clear Cell Carcinoma (KIRC), Liver Hepatocellular Carcinoma (LIHC), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC), and Thymoma (THYM).

To make a comprehensive evaluation and comparison, we constructed three different classes of benchmarking datasets using the four types of omics data of the nine cancers, including Dataset #1: Nine-cancer datasets, Dataset #2: Noise datasets, and Dataset #3: Gold standards datasets.

Because previous studies used inconsistent data combinations and no common guideline exists, we enumerated all possible combinations of the four types of omics data. There are six two-omic combinations (m+mi, m+me, m+cnv, mi+me, mi+cnv, me+cnv), four three-omic combinations (m+mi+me, m+mi+cnv, m+me+cnv, mi+me+cnv), and one four-omic combination (m+mi+me+cnv), giving a total of 11 possible combinations. We used each of these combinations to construct the three classes of benchmarking datasets described above, and all of these datasets were used in our experiments.
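
The enumeration above can be reproduced with a few lines of Python. This is only a minimal sketch for illustration, not code from this repository: the function name enumerate_combinations is hypothetical, and the reading of the abbreviations (m = mRNA, mi = miRNA, me = DNA methylation, cnv = copy number variation) is an assumption based on the four data types named in this README.

```python
from itertools import combinations

# Abbreviations as they appear in the combination names above.
# Assumed meaning: m = mRNA, mi = miRNA, me = DNA methylation,
# cnv = copy number variation.
OMICS = ["m", "mi", "me", "cnv"]


def enumerate_combinations(omics):
    """Return every combination of two or more omics types, joined by '+'.

    Hypothetical helper for illustration only.
    """
    combos = []
    for size in range(2, len(omics) + 1):
        for subset in combinations(omics, size):
            combos.append("+".join(subset))
    return combos


all_combos = enumerate_combinations(OMICS)
for name in all_combos:
    print(name)
print(len(all_combos))
```

With four data types, the sizes 2, 3, and 4 contribute 6, 4, and 1 combinations respectively, which matches the total of 11 stated above.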

For details of the data pre-processing and dataset construction, please see our paper.

Availability of datasets

Because the full datasets exceed GitHub's file size limit, they are not hosted in this repository. We have uploaded them to figshare, Dropbox, and OneDrive. Use the links in the table below to download the datasets from any of these sources.

Datasets figshare Dropbox OneDrive
Dataset #1 BRCA Complete Link Link Link
Dataset #1 BRCA Significant Link Link Link
Dataset #1 COAD Complete Link Link Link
Dataset #1 COAD Significant Link Link Link
Dataset #1 KIRC Complete Link Link Link
Dataset #1 KIRC Significant Link Link Link
Dataset #1 LUAD Complete Link Link Link
Dataset #1 LUAD Significant Link Link Link
Dataset #1 LUSC Complete Link Link Link
Dataset #1 LUSC Significant Link Link Link
Dataset #1 ACC Complete Link Link
Dataset #1 ACC Significant Link Link
Dataset #1 KIRP Complete Link Link
Dataset #1 KIRP Significant Link Link
Dataset #1 LIHC Complete Link Link
Dataset #1 LIHC Significant Link Link
Dataset #1 THYM Complete Link Link
Dataset #1 THYM Significant Link Link
Dataset #2 BRCA 0.5 Link Link Link
Dataset #2 BRCA 1 Link Link Link
Dataset #2 BRCA 2 Link Link Link
Dataset #2 BRCA 3 Link Link Link
Dataset #2 BRCA 4 Link Link Link
Dataset #2 COAD 0.5 Link Link Link
Dataset #2 COAD 1 Link Link Link
Dataset #2 COAD 2 Link Link Link
Dataset #2 COAD 3 Link Link Link
Dataset #2 COAD 4 Link Link Link
Dataset #3 BRCA Complete Link Link Link
Dataset #3 BRCA Significant Link Link Link
Dataset #3 COAD Complete Link Link Link
Dataset #3 COAD Significant Link Link Link
Dataset #3 Pan-cancer Link Link Link

Results

Here, we provide the results of all methods evaluated in this work. Use the links below to download them.

Method Results: Dataset #1 (Complete Datasets 1, Complete Datasets 2, Significant Datasets 1, Significant Datasets 2), Dataset #2 Noise Datasets (1, 2, 3), Dataset #3 (Gold Standard Datasets, Pan-cancer Datasets)

Contact us

If you are interested in this work and have any questions or suggestions, please email us at duanran9013(at)126.com. We will reply as soon as possible.
