GitHub - GaoLabXDU/MultiOmicsIntegrationStudy: Experimental evaluation and comparison of multi-omics data integration methods for cancer subtyping

Experimental evaluation and comparison of multi-omics data integration methods for cancer subtyping

Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods is complicated by the lack of gold standards. Moreover, questions of practical importance remain open regarding how the choice of data types and their combinations affects the performance of integrative studies. Here, we constructed three classes of benchmarking datasets for nine cancers in TCGA by considering all eleven combinations of four omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping (iClusterBayes, LRAcluster, SNF, PFA, moCluster, NEMO, CIMLR, MultiNMF, PINS, and Subtype-GAN) in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers in our study, which may be of particular interest to researchers in omics data analysis.

Key points

We constructed three classes of benchmarking datasets for nine cancers in TCGA by considering all eleven combinations of four omics data types.
We conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency.
Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more types of omics data negatively impacts the performance of the integration methods.
The influence on cancer subtyping varies across omics data types.
Our analyses suggested several effective combinations of omics data types for most cancers in our study (e.g. mRNA + miRNA + CNV and mRNA + DNA methylation) that can improve the accuracy of cancer subtyping.

Datasets description

The Cancer Genome Atlas Program (TCGA) has collected a large number of different types of omics data from more than 30 types of cancers. In this study, four types of omics data were chosen: copy number variation (CNV) at the genome level, DNA methylation and miRNA expression at the epigenome level, and mRNA expression at the transcriptome level.

To decide which cancer types to use in this study, we focused on cancers that had a sufficient number of samples for all four types of omics data in the TCGA collection and had been used in previous studies on cancer subtyping. As a result, nine common cancers were chosen: Adrenocortical Carcinoma (ACC), Breast Invasive Carcinoma (BRCA), Colon Adenocarcinoma (COAD), Kidney Renal Papillary Cell Carcinoma (KIRP), Kidney Renal Clear Cell Carcinoma (KIRC), Liver Hepatocellular Carcinoma (LIHC), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC), and Thymoma (THYM).

To make a comprehensive evaluation and comparison, we constructed three different classes of benchmarking datasets using the four types of omics data of the nine cancers: Dataset #1 (Nine-cancer datasets), Dataset #2 (Noise datasets), and Dataset #3 (Gold standards datasets).

Because previous studies used inconsistent data combinations and there is no common guideline, we enumerated all possible combinations of the four types of omics data (m = mRNA, mi = miRNA, me = DNA methylation, cnv = CNV). There are six two-omic combinations (m+mi, m+me, m+cnv, mi+me, mi+cnv, me+cnv), four three-omic combinations (m+mi+me, m+mi+cnv, m+me+cnv, mi+me+cnv), and one four-omic combination (m+mi+me+cnv), giving a total of 11 possible combinations. We used each of these combinations to construct the three classes of benchmarking datasets described above, and all of these datasets were used in our experiments.
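The enumeration above can be reproduced with a few lines of Python. This is only an illustrative sketch and not part of the study's code: the function name `enumerate_combinations` is hypothetical, and the short labels are taken from the combination names listed in the text.

```python
from itertools import combinations

# Short labels as they appear in the combination names:
# m = mRNA, mi = miRNA, me = DNA methylation, cnv = copy number variation.
OMICS = ["m", "mi", "me", "cnv"]


def enumerate_combinations(omics=OMICS):
    """Return every combination of two or more omics types, grouped by size.

    Hypothetical helper for illustration; the name is not from the repository.
    """
    return {
        size: ["+".join(combo) for combo in combinations(omics, size)]
        for size in range(2, len(omics) + 1)
    }


if __name__ == "__main__":
    by_size = enumerate_combinations()
    for size, combos in by_size.items():
        print(f"{size}-omic ({len(combos)}): {', '.join(combos)}")
    print("total:", sum(len(combos) for combos in by_size.values()))
```

Single-omic inputs are excluded because the study concerns integration of at least two data types, which is why the sizes start at 2.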

For details of the data pre-processing and dataset construction, please see our paper.

Availability of datasets

Because the full datasets exceed GitHub's file size limit, we cannot host them in this repository. Instead, we have uploaded them to figshare, Dropbox, and OneDrive. Use the links in the table below to download the datasets from any of these sources.

Datasets	figshare	Dropbox	OneDrive
Dataset #1 BRCA Complete	Link	Link	Link
Dataset #1 BRCA Significant	Link	Link	Link
Dataset #1 COAD Complete	Link	Link	Link
Dataset #1 COAD Significant	Link	Link	Link
Dataset #1 KIRC Complete	Link	Link	Link
Dataset #1 KIRC Significant	Link	Link	Link
Dataset #1 LUAD Complete	Link	Link	Link
Dataset #1 LUAD Significant	Link	Link	Link
Dataset #1 LUSC Complete	Link	Link	Link
Dataset #1 LUSC Significant	Link	Link	Link
Dataset #1 ACC Complete	Link		Link
Dataset #1 ACC Significant	Link		Link
Dataset #1 KIRP Complete	Link		Link
Dataset #1 KIRP Significant	Link		Link
Dataset #1 LIHC Complete	Link		Link
Dataset #1 LIHC Significant	Link		Link
Dataset #1 THYM Complete	Link		Link
Dataset #1 THYM Significant	Link		Link
Dataset #2 BRCA 0.5	Link	Link	Link
Dataset #2 BRCA 1	Link	Link	Link
Dataset #2 BRCA 2	Link	Link	Link
Dataset #2 BRCA 3	Link	Link	Link
Dataset #2 BRCA 4	Link	Link	Link
Dataset #2 COAD 0.5	Link	Link	Link
Dataset #2 COAD 1	Link	Link	Link
Dataset #2 COAD 2	Link	Link	Link
Dataset #2 COAD 3	Link	Link	Link
Dataset #2 COAD 4	Link	Link	Link
Dataset #3 BRCA Complete	Link	Link	Link
Dataset #3 BRCA Significant	Link	Link	Link
Dataset #3 COAD Complete	Link	Link	Link
Dataset #3 COAD Significant	Link	Link	Link
Dataset #3 Pan-cancer	Link	Link	Link

Results

Here, we provide the results of all the methods evaluated in this work. Click the links below to download them.

Method Results: Dataset #1 (Complete Datasets 1, Complete Datasets 2, Significant Datasets 1, Significant Datasets 2), Dataset #2 Noise Datasets (1, 2, 3), Dataset #3 (Gold Standard Datasets, Pan-cancer Datasets)

Contact us

If you are interested in this work and have any questions or suggestions, please email us (duanran9013(at)126.com). We will reply as soon as possible.
