Abstract:
Clustering algorithms have been used to help discover cancer subtypes, but the different algorithms had not been compared before the replicated paper [1]. This study provides a comprehensive comparison of the algorithms to guide future algorithm selection in cancer subtype research. It evaluates and compares the clustering of cancer gene expression data using seven clustering algorithms and eight proximity measures, with clustering performance assessed by the corrected Rand index (cR). The results of the replicated analysis differ from those of the original study: for Affymetrix datasets, k-means outperformed the other methods and the finite mixture of Gaussians ranked second. For cDNA datasets, spectral clustering and shared nearest neighbors performed best. Manhattan distance yielded the best mean cR for Affymetrix datasets. Analysis with PCA reduced performance, likely due to information loss. Tissue type and microarray technology also influenced clustering results: blood tissue datasets achieved better classification and higher cR values than brain tissue datasets, and cDNA datasets generally showed better classification than Affymetrix datasets. On a selected blood tissue dataset (cDNA and Affymetrix), k-means clustering and PCA showed better classification performance than hierarchical clustering. This study replicates the original study and expands on it by examining the effect on classification performance of an additional proximity measure (Manhattan distance), exploring datasets from different microarray platforms, and analyzing datasets containing diverse tissue types.
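
For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how a corrected Rand index can be computed from two label assignments. It assumes that cR denotes the chance-corrected (adjusted) Rand index of Hubert and Arabie; the function name `corrected_rand` and the toy labels are illustrative and are not taken from the study.

```python
from collections import Counter
from math import comb


def corrected_rand(labels_true, labels_pred):
    """Chance-corrected Rand index between two partitions of the same samples.

    Assumption: this follows the Hubert-Arabie adjusted Rand index formula.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no sample pairs to disagree on

    # Contingency counts: cells n_ij, row sums a_i, column sums b_j.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of sample pairs that fall together in each cell, row, and column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2          # largest attainable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# Toy example: known classes versus two hypothetical clusterings.
classes = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]
print(corrected_rand(classes, same_up_to_renaming))
print(corrected_rand(classes, crossed))
```

A value of 1 indicates that the clustering reproduces the reference classes exactly, regardless of how the clusters are named. Values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.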