Clustering infrastructure

lgatto edited this page Apr 24, 2013 · 23 revisions

This page defines a clustering infrastructure, modelled on the supervised framework currently available.

Use cases

  • Simple/direct (naive) clustering for data exploration and QC. It should work out of the box on an MSnSet, and the result should be plottable with plot2D.
  • An optimisation infrastructure for tuning clustering parameters, such as the number of clusters k.

Algorithms of interest

  • kmeans, as a baseline clustering method
  • spectral clustering (kernlab::specc)
  • Gaussian mixture models (mclust)
  • other

Interface

The examples below take kmeans as an example and use the supervised framework as a template.

library("pRoloc")
library("pRolocdata")
data(dunkley2006)

Simple clustering

res <- kmeansClustering(dunkley2006, centers = 9)
head(fData(res)$kmeans)
## [1] 5 5 5 5 5 5
plot2D(res, fcol = "kmeans")
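
The call above stores the cluster assignments in a new feature data column. As a rough illustration of that idea only, the base R sketch below runs stats::kmeans on a toy expression matrix and writes the labels into a per-feature data frame. The list used as a stand-in for an MSnSet and the function name simple_kmeans are hypothetical and are not part of pRoloc.

```r
# Hypothetical stand-in for an MSnSet: a list with an expression matrix
# ("exprs") and a per-feature data frame ("fdata"). Not the pRoloc API.
set.seed(42)
toy <- list(
  exprs = rbind(matrix(rnorm(60, mean = 0, sd = 0.2), ncol = 3),
                matrix(rnorm(60, mean = 2, sd = 0.2), ncol = 3)),
  fdata = data.frame(id = sprintf("prot%02d", 1:40))
)

# Illustrative wrapper: cluster the rows of the expression matrix and
# record the assignments in a feature data column named by 'fcol'.
simple_kmeans <- function(obj, centers, fcol = "kmeans", ...) {
  fit <- kmeans(obj$exprs, centers = centers, ...)
  obj$fdata[[fcol]] <- fit$cluster
  obj
}

res_toy <- simple_kmeans(toy, centers = 2, nstart = 10)
table(res_toy$fdata$kmeans)  # cluster sizes
```

Keeping the result in the feature data means any plotting function that accepts a column name, as plot2D does with fcol, can colour features by cluster without extra arguments.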

Optimising k

(param <- kmeansOptimisation(dunkley2006))
## Object of class "ClustRegRes"
##  Algorithm: kmeans 
##  Criteria: BIC AIC 
##  Parameters:
##   k : 1 2 ... 19 20
getParams(param, "BIC")
## k 
## 6
getParams(param, "AIC")
##  k 
## 10
plot(param)

levelPlot(param)

fvarLabels(res2 <- kmeansClustering(dunkley2006, param))
## [1] "markers"    "assigned"   "evidence"   "method"     "new"       
## [6] "pd.2013"    "pd.markers" "kmeans"
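
How the BIC and AIC criteria are computed is not specified on this page. As a minimal sketch, assuming one common approximation that treats the total within-cluster sum of squares as the deviance and penalises the k × m estimated centre coordinates, a scan over k can be written in base R as follows. The helper kmeans_ic and the toy data are hypothetical, and this is not necessarily what kmeansOptimisation does.

```r
# Toy data: three well-separated groups of 50 points in 2 dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))

# Assumed criteria (an approximation, labelled as such):
#   AIC = D + 2 * m * k,  BIC = D + log(n) * m * k
# where D is the total within-cluster sum of squares, m the number of
# dimensions, k the number of clusters and n the number of observations.
kmeans_ic <- function(fit) {
  m <- ncol(fit$centers)
  k <- nrow(fit$centers)
  n <- length(fit$cluster)
  D <- fit$tot.withinss
  c(AIC = D + 2 * m * k, BIC = D + log(n) * m * k)
}

ks <- 1:8
ic <- t(sapply(ks, function(k) kmeans_ic(kmeans(x, centers = k, nstart = 10))))
rownames(ic) <- ks

# Value of k minimising each criterion.
best <- apply(ic, 2, function(col) ks[which.min(col)])
best
```

Because the BIC penalty grows with log(n), it tends to select fewer clusters than AIC, which is consistent with the smaller k reported for BIC above.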

Optimise to ground truth

kmeansOptimisation(object, fcol), where fcol names a feature data column holding reference cluster definitions; the function would then optimise kmeans and its parameters so that the resulting clusters match this prior knowledge. See the clue package for suitable criteria.
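
To make the idea concrete, here is a minimal base R sketch that scores each candidate k against known labels and keeps the best one. It uses a hand-rolled Rand index as the agreement criterion, which is an assumption made for illustration; the clue package provides a wider range of criteria. The helper rand_index is hypothetical, and the built-in iris data stand in for an MSnSet with a reference column.

```r
# Rand index: fraction of observation pairs on which two partitions agree
# (both put the pair in the same cluster, or both separate it).
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  a <- as.integer(factor(a))
  b <- as.integer(factor(b))
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  ut <- upper.tri(same_a)          # each unordered pair counted once
  mean(same_a[ut] == same_b[ut])
}

truth <- iris$Species              # stand-in for the reference column
x <- scale(iris[, 1:4])

set.seed(1)
ks <- 2:6
scores <- sapply(ks, function(k) {
  rand_index(kmeans(x, centers = k, nstart = 10)$cluster, truth)
})
names(scores) <- ks

best_k <- ks[which.max(scores)]    # k whose clustering best matches truth
scores
```

The Rand index ignores the cluster numbering, so no relabelling is needed before comparing a clustering with the reference.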

Compare clustering results

  • table(fData(res)$kmeans, fData(res)$specc) - possibly requires a renumbering of clusters.
  • something like plotClust(res, fcol = c("kmeans", "specc")) or even plot2D.
  • more than 2 clusters?
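
The renumbering mentioned in the first item can be sketched in base R as follows. This is a simple heuristic, assumed here for illustration: each cluster of the second clustering is renamed after the reference cluster it overlaps most, so the mapping is not guaranteed to be one-to-one. The helper relabel_to_match and the toy label vectors are hypothetical; an optimal one-to-one matching could instead be obtained by solving an assignment problem on the contingency table, for example with clue::solve_LSAP.

```r
# Rename the clusters of 'other' after the best-overlapping cluster of 'ref'.
relabel_to_match <- function(ref, other) {
  tab <- table(other, ref)                        # rows: other, cols: ref
  map <- colnames(tab)[apply(tab, 1, which.max)]  # best ref label per row
  names(map) <- rownames(tab)
  unname(map[as.character(other)])                # labels returned as character
}

# Toy assignments, standing in for fData(res)$kmeans and fData(res)$specc.
ref   <- c(1, 1, 1, 2, 2, 3, 3, 3)
other <- c(3, 3, 3, 1, 1, 2, 2, 1)

relabelled <- relabel_to_match(ref, other)

# After relabelling, agreements sit on the diagonal of the cross-table.
table(ref, relabelled)
```

For more than two clusterings, one option is to pick a single reference clustering and relabel all the others against it before tabulating.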

References

  • The clue package - tools to compute metrics to validate the quality of a clustering, as well as tools to deal with the comparison of a clustering with a known ground truth.
  • Quick-R Cluster Analysis page.