An implementation of Bisecting KMeans Clustering which is a kind of Hierarchical Clustering algorithm on Spark

yu-iskw/bisecting-kmeans

Bisecting K-Means Clustering

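This is a prototype implementation of Bisecting K-Means Clustering on Spark. Bisecting K-Means combines K-Means with hierarchical clustering: it starts with all points in a single cluster and repeatedly splits one cluster in two with K-Means until the requested number of clusters is reached.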
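To make the idea concrete, here is a minimal sketch of the bisecting loop in plain Scala, without Spark. It is not this repository's implementation: the object name BisectingSketch, the helper names, and the choice to split the cluster with the largest sum of squared errors are all assumptions made for illustration.

```scala
import scala.util.Random

// Illustrative sketch only; names and splitting rule are assumptions,
// not the code used in this repository.
object BisectingSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty set of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(_.sum / ps.size).toVector

  // Sum of squared distances from each point to the cluster mean.
  def sse(ps: Seq[Point]): Double = {
    val c = mean(ps)
    ps.map(sqDist(_, c)).sum
  }

  // Plain 2-means on one cluster; requires at least two distinct points.
  def twoMeans(ps: Seq[Point], maxIterations: Int, rng: Random): (Seq[Point], Seq[Point]) = {
    val seeds = rng.shuffle(ps.distinct).take(2)
    def assign(c0: Point, c1: Point): (Seq[Point], Seq[Point]) =
      ps.partition(p => sqDist(p, c0) <= sqDist(p, c1))
    (1 to maxIterations).foldLeft(assign(seeds(0), seeds(1))) { case ((left, right), _) =>
      val next = assign(mean(left), mean(right))
      // Keep the previous split if an update would empty one half.
      if (next._1.isEmpty || next._2.isEmpty) (left, right) else next
    }
  }

  // Start with one cluster and bisect until k clusters exist
  // or no remaining cluster can be split.
  def bisect(points: Seq[Point], k: Int, maxIterations: Int, seed: Long): Seq[Seq[Point]] = {
    val rng = new Random(seed)
    var clusters = Vector(points)
    var splittable = true
    while (clusters.size < k && splittable) {
      val candidates = clusters.filter(_.distinct.size >= 2)
      if (candidates.isEmpty) splittable = false
      else {
        val target = candidates.maxBy(sse)
        val (left, right) = twoMeans(target, maxIterations, rng)
        clusters = clusters.filterNot(_ eq target) ++ Vector(left, right)
      }
    }
    clusters
  }
}
```

Because each split is recorded as one parent producing two children, the sequence of splits forms a binary tree, which is where the hierarchical view of the result comes from.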

Scala API

This section describes the Scala API of Bisecting K-Means Clustering. BisectingKMeans is the class that trains a BisectingKMeansModel. You can train a model with the BisectingKMeans.train method, or configure an instance and call run on it as in the example below. The class has the following parameters:

  • setK: the desired number of clusters
  • setMaxIterations: the maximum number of iterations at each bisecting step
  • setSeed: the random seed
import org.apache.spark.mllib.bisectingkmeans.{BisectingKMeans, BisectingKMeansModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Prepare the input data: 100 three-dimensional points in 5 groups.
// sc is an existing SparkContext (for example, the one provided by spark-shell).
val localData = (1 to 100).map { i =>
  val label = i % 5
  val vector = Vectors.dense(label, label, label)
  (label, vector)
}
val data = sc.parallelize(localData.map(_._2))

// Create and configure the algorithm
val algo = new BisectingKMeans()
  .setK(5)
  .setMaxIterations(20)
  .setSeed(1)

// Train a model
val model = algo.run(data)

// Get the trained cluster centers
val centers: Array[Vector] = model.getCenters

// Compute the Within Set Sum of Squared Errors (WSSSE)
val cost: Double = model.WSSSE(data)

// Convert the cluster tree into an adjacency list
val list: Array[(Int, Int, Double)] = model.toAdjacencyList

// Convert the cluster tree into a linkage matrix
val matrix: Array[(Int, Int, Double, Int)] = model.toLinkageMatrix
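
For readers unfamiliar with WSSSE, the following plain-Scala sketch shows the usual definition: the sum, over all points, of the squared distance to the nearest cluster center. It assumes model.WSSSE follows this standard definition; the object name WssseSketch and its helpers are hypothetical and not part of this library.

```scala
// Illustrative sketch only; WssseSketch is a hypothetical name,
// and the standard WSSSE definition is assumed.
object WssseSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum over all points of the squared distance to the closest center.
  // Requires at least one center.
  def wssse(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum
}

val points = Seq(Vector(0.0, 0.0), Vector(2.0, 0.0), Vector(10.0, 0.0))
val centers = Seq(Vector(1.0, 0.0), Vector(10.0, 0.0))
val cost = WssseSketch.wssse(points, centers)
```

A lower value means the points sit closer to their assigned centers, so the measure is typically used to compare models trained with different values of k.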

Reference

M. Steinbach, G. Karypis, and V. Kumar. "A comparison of document clustering techniques." Workshop on Text Mining, KDD, 2000.
