K-means clustering is an unsupervised machine learning algorithm used to partition data into a predefined number of groups. It is most effective when the data forms well-separated clusters. The algorithm operates iteratively, assigning each data point to one of K clusters based on the provided features.
- Initialization: Randomly select k cluster centers (centroids) from the data points.
- Assignment step: Assign each data point to its nearest centroid, where "nearest" is determined by a distance metric, typically the Euclidean distance¹.
- Recompute centroids: For each of the k clusters, update the cluster centroid by calculating the mean of all points assigned to that cluster.
- Repeat: Repeat the assignment and update steps until convergence is reached or a predefined number of iterations is completed.
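The steps above can be sketched in Python (a minimal NumPy implementation; the function name and the `init`/`seed` parameters are illustrative, and it assumes no cluster ever ends up empty):

```python
import numpy as np

def kmeans(X, k, max_iters=100, init=None, seed=0):
    """Minimal K-means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    if init is not None:
        centroids = np.asarray(init, dtype=float)
    else:
        # Initialization: pick k distinct data points as starting centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Assignment step: label each point with the index of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Because the initial centroids are chosen randomly, different runs can converge to different local optima; passing `init` makes a run reproducible.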
```mermaid
graph TD;
    A[Initialize k centroids randomly] --> B[Assign each data point to the nearest centroid];
    B --> C[Update centroids to be the mean of the points in the cluster];
    C --> D{Convergence or max iterations reached?};
    D -- No --> B;
    D -- Yes --> E[End];
    style A fill:#59CE3F,stroke:white,stroke-width:2px;
    style B fill:#1683C2,stroke:white,stroke-width:2px;
    style C fill:#D0C43A,stroke:white,stroke-width:2px;
    style D fill:#2EA18A,stroke:white,stroke-width:2px;
    style E fill:#BB3559,stroke:white,stroke-width:2px;
```
Q1. Apply K-means to the data set {2, 3, 8, 10, 15, 18} with k = 2.

Randomly select the initial centroids: say C1 = 3 and C2 = 15.

Assigning data points to clusters:
| Data point | d(x, C1) = d(x, 3) | d(x, C2) = d(x, 15) | Nearest centroid |
|---|---|---|---|
| 2 | 1 | 13 | C1 |
| 3 | 0 | 12 | C1 |
| 8 | 5 | 7 | C1 |
| 10 | 7 | 5 | C2 |
| 15 | 12 | 0 | C2 |
| 18 | 15 | 3 | C2 |
Assignments based on the initial centroids:

- Cluster 1 = {2, 3, 8}
- Cluster 2 = {10, 15, 18}
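The assignment step in the table can be checked with a few lines of Python (variable names are just for illustration; in one dimension the Euclidean distance reduces to an absolute difference):

```python
data = [2, 3, 8, 10, 15, 18]
c1, c2 = 3, 15  # initial centroids from the example

# In 1-D, the Euclidean distance is simply |x - c|
assignments = {x: ("C1" if abs(x - c1) <= abs(x - c2) else "C2") for x in data}
print(assignments)  # {2: 'C1', 3: 'C1', 8: 'C1', 10: 'C2', 15: 'C2', 18: 'C2'}
```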
Recomputing the centroids from the assigned data points:

- $$C_1 = \frac{2 + 3 + 8}{3} = \frac{13}{3} \approx 4.33$$
- $$C_2 = \frac{10 + 15 + 18}{3} = \frac{43}{3} \approx 14.33$$
Reassigning the data points using the updated centroids:

| Data point | d(x, C1) = d(x, 4.33) | d(x, C2) = d(x, 14.33) | Nearest centroid |
|---|---|---|---|
| 2 | 2.33 | 12.33 | C1 |
| 3 | 1.33 | 11.33 | C1 |
| 8 | 3.67 | 6.33 | C1 |
| 10 | 5.67 | 4.33 | C2 |
| 15 | 10.67 | 0.67 | C2 |
| 18 | 13.67 | 3.67 | C2 |
Assignments based on the updated centroids:

- Cluster 1 = {2, 3, 8}
- Cluster 2 = {10, 15, 18}
Recalculating the centroids from the newly assigned data points:

- $$C_1 = \frac{2 + 3 + 8}{3} = \frac{13}{3} \approx 4.33$$
- $$C_2 = \frac{10 + 15 + 18}{3} = \frac{43}{3} \approx 14.33$$
Since the assignments did not change, the algorithm has converged.
Final clusters:

- Cluster 1: {2, 3, 8} with centroid approximately 4.33
- Cluster 2: {10, 15, 18} with centroid approximately 14.33
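The whole worked example can be reproduced with a short loop (a plain-Python sketch, starting from the same initial centroids C1 = 3 and C2 = 15):

```python
from statistics import mean

data = [2, 3, 8, 10, 15, 18]
c1, c2 = 3.0, 15.0  # initial centroids from the example

while True:
    # Assignment step: nearest centroid by 1-D Euclidean (absolute) distance
    cluster1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
    cluster2 = [x for x in data if abs(x - c1) > abs(x - c2)]
    # Update step: each centroid becomes its cluster mean
    new_c1, new_c2 = mean(cluster1), mean(cluster2)
    if (new_c1, new_c2) == (c1, c2):  # centroids stable -> converged
        break
    c1, c2 = new_c1, new_c2

print(cluster1, round(c1, 2))  # [2, 3, 8] 4.33
print(cluster2, round(c2, 2))  # [10, 15, 18] 14.33
```

As in the hand calculation, the loop performs one update (from 3 and 15 to 13/3 and 43/3) and then stops because the assignments no longer change.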
Footnotes

1. Euclidean distance:

   $$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$

   where:

   - $d(x, c)$: the Euclidean distance between the two points $x$ and $c$.
   - $x$: a data point in $n$-dimensional space, i.e. $x = \{x_1, x_2, x_3, \ldots, x_n\}$.
   - $c$: a centroid, or another data point that serves as a reference point, in the same $n$-dimensional space; similar to $x$, $c = \{c_1, c_2, c_3, \ldots, c_n\}$.
   - $i$: an index ranging from $1$ to $n$, denoting each component (feature) of $x$ and $c$.
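The formula translates directly into code (a small sketch; the function name is illustrative):

```python
import math

def euclidean(x, c):
    """d(x, c) = sqrt(sum_i (x_i - c_i)^2) for two n-dimensional points."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(euclidean([2], [15]))       # 13.0 -- matches d(2, C2) in the first table
```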