K-means clustering algorithm

K-means clustering is an unsupervised machine learning algorithm used to partition data into a predefined number of groups, K. It works best when the groups are well separated. The algorithm iteratively assigns each data point to one of the K clusters based on the features provided.

Steps

Initialization: Randomly select k cluster centers (centroids) from the data points.
Assignment Step: Assign each data point to the nearest centroid. "Nearest" is determined by a distance metric, typically the Euclidean distance¹.
Recompute Centroids: For each of the k clusters, update the cluster centroid by calculating the mean of all points assigned to the cluster.
Repeat: Repeat the assignment and update steps until convergence (the assignments no longer change) or until a predefined number of iterations is completed.
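
The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, its parameters (`max_iters`, `seed`), and the choice to leave an empty cluster's centroid unchanged are assumptions made for this example. Points are expected to be tuples of equal length.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=None):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Initialization: pick k data points as the starting centroids.
    centroids = rng.sample(points, k)
    assignments = None

    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Convergence: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return centroids, assignments
```

A call such as `kmeans([(2,), (3,), (8,), (10,), (15,), (18,)], k=2, seed=0)` returns the final centroids and a cluster index per point. Because the initial centroids are chosen at random, different seeds can lead to different final clusters.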

graph TD;  
A[Initialize k centroids randomly] --> B[Assign each data point to the nearest centroid];  
B --> C[Update centroids to be the mean of the points in the cluster];  
C --> B;  
B --> D[Convergence or max iterations reached];  
D --> E[End];  
style A fill:#59CE3F,stroke:white,stroke-width:2px;  
style B fill:#1683C2,stroke:white,stroke-width:2px;  
style C fill:#D0C43A,stroke:white,stroke-width:2px;  
style D fill:#2EA18A,stroke:white,stroke-width:2px;  
style E fill:#BB3559,stroke:white,stroke-width:2px;

Example

Q1. Cluster the one-dimensional data set {2, 3, 8, 10, 15, 18} using k-means.
Suppose k = 2.
Let's randomly select the initial centroids, say C₁ = 3 and C₂ = 15.

Assigning data points to clusters:

| Data point | d(x, C₁) = d(x, 3) | d(x, C₂) = d(x, 15) | Nearest centroid |
|---|---|---|---|
| 2 | 1 | 13 | C₁ |
| 3 | 0 | 12 | C₁ |
| 8 | 5 | 7 | C₁ |
| 10 | 7 | 5 | C₂ |
| 15 | 12 | 0 | C₂ |
| 18 | 15 | 3 | C₂ |
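
The table above can be reproduced with a few lines of Python. This is a small sketch for the one-dimensional case only, where the distance reduces to an absolute difference; the helper name `nearest` is an assumption made for illustration.

```python
points = [2, 3, 8, 10, 15, 18]
centroids = [3, 15]  # initial C1 and C2 from the example


def nearest(x, centroids):
    """Return the index of the centroid closest to x (1-D distance)."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))


assignments = [nearest(x, centroids) for x in points]

# Print one row per data point: both distances and the chosen centroid.
for x, a in zip(points, assignments):
    d1, d2 = abs(x - centroids[0]), abs(x - centroids[1])
    print(f"{x:>2}  d1={d1:>2}  d2={d2:>2}  -> C{a + 1}")
```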

Assignments based on initial centroids:

Cluster 1= {2, 3, 8}
Cluster 2= {10, 15, 18}

Calculating the centroids from the assigned data points:

$$C_1 = \frac{2 + 3 + 8}{3} = \frac{13}{3} \approx 4.33$$
$$C_2 = \frac{10 + 15 + 18}{3} = \frac{43}{3} \approx 14.33$$
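
The same update can be checked numerically. The snippet below is a sketch that uses the standard-library `statistics.mean`; the dictionary layout is just one convenient way to hold the clusters.

```python
from statistics import mean

# Cluster memberships after the first assignment step.
clusters = {"C1": [2, 3, 8], "C2": [10, 15, 18]}

# Each new centroid is the mean of the points in its cluster.
centroids = {name: mean(members) for name, members in clusters.items()}

print({name: round(c, 2) for name, c in centroids.items()})
# {'C1': 4.33, 'C2': 14.33}
```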

Assigning data points according to the updated centroids:

| Data point | d(x, C₁) = d(x, 4.33) | d(x, C₂) = d(x, 14.33) | Nearest centroid |
|---|---|---|---|
| 2 | 2.33 | 12.33 | C₁ |
| 3 | 1.33 | 11.33 | C₁ |
| 8 | 3.67 | 6.33 | C₁ |
| 10 | 5.67 | 4.33 | C₂ |
| 15 | 10.67 | 0.67 | C₂ |
| 18 | 13.67 | 3.67 | C₂ |

Assignments based on updated centroids:

Cluster 1= {2, 3, 8}
Cluster 2= {10, 15, 18}

Recalculating the centroids from the newly assigned data points:
$$C_1 = \frac{2 + 3 + 8}{3} = \frac{13}{3} \approx 4.33$$
$$C_2 = \frac{10 + 15 + 18}{3} = \frac{43}{3} \approx 14.33$$

Since the assignments did not change, the algorithm has converged.

Final clusters:

Cluster 1: {2,3,8} with centroid approximately 4.33
Cluster 2: {10,15,18} with centroid approximately 14.33
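
To confirm the whole worked example, the loop below starts from the fixed centroids 3 and 15 instead of a random choice and reports how many assignment passes were needed. It is a sketch for one-dimensional data only; the function name `kmeans_1d` and its return format are assumptions made for this example.

```python
def kmeans_1d(points, centroids, max_iters=100):
    """Run k-means on 1-D data from the given starting centroids."""
    centroids = list(centroids)
    assignments = None
    for iteration in range(1, max_iters + 1):
        # Assignment step using the absolute difference as the distance.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            for x in points
        ]
        if new_assignments == assignments:
            break  # assignments unchanged, so the algorithm has converged
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, assignments, iteration


centroids, assignments, passes = kmeans_1d([2, 3, 8, 10, 15, 18], [3, 15])
print([round(c, 2) for c in centroids], assignments, passes)
# [4.33, 14.33] [0, 0, 0, 1, 1, 1] 2
```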

Footnotes

Euclidean Distance Formula:
$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$

where:
- d(x, c): the Euclidean distance between the two points x and c.
- x: a data point in n-dimensional space, x = (x₁, x₂, x₃, ..., x_n).
- c: a centroid, or another data point serving as a reference point, in the same n-dimensional space, c = (c₁, c₂, c₃, ..., c_n).
- i: an index running from 1 to n over the components (features) of x and c.
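
As a quick illustration, the formula translates directly into Python. The function name `euclidean_distance` is chosen for this sketch; in Python 3.8 and later, the standard library's `math.dist` computes the same quantity.

```python
import math


def euclidean_distance(x, c):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(c):
        raise ValueError("x and c must have the same number of components")
    # Sum the squared component-wise differences, then take the square root.
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(euclidean_distance((8,), (3,)))      # 5.0, i.e. |8 - 3| in one dimension
```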