Skip to content

Latest commit

 

History

History
93 lines (65 loc) · 4 KB

K-mean cluster.md

File metadata and controls

93 lines (65 loc) · 4 KB

K-means clustering algorithm

K-means clustering is a Unsupervised Machine learning algorithm, used for clustering data into a predefined number of groups. It is effective when dealing with separated data. This algorithm operates iteratively to assign each data point to one of K-clusters based on the features that are provided.

Steps

  1. Initialization: Randomly select k cluster centers (centroids) from the data points.

  2. Assignment Step: Assign each data point to nearest centroid. The nearest is determined using a distance metric, typically the Euclidean Distance1.

  3. Recompute Centroids: For each of the k clusters, update the cluster centroid by calculating the mean of all points assigned to the cluster.

  4. Repeat: Repeat the Assignment and update steps until convergence is reached or a predefined number of iterations is completed.

graph TD;  
A[Initialize k centroids randomly] --> B[Assign each data point to the nearest centroid];  
B --> C[Update centroids to be the mean of the points in the cluster];  
C --> B;  
B --> D[Convergence or max iterations reached];  
D --> E[End];  
style A fill:#59CE3F,stroke:white,stroke-width:2px;  
style B fill:#1683C2,stroke:white,stroke-width:2px;  
style C fill:#D0C43A,stroke:white,stroke-width:2px;  
style D fill:#2EA18A,stroke:white,stroke-width:2px;  
style E fill:#BB3559,stroke:white,stroke-width:2px;  
Loading

Example

Q1. {2, 3, 8, 10, 15, 18}
Suppose we have k = 2,
Let's randomly select initial centroids. Let's say we have C1 = 3 and C2 = 15.

Assigning data points to clusters:

Data Points d(x, C1) = d(x,3) d(x, C2) = d(x,15) Nearest centroid
2 1 13 C1
3 0 12 C1
8 5 7 C1
10 7 5 C2
15 12 0 C2
18 15 3 C2

Assignments based on initial centroids:

  • Cluster 1= {2, 3, 8}
  • Cluster 2= {10, 15, 18}

calculating the centroids according to Assigned data points:

  • $$C_1 = \frac{2 + 3 + 8}{3} = \frac{13}{3} \approx 4.33$$

  • $$C_2 = \frac{10 + 15 + 18}{3} = \frac{43}{3} \approx 14.33$$

Assigning data points according to updated clusters

Data Points d(x, C1) = d(x,4.33) d(x, C2) = d(x,14.33) Nearest centroid
2 2.33 12.33 C1
3 1.33 11.33 C1
8 3.67 6.33 C1
10 5.67 4.33 C2
15 10.67 0.67 C2
18 13.67 3.67 C2

Assignments based on updated centroids:

  • Cluster 1= {2, 3, 8}

  • Cluster 2= {10, 15, 18}

    Re-calculating the centroid based on new assigned datapoint:

  • $$C_1 = \frac{2 + 3 + 8}{3} = \frac{13}{3} \approx 4.33$$

  • $$C_2 = \frac{10 + 15 + 18}{3} = \frac{43}{3} \approx 14.33$$

Since the assignments did not change, the algorithm has converged.

Final Cluster:

  • Cluster 1: {2,3,8} with centroid approximately 4.33

  • Cluster 2: {10,15,18} with centroid approximately 14.33

Footnotes

  1. Euclidean Distance Formula:
    d(x,c) = √(∑ni=1 (xi - ci)2)

    where:

    • d(x,c): This denotes the Euclidean distance between two points x and c.
    • x: represents a data point in n-dimensional space. In other words, x = {x1, x2, x3,....,xn}
    • c: Represents a centroid or another data point that serves as a reference point in the same n-dimensional space. Similar to x, c = {c1, c2, c3,....,cn}
    • i: This is an index that ranges from 1 to n, representing each component or feature of the data points x and c.