Large-Scale-Clustering-on-StackOverflow-Data

Objective

The main objective of this project is to implement the K-means clustering algorithm using Python and Spark on HDFS. The following goals are achieved through clustering:

- Clustering is applied to the user base data to group similar users by their skills. Skills are quantified using appropriate features.

- K-means is also used to form homogeneous clusters of posts by their popularity, which is determined from suitable features.

- The elbow method is used to identify the optimal number of clusters, and machine learning techniques such as normalization and one-hot representation are implemented without using the MLlib library.

- The results are discussed and justified, and the performance of the algorithm is evaluated based on its output.

- Lastly, the MLlib library is used to obtain the outputs for both cases, and the results are compared.
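
The steps listed above can be sketched on a single machine in plain Python. This is only an illustration under stated assumptions, not the project's Spark code: the function names, the choice of min-max scaling as the normalization, the fixed iteration cap, and the toy data are all hypothetical. It shows one-hot representation, normalization, Lloyd's K-means iteration, and the within-cluster sum of squares (WCSS) that the elbow method inspects across values of k.

```python
import random

def min_max_normalize(rows):
    """Scale each column to [0, 1] (assumed scheme); constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]

def one_hot(values):
    """Map each categorical value to a 0/1 vector over the sorted categories."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm with random initial centroids; returns (centroids, wcss)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    wcss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, wcss

# Hypothetical toy data: two numeric features per row, two obvious groups.
raw = [[1, 10], [2, 12], [1, 11], [50, 400], [52, 410], [51, 405]]
points = min_max_normalize(raw)

# Elbow method: print WCSS for each k and look for where the drop levels off.
for k in range(1, 5):
    _, wcss = kmeans(points, k)
    print(k, round(wcss, 4))
```

In a Spark setting the assignment and update steps would typically be expressed as map and reduce operations over an RDD, but the arithmetic per point is the same as in this sketch.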

For the detailed report, refer to Kmeans.pdf.