Large-Scale-Clustering-on-StackOverflow-Data

Objective

The main objective of this project is to implement the K-means clustering algorithm using Python and Spark on HDFS. The following goals are achieved through clustering:

-Clustering is applied to user data to group similar users on the basis of their skills, which are quantified using appropriate features.

-K-means is also used to form homogeneous clusters of posts on the basis of their popularity, which is determined from suitable features.

-The elbow method is used to identify the optimal number of clusters, and preprocessing techniques such as normalization and one-hot encoding are implemented without using the MLlib library.

-The results are discussed and justified, and the performance of the algorithm is evaluated based on its output.

-Lastly, the MLlib library is used to obtain the outputs for both cases, and the results are compared.
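
The preprocessing and elbow-method steps listed above can be illustrated with a minimal sketch in plain Python. This is not the project's Spark code: it runs on in-memory lists rather than RDDs on HDFS, and every function name here (`min_max_normalize`, `one_hot`, `kmeans`, `elbow_curve`) is hypothetical. It assumes min-max scaling as the normalization and Lloyd's algorithm with random initial centroids; the report may use different choices.

```python
import random


def min_max_normalize(rows):
    """Rescale each column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / span if span else 0.0
         for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def one_hot(values):
    """Encode each categorical value as a 0/1 vector over the sorted distinct values."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [
        [1.0 if index[v] == i else 0.0 for i in range(len(categories))]
        for v in values
    ]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random distinct starting points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss


def elbow_curve(points, max_k):
    """Within-cluster sum of squares for k = 1..max_k; the 'elbow' is where it flattens."""
    return {k: kmeans(points, k)[1] for k in range(1, max_k + 1)}
```

Plotting the values returned by `elbow_curve` against k and picking the point where the curve stops dropping sharply is the usual way to read off a candidate number of clusters.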

For a detailed report, refer to Kmeans.pdf.
