An experiment to identify and group similar topics from Finnish newspaper articles by combining Latent Semantic Analysis with the HDBSCAN clustering algorithm.
- HDBSCAN - Clustering algorithm
- Pandas - Data structuring and analysis
- Scikit-learn - Machine learning tools for data mining and data analysis
- NLTK - Natural language toolkit
- HFST - Helsinki Finite-State Transducer toolkit Python bindings
- Flask - Web framework
- Marshmallow - Object serialization
- Bokeh - Visualization library
Get a compiled omorfi.describe.hfst model or build it yourself. Then:
pipenv install
- Get some articles. The data fields used are title, ingress, body, and datetime.
- Load and preprocess the data with Pandas.
- Give weights to words using the TF-IDF (term frequency–inverse document frequency) algorithm. Use the Helsinki Finite-State Transducer toolkit with the Omorfi lemmatization model for word lemmatization. Stem words using the NLTK SnowballStemmer.
- Combine TF-IDF with Latent Semantic Analysis for matrix dimensionality reduction.
- Identify natural groupings of the documents using the density-based clustering algorithm HDBSCAN.
- After clustering, create topics for each cluster using a topic modeling technique called Latent Dirichlet Allocation.
- Visualize the clusters with Bokeh.
- Serialize the generated result object with Marshmallow.
- Use Flask to serve the visualization and the JSON result.
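The loading and preprocessing step in the list above could look roughly like the sketch below. Only the field names (title, ingress, body, datetime) come from this README; the sample records, the `preprocess` helper and the derived `text` column are hypothetical and may differ from what the project actually does.

```python
import pandas as pd

# Hypothetical sample records; only the field names come from this README.
records = [
    {"title": "Vaalit lähestyvät", "ingress": "Ehdokkaat kiertävät maata.",
     "body": "Puolueet kampanjoivat ympäri Suomea.",
     "datetime": "2018-03-01T08:00:00"},
    {"title": "Jääkiekon MM-kisat", "ingress": None,
     "body": "Leijonat voitti avausottelun.",
     "datetime": "2018-05-04T19:30:00"},
    {"title": None, "ingress": None, "body": None,
     "datetime": "2018-05-05T10:00:00"},
]


def preprocess(records):
    """Build one lowercase text column per article and drop empty articles."""
    df = pd.DataFrame(records)
    text_cols = ["title", "ingress", "body"]
    # Missing text fields become empty strings so they can be concatenated.
    df[text_cols] = df[text_cols].fillna("")
    df["datetime"] = pd.to_datetime(df["datetime"])
    combined = df["title"] + " " + df["ingress"] + " " + df["body"]
    # Collapse repeated whitespace and lowercase the combined text.
    df["text"] = combined.str.split().str.join(" ").str.lower()
    # Articles without any text cannot be vectorized, so they are removed.
    return df[df["text"] != ""].reset_index(drop=True)


articles = preprocess(records)
print(articles[["datetime", "text"]])
```

The resulting `text` column is the kind of input the later lemmatization and TF-IDF steps would consume.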
Unsupervised Learning is a class of Machine Learning techniques for finding patterns in data. The data given to an unsupervised algorithm is not labelled, which means only the input variables are given, with no corresponding output variables. In unsupervised learning, the algorithms are left to themselves to discover interesting structures in the data. Source
Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields. Source
HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters. Source
TF-IDF, which stands for term frequency–inverse document frequency, is a scoring measure widely used in information retrieval or summarization. TF-IDF is intended to reflect how relevant a term is in a given document. Source
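To make the scoring concrete, here is a minimal sketch of one common TF-IDF variant: term frequency as a share of the document's tokens, multiplied by the unsmoothed logarithm of the inverse document frequency. The `tf_idf` function and the lemmatized sample tokens are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # tf = share of the document's tokens, idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already lemmatized documents.
docs = [
    ["hallitus", "esittää", "budjetti"],
    ["hallitus", "kaatua"],
    ["joukkue", "voittaa", "ottelu"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "budjetti" receives a higher weight than "hallitus" because "hallitus" also appears in another document and is therefore less distinctive.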
Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. Source
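In practice, LSA is often implemented as a truncated singular value decomposition of the TF-IDF matrix. The sketch below shows that approach with scikit-learn. The sample texts, the choice of two components and the final normalization step are assumptions for illustration, not necessarily the settings used in this project.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical, already lemmatized documents on two different subjects.
texts = [
    "hallitus budjetti vero",
    "hallitus vero eduskunta",
    "joukkue ottelu maali",
    "joukkue maali voitto",
]

lsa = make_pipeline(
    TfidfVectorizer(),                             # documents -> sparse TF-IDF matrix
    TruncatedSVD(n_components=2, random_state=0),  # project onto 2 latent dimensions
    Normalizer(copy=False),                        # unit-length rows
)
reduced = lsa.fit_transform(texts)

# With unit-length rows, the dot product is the cosine similarity.
same_subject = float(reduced[0] @ reduced[1])
different_subject = float(reduced[0] @ reduced[2])
print(reduced.shape, same_subject, different_subject)
```

The dense, low-dimensional rows in `reduced` are the kind of input a clustering algorithm can work with more easily than the original sparse matrix.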
The Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Source