A topic modeller that uses Latent Dirichlet Allocation (LDA) to identify clusters (or topics) of key-phrases within a collection of documents. This technique is especially useful for tasks like document classification and article recommendation.
The following link offers a concise, intuitive explanation of what LDA achieves and the math going on under the hood: https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158
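For a concrete feel of what LDA produces, here is a minimal, self-contained sketch using gensim (the same library this project depends on). The toy corpus and topic count are illustrative only, not this repo's actual pipeline:

```python
from gensim import corpora, models

# Toy corpus: each document is already tokenized.
docs = [
    ["stocks", "market", "shares", "trading"],
    ["game", "team", "season", "coach"],
    ["market", "economy", "trading", "stocks"],
    ["coach", "team", "game", "players"],
]

dictionary = corpora.Dictionary(docs)               # token -> integer id
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

# LDA discovers num_topics latent topics, each a weighted mix of key-phrases.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)
```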
- python 3+
- numpy
- pandas
- gensim==3.4.0
- nltk==3.3
- glob2==0.6
- autocorrect==0.3.0
- 8888 NYT news articles: https://www.kaggle.com/nzalake52/new-york-times-articles
- Extract files.zip; it contains the dataset as well as empty directories for storing dictionaries and models.
- run.py --n_topics -> creates the database and runs the topic modeller
- view_topics.py --n_topics --gram -> views the learned topics; --gram can be "unigram" or "both"
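Once run.py has finished, the saved dictionary and model can also be inspected directly with gensim. A minimal sketch, assuming hypothetical file names (substitute whatever paths run.py actually writes into the extracted directories):

```python
from gensim import corpora, models

# Hypothetical paths; use the files run.py saved for you.
dictionary = corpora.Dictionary.load("dictionaries/unigram.dict")
lda = models.LdaModel.load("models/unigram_lda.model")

# Print the top terms of every discovered topic.
for topic_id, terms in lda.show_topics(num_topics=-1, formatted=False):
    print(topic_id, [term for term, weight in terms])
```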
- I have tried my best to pre-process the documents (e.g. removing stopwords), but I encourage further tweaking in that respect (a sketch of such a pipeline appears after this list).
- contractions.py holds the list of words/phrases to weed out.
- I have designed the program to generate two models: one that uses only unigrams, and one that uses both unigrams and bigrams (see the bigram sketch below).
- Although the number of topics is an arbitrary hyper-parameter, there are a number of heuristics for choosing an optimal value (one such heuristic is sketched below): https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
- I have skipped over that part but might incorporate it in a future update.
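As a reference point for further tweaking, here is a minimal sketch of the kind of pre-processing pipeline described above. The repo's own pipeline may differ in its details, and NLTK's stopword and punkt corpora must be downloaded first (nltk.download("stopwords"), nltk.download("punkt")):

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    """Lower-case, strip non-letters, tokenize, and drop stopwords/short tokens."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

print(preprocess("The stock market rallied by 3% on Tuesday."))
# -> ['stock', 'market', 'rallied', 'tuesday']
```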
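The bigram variant can be built on top of the unigram tokens with gensim's Phrases model; this is a sketch of the general technique, with min_count and threshold set artificially low so the toy corpus actually produces a bigram:

```python
from gensim.models.phrases import Phrases, Phraser

docs = [
    ["new", "york", "times", "report"],
    ["new", "york", "city", "council"],
    ["report", "on", "city", "budget"],
]

# Deliberately low thresholds for this tiny example; tune them on real data.
bigram = Phraser(Phrases(docs, min_count=1, threshold=1.0))
print([bigram[doc] for doc in docs])
# "new york" co-occurs often enough to be merged into the single token "new_york"
```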
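For choosing the number of topics, one widely used heuristic (not yet part of this repo, and distinct from the R-based ldatuning metrics linked above) is to sweep over candidate values and keep the one with the best topic-coherence score. A sketch with an illustrative candidate range and toy corpus:

```python
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel

# Tokenized documents (in practice, the full pre-processed corpus).
docs = [
    ["stocks", "market", "trading", "shares"],
    ["team", "season", "coach", "game"],
    ["market", "economy", "stocks", "trading"],
    ["coach", "game", "season", "players"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

for k in (2, 3, 4):
    lda = models.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10)
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
    print(k, round(cm.get_coherence(), 3))  # higher coherence ~ better topic count
```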