A topic modelling model is built using LDA from the gensim library. LDA itself is explained in this YouTube video.
Clone this repository on your laptop or download files by clicking here.
This topic modelling model can be used with any English dataset. For an example of how it can be used for movie recommendation, check out this repository.
The model is trained on a Wikipedia dump containing over 4 million English articles. The dataset can be found here; it is 16.2 GB in size.
Dependencies are listed in requirements.txt. Using a virtual environment is recommended.
pip install -r requirements.txt
The code for preprocessing the dataset is written in create_wiki_corpus.py.
Note: This process takes around 10 hours to complete. The output is a gensim corpus of 34.6 GB, so it is not included in the repository.
The code to train the model is written in the script train_lda_model.py.
The model has been trained via unsupervised learning on the complete dataset of English Wikipedia articles, and is configured to learn 130 topics.
Note: This process takes around 6 hours to complete. The trained model files are already saved here in the Models folder.
The code for checking the topics inside the model can be found in show_model_topics.py.
Run the code to see the topics. Each topic has a numeric id, and the words grouped under a topic are visibly similar to one another.
The model can be improved by tuning the number of topics; the best value depends entirely on the use case.
python load_model.py
This will return the list of topics the model has learned.
GNU GENERAL PUBLIC LICENSE Version 2
Arnav Deep © June 2020. All rights reserved.