A Reddit Flair Detector that detects and classifies the flair of a post on the r/india subreddit using five ML algorithms, namely Naive Bayes, Linear Support Vector Machine, Logistic Regression, Random Forest, and Multi-Layer Perceptron Classifier.
- `Data`: contains the database instance of the raw data, its CSV, and the resulting data after cleaning and pre-processing.
- `Finalized_Model`: contains the finalized ML model, which gave the maximum accuracy during testing.
- `Scripts`: contains the files used pre-deployment, that is, the code used for scraping Reddit posts and training the models.
- `Project_Reddit_Flair.ipynb`: the Jupyter Notebook used to collect r/india data, pre-process it, train the models, and test them using measures including accuracy, precision, recall, F1-score, and support, based on the different flair features.
- `flask_app.py`: the main Python file containing the Flask web application for the Heroku servers.
- `graph.html`, `index.html`, `post.html`, `result.html`: the HTML files required to build the web application.
- `Procfile`: the file required to connect the web app to Heroku using the Heroku CLI.
- `nltk.txt`: lists 'stopwords' to be downloaded from the shell.
- `requirements.txt`: lists all the dependencies required to run the project.
The user enters the URL of the required post. The app takes the URL, extracts various features from it, and predicts the flair by applying the finalized model.
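As a rough illustration of that flow, here is a minimal sketch; the credentials, helper name, and model filename are placeholders, not the exact ones in `flask_app.py`:

```python
import pickle

import praw

# Placeholder credentials; the real app supplies its own praw config.
reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="flair-detector")

# Assumed pickle filename inside Finalized_Model.
with open("Finalized_Model/finalized_model.pkl", "rb") as f:
    model = pickle.load(f)

def predict_flair(url):
    """Fetch the post at `url`, rebuild the combined text feature, and predict."""
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=0)
    comments = " ".join(c.body for c in list(submission.comments)[:10])
    # The best model was trained on Title + Comments + Url combined into one string.
    combined = f"{submission.title} {comments} {submission.url}"
    return model.predict([combined])[0]

print(predict_flair("https://www.reddit.com/r/india/comments/<post_id>/"))
```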
- Open the terminal.
- Clone the repository: `git clone https://github.com/AshuKV/Reddit-Flair-Predictor.git`
- Create a virtual environment with the command `virtualenv -p python3 env`
- Activate the `env` virtual environment: `source env/bin/activate`
- Inside the cloned directory, run `pip install -r requirements.txt`
- Go inside the Web directory and run `python3 flask_app.py`, which will start the server.
- Hit the IP address in a web browser to use the app.
The data was collected using the `praw` library in Python; the code is located in `Scripts`. Only the top ten comments were considered, along with their authors. A total of 50 posts were collected for each of the 12 flairs considered as part of the project.
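A minimal sketch of what such a collection loop could look like with `praw` (the credentials, flair subset, and output path are placeholder assumptions; the actual scraper lives in `Scripts`):

```python
import pandas as pd
import praw

reddit = praw.Reddit(client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                     user_agent="flair-scraper")

flairs = ["AskIndia", "Politics", "Photography"]  # illustrative subset of the 12 flairs
rows = []
for flair in flairs:
    # 50 posts per flair, searched by flair name on r/india.
    for post in reddit.subreddit("india").search(f'flair_name:"{flair}"', limit=50):
        post.comments.replace_more(limit=0)
        # Top ten top-level comments, joined into one string.
        comments = " ".join(c.body for c in list(post.comments)[:10])
        rows.append({"flair": flair, "title": post.title, "comments": comments,
                     "body": post.selftext, "url": post.url,
                     "author": str(post.author), "score": post.score,
                     "id": post.id, "time-created": post.created_utc,
                     "number of comments": post.num_comments})

pd.DataFrame(rows).to_csv("Data/reddit_india_raw.csv", index=False)
```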
The data includes `title`, `comments`, `body`, `url`, `author`, `score`, `id`, `time-created`, and `number of comments`. For `comments`, only top-level comments are present in the dataset; no sub-comments are included. The `title`, `comments`, and `body` were cleaned by removing bad symbols and stopwords using `nltk` (a cleaning sketch follows the feature list below). Five types of standalone features were considered for training, namely:
- `Title`
- `Comments`
- `Urls`
- `Body`
- Combining `Title`, `Comments`, and `Urls` as one feature.
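As an illustration of the cleaning step mentioned above, a minimal sketch (the exact symbol regex and steps in `textcleaning.py` may differ):

```python
import re

from nltk.corpus import stopwords  # fetched via nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))
BAD_SYMBOLS = re.compile(r"[^0-9a-z #+_]")  # assumed definition of "bad symbols"

def clean_text(text):
    """Lowercase, replace bad symbols with spaces, and drop English stopwords."""
    text = BAD_SYMBOLS.sub(" ", text.lower())
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_text("Why is the metro DOWN again?!"))  # -> "metro"
```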
After collecting the data and going through the available literature on Natural Language Processing and various ML classifiers, I came across an article that explained everything from data pre-processing to data analysis for text classification. The data was cleaned using `textcleaning.py` in `Scripts` and saved to a CSV file.
After cleaning the data, the ML algorithms were trained with the dataset split into training and testing sets in a 7:3 ratio (a minimal sketch of the pipeline follows the list). Five ML algorithms were used:
- Naive Bayes
- Linear Support Vector Machine
- Logistic Regression
- Random Forest
- Multi-Layer Perceptron
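A minimal sketch of one such pipeline, assuming a TF-IDF text representation and the column names used above (the notebook's exact vectorizer settings may differ):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("Data/cleaned_data.csv")  # assumed path for the cleaned CSV
# Combined Title + Comments + Url feature, one string per post.
X = df["title"].fillna("") + " " + df["comments"].fillna("") + " " + df["url"].fillna("")
y = df["flair"]

# 7:3 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = Pipeline([("tfidf", TfidfVectorizer()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)

# Accuracy, precision, recall, F1-score, and support per flair.
print(classification_report(y_test, model.predict(X_test)))
```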
Training and testing on the dataset showed that Logistic Regression gave the best accuracy, 66%, when trained on the combined `Title` + `Comments` + `Url` feature. The best model was saved using the `pickle` library and used further by the web app to predict the flair from the URL of a post.
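Continuing from the training sketch above, saving and reloading the best model with `pickle` looks roughly like this (the filename is an assumption):

```python
import pickle

# Persist the best pipeline (vectorizer + classifier) to Finalized_Model.
with open("Finalized_Model/finalized_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, inside the web app:
with open("Finalized_Model/finalized_model.pkl", "rb") as f:
    model = pickle.load(f)
```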
A Flask app was made with three routes: `/`, the home route; `/action_page`, for displaying the predicted flair; and `/stats`, for statistics. It was later deployed to the Heroku servers using the `flask_app.py`, `Procfile`, `.gitignore`, and `requirements.txt` files.
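A minimal sketch of that route structure (handlers abbreviated; `predict_flair` here is a stand-in for the praw + pickled-model logic sketched earlier, not the exact code in `flask_app.py`):

```python
from flask import Flask, render_template, request

app = Flask(__name__)

def predict_flair(url):
    """Stand-in for the real logic: fetch the post with praw and run the
    pickled model, as sketched earlier in this README."""
    return "AskIndia"

@app.route("/")
def home():
    # Home page with the URL input form (index.html).
    return render_template("index.html")

@app.route("/action_page", methods=["POST"])
def action_page():
    # Predict and display the flair for the submitted post URL.
    flair = predict_flair(request.form.get("url"))
    return render_template("result.html", flair=flair)

@app.route("/stats")
def stats():
    # Statistics page (graph.html).
    return render_template("graph.html")

if __name__ == "__main__":
    app.run()
```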