Reddit, a social news aggregation, web content rating, and discussion website, has increasingly sought to moderate its user-contributed content to prevent the spread of misinformation. Its content moderation team has approached us, GA Data Science Solutions, to develop an algorithm that classifies news posts as satirical or non-satirical in order to ease its content moderation workload.
While news satire has an important place in literary history, some consumers may find it hard to recognize the irony and deadpan humour of such journalism in an increasingly chaotic world, especially during the ongoing COVID-19 pandemic. Although satirical news is meant to be harmless, unwary consumers may believe such wholly fictionalized stories and spread them as misinformation. Hence, Reddit has decided to invest in machine learning techniques that recognize such posts and identify the subreddits they belong to, with the overall aim of increasing its reliability as a source of information for the masses.
This project is divided into three notebooks:
- Data Collection
- Data Cleaning & EDA
- Pre-processing and Modelling
The Data Collection notebook will use Pushshift's API to scrape posts from r/TheOnion and r/worldnews and collate them into a dataset for training the satirical news classification model. The dataset will then be cleaned to remove duplicates and NSFW posts and to minimize non-English titles. Exploratory data analysis (EDA) will give us insights into the cleaned datasets, such as frequently occurring words (e.g. onion) that should be treated as stop words before modelling. Lastly, the datasets will be transformed and fitted to different classification models, e.g. Multinomial Naive Bayes and Logistic Regression, and the models will be evaluated on how well they reduce false negatives, where satirical news is incorrectly classified as non-satirical news.
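The modelling step described above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the toy titles, the labels (1 = satire), the parameter grid and the variable names are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy titles; the real project uses scraped subreddit posts.
titles = [
    "Area man bravely finishes entire sandwich",
    "Local cat elected mayor after landslide nap",
    "Nation's dads announce plan to stand in yard",
    "Study finds onion rings technically a vegetable",
    "Government announces new budget for healthcare",
    "Central bank raises interest rates again",
    "Earthquake strikes coastal region overnight",
    "Election results confirmed by national commission",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # assumption: 1 = satire, 0 = non-satire

# Add a corpus-specific stop word ("onion") on top of the built-in English list.
stop_words = list(ENGLISH_STOP_WORDS | {"onion"})

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Small illustrative grid; the values are not the ones tuned in the notebook.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# scoring="recall" makes the search prefer models with fewer false negatives
# for the positive (satire) class.
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=2)
search.fit(titles, labels)

predictions = search.predict(titles)
print(search.best_params_)
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or `LogisticRegression` for `MultinomialNB`, only changes the corresponding pipeline step and its grid entries.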
Count Vectorizer, TF-IDF Vectorizer, Logistic Regression, Multinomial Naive Bayes, GridSearchCV
Python using NumPy, pandas, Matplotlib, Seaborn, Count Vectorizer, TF-IDF Vectorizer, Logistic Regression, Multinomial Naive Bayes, GridSearchCV
Data Cleaning, EDA, Data Visualization, Modelling Techniques, Interpretation of Results to Non-Technical Audience
Based on the summary of scores shown in the notebook, LR_2.3 is the most suitable model for this satirical news classifier. It scores the best on Recall, the evaluation metric that should be optimized in order to reduce false negatives. Optimizing Recall compromises the F1 Score only slightly (a negligible difference of 0.0088), which is still within an acceptable range. Similarly, although LR_2.3 has the second-lowest ROC AUC Score among the models, that score is still within an acceptable range.
Hence, I would recommend LR_2.3 to be used as the satirical news classification model.
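To make the trade-off between Recall and F1 concrete, the sketch below computes both from confusion-matrix counts, treating satire as the positive class. The counts are hypothetical and are not results from the notebook; the function name is also just an illustration.

```python
def classification_scores(tp, fp, fn, tn):
    """Return recall, precision, specificity and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)          # share of satire posts that were caught
    precision = tp / (tp + fp)       # share of flagged posts that really are satire
    specificity = tn / (tn + fp)     # share of genuine news left unflagged
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts: 10 satire posts missed (false negatives), 20 false alarms.
scores = classification_scores(tp=90, fp=20, fn=10, tn=80)
print(scores)
```

Reducing false negatives raises Recall directly, while F1 also depends on Precision, which is why a model tuned for Recall can lose a little F1.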
- K-Nearest Neighbours
- Decision Tree
- Bagged Decision Trees
- Random Forest
- Support Vector Classifier
- Using RandomizedSearchCV in addition to GridSearchCV to explore a larger number of hyperparameter combinations.
- Optimizing a custom metric that weights Recall somewhat more heavily than Specificity.
- Inspecting the ROC curve to find a threshold where Recall is very high and 1 - Specificity (the false positive rate) is fairly low.
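
The last two ideas can be combined in a small sketch: sweep the candidate decision thresholds (the points of the ROC curve) and keep the one that maximizes a weighted mix of Recall and Specificity. The 0.7 weight, the toy labels and probabilities, and the function names are assumptions for illustration; in practice `sklearn.metrics.roc_curve` and `sklearn.metrics.make_scorer` could play the same roles.

```python
def recall_and_specificity(y_true, y_prob, threshold):
    """Compute recall and specificity when predicting satire for prob >= threshold."""
    tp = fn = tn = fp = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    # Assumes both classes are present, so neither denominator is zero.
    return tp / (tp + fn), tn / (tn + fp)


def weighted_score(recall, specificity, recall_weight=0.7):
    """Custom metric: a weighted average that favours recall (weight is an assumption)."""
    return recall_weight * recall + (1 - recall_weight) * specificity


def best_threshold(y_true, y_prob, recall_weight=0.7):
    """Return the observed probability that maximizes the weighted score as a threshold."""
    candidates = sorted(set(y_prob))
    return max(
        candidates,
        key=lambda t: weighted_score(
            *recall_and_specificity(y_true, y_prob, t), recall_weight
        ),
    )


# Hypothetical labels (1 = satire) and predicted satire probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.35, 0.7, 0.4, 0.3, 0.1]

threshold = best_threshold(y_true, y_prob)
recall, specificity = recall_and_specificity(y_true, y_prob, threshold)
print(threshold, recall, specificity)
```

Raising `recall_weight` pushes the chosen threshold lower, trading more false positives for fewer missed satire posts.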
Instead of scraping only the subreddit posts to determine whether they are satirical news, the classification model could also scrape the contents of the linked articles to further determine whether they are satire. This would not only increase the sensitivity of the model, but also ensure that a post's title cannot mislead and bypass the classification model when the linked article is actually a satire piece.