stephanie-kuihg/satirical-news-predictor

Satirical News Classifier

Objective of Classifier

Reddit, a social news aggregation, web content rating, and discussion website, has increasingly sought to moderate its user-contributed content to prevent the spread of misinformation. Its content moderation team has approached us, GA Data Science Solutions, to develop an algorithm that classifies news as satirical or non-satirical in order to ease its content moderation workload.

While news satire has an important place in literary history, some consumers may find it hard to recognize the irony and deadpan humour of such journalism in this increasingly crazy world, especially during the ongoing COVID-19 pandemic. Although satirical news aims to be harmless, unwary consumers may believe these wholly fictionalized stories and spread them as misinformation. Hence, Reddit has decided to invest in machine learning algorithms and techniques to better recognize such posts and identify the subreddits they belong to, with the overall aim of increasing its reliability as a source of information for the masses.

Project Description

This project is divided into three notebooks:

  1. Data Collection
  2. Data Cleaning & EDA
  3. Pre-processing and Modelling

The Data Collection notebook uses Pushshift's API to scrape posts from r/TheOnion and r/worldnews and collates them into a dataset for training the satirical news classification model. The dataset is then cleaned to minimize non-English titles and duplicates, and NSFW posts are removed. Exploratory data analysis (EDA) provides insights into the cleaned datasets, such as the top-occurring words that may be treated as additional stop words (e.g. onion) before modelling. Lastly, the datasets are transformed and fitted to different classification models, e.g. Multinomial Naive Bayes and Logistic Regression, and the models are evaluated primarily on reducing false negatives, where satirical news is incorrectly classified as non-satirical news.
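The cleaning step described above could look roughly like the sketch below. It is not the notebook's code: the `title` and `over_18` keys are assumed to match the fields returned for each scraped post, and `looks_english` is a hypothetical, deliberately crude heuristic (share of ASCII characters) standing in for whatever language filter the notebook applies.

```python
def looks_english(title, min_ascii_ratio=0.9):
    """Crude, assumed heuristic: mostly-ASCII titles are treated as English."""
    if not title:
        return False
    ascii_chars = sum(1 for ch in title if ch.isascii())
    return ascii_chars / len(title) >= min_ascii_ratio


def clean_posts(posts):
    """Drop NSFW posts, non-English titles, and case-insensitive duplicate titles."""
    seen = set()
    cleaned = []
    for post in posts:
        title = (post.get("title") or "").strip()
        if post.get("over_18"):          # assumed NSFW flag on each post
            continue
        if not looks_english(title):
            continue
        key = title.casefold()           # compare titles ignoring case
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**post, "title": title})
    return cleaned


# Invented sample posts, for illustration only.
posts = [
    {"title": "Area Man Wins Argument With Himself", "over_18": False},
    {"title": "area man wins argument with himself", "over_18": False},
    {"title": "Некоторые новости дня", "over_18": False},
    {"title": "Explicit headline", "over_18": True},
    {"title": "Parliament passes new budget", "over_18": False},
]
cleaned = clean_posts(posts)
```

Only the first and last sample posts survive: the second is a duplicate, the third fails the ASCII heuristic, and the fourth is flagged NSFW.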

Methods Used

Count Vectorizer, TF-IDF Vectorizer, Logistic Regression, Multinomial Naive Bayes, GridSearchCV
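As a rough illustration of how these methods fit together, the sketch below chains a `CountVectorizer` and a `LogisticRegression` in a scikit-learn `Pipeline` and tunes them with `GridSearchCV` using `scoring="recall"`, in line with the goal of reducing false negatives. The titles, labels, and parameter grid are invented for illustration and are not the notebook's data or settings; `TfidfVectorizer` or `MultinomialNB` could be swapped into the same pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy titles invented for illustration; 1 = satire, 0 = non-satire.
titles = [
    "Area man wins argument with himself in shower",
    "Local cat elected mayor after promising more naps",
    "Study finds meetings could have been emails all along",
    "Man heroically reads entire terms of service",
    "Nation's dogs demand longer walks in open letter",
    "Scientists confirm Monday is still the worst day",
    "Parliament passes annual budget after long debate",
    "Central bank raises interest rates by a quarter point",
    "Flooding displaces thousands in coastal region",
    "Health ministry reports drop in new infections",
    "Trade talks resume between neighbouring countries",
    "Election commission announces final vote count",
]
labels = [1] * 6 + [0] * 6

# Add a corpus-specific stop word (e.g. "onion") to the built-in English list.
stop_words = list(ENGLISH_STOP_WORDS.union({"onion"}))

pipe = Pipeline([
    ("vec", CountVectorizer(stop_words=stop_words)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Small, assumed grid; the notebook's actual grid may differ.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Select hyperparameters by cross-validated recall on the satire class.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=2)
search.fit(titles, labels)

best_params = search.best_params_
preds = search.predict(["Local man declares himself king of the office fridge"])
```

Placing the vectorizer inside the pipeline means it is refitted on each training fold, so vocabulary from the validation fold does not leak into training.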

Tools

Python with NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn (CountVectorizer, TfidfVectorizer, LogisticRegression, MultinomialNB, GridSearchCV)

Needs of Project

Data Cleaning, EDA, Data Visualization, Modelling Techniques, Interpretation of Results to Non-Technical Audience

Recommendations for the Satirical News Classifier

LR_2.3 is the most suitable model for this satirical news classifier, based on the summary of scores shown in the notebook. It scores best on Recall, the evaluation metric that should be optimized in order to reduce false negatives. Optimizing Recall compromises the F1 Score only slightly (a negligible difference of 0.0088), which is still within an acceptable range. Similarly, although LR_2.3 has the second-lowest ROC AUC Score among all the models, that score is still within an acceptable range.

Hence, I recommend LR_2.3 as the satirical news classification model.
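For readers less familiar with these metrics, the sketch below shows how Recall and the F1 Score follow from confusion-matrix counts, and why Recall is the one tied to false negatives. The counts are made up for illustration and are not LR_2.3's results.

```python
def recall(tp, fn):
    """Share of actual satire posts that the model catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of posts flagged as satire that really are satire."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: satire is the positive class.
tp, fp, fn = 90, 20, 10   # 10 satire posts slipped through as non-satire

model_recall = recall(tp, fn)
model_f1 = f1_score(tp, fp, fn)
```

Every false negative lowers Recall directly, whereas F1 also penalizes false positives, which is why a model tuned for Recall can give up a little F1.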

Further Development

1. Other classification models that can be trained and evaluated:

  • K-Nearest Neighbours
  • Decision Tree
  • Bagged Decision Trees
  • Random Forest
  • Support Vector Classifier

2. Further hyperparameter tuning that can be done:

  • Using RandomizedSearchCV in addition to GridSearchCV to explore a larger number of hyperparameter combinations

3. More in-depth model evaluation metrics that can be done:

  • Optimizing a custom metric that weighs Recall somewhat more heavily than Specificity.
  • Examining the ROC curve to find a threshold where Recall is very high and 1 - Specificity is fairly low.

4. To increase business value:

Instead of scraping only the subreddit posts to determine whether they are satirical news, the classifier can also scrape the contents of the linked articles. This would not only increase the sensitivity of the model, but also ensure that a post's title cannot mislead and bypass the classifier when the linked article is actually a satire piece.
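The evaluation ideas in item 3 can be sketched in plain Python as follows. This is one possible approach, not something taken from the notebooks: `recall_weighted_score` is a hypothetical custom metric (a weighted average of Recall and Specificity), `pick_threshold` scans candidate probability thresholds for the highest Recall whose false positive rate (1 - Specificity) stays under an assumed cap, and the labels and scores are invented.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (recall, false positive rate) when predicting 1 for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def pick_threshold(y_true, scores, max_fpr=0.2):
    """Highest-recall threshold whose 1 - Specificity does not exceed max_fpr."""
    best = None
    for t in sorted(set(scores)):
        tpr, fpr = rates_at_threshold(y_true, scores, t)
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (t, tpr, fpr)
    return best


def recall_weighted_score(tpr, fpr, recall_weight=0.7):
    """Hypothetical custom metric: weighted average of Recall and Specificity."""
    return recall_weight * tpr + (1 - recall_weight) * (1 - fpr)


# Invented labels (1 = satire) and predicted satire probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05, 0.35]

threshold, tpr, fpr = pick_threshold(y_true, scores, max_fpr=0.2)
custom = recall_weighted_score(tpr, fpr)
```

The cap on the false positive rate and the 0.7 weight are arbitrary choices here; in practice they would be set with the moderation team, based on how costly a missed satire post is compared with a wrongly flagged news post.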
