Kaylah Thomas, Sruti Kanthan, and Angelica Bosko
Tracking key topics and sentiment towards women during a 12-month period on the subreddit r/TheRedPill.
Reddit spaces can house large communities where like-minded people engage in discussion about relevant or interesting topics. On rare occasions, these communities can harbor prejudiced and violent opinions about groups of people. The subreddit r/TheRedPill is a place where many users, particularly men, share an overwhelmingly negative sentiment about women. In our research, we analyze trends in discussion about women within r/TheRedPill, as well as changes in sentiment over time compared to subreddits with generally neutral or overwhelmingly positive discussion about women. This may offer some insight into how dangerous groupthink can harm marginalized communities both online and offline.
In this project, we use text data gathered from the comment sections of posts in the subreddit r/TheRedPill in 2016. To gather this text data, we first download a database dump (pushshift.io) containing all Reddit comments from a particular month in 2016 and filter for the subreddits we chose to examine: r/TheRedPill, r/Feminism, and r/technews. Each monthly raw database contains between 6GB and 8GB of data.
For our analysis, we wanted to use information from the comment section of three different subreddits in 2016: r/TheRedPill, r/Feminism, and r/technews. We used the subreddit r/TheRedPill to track the overall sentiment and topics discussed on the subreddit month-by-month in 2016. We used the subreddit r/Feminism as a secondary source to understand sentiment and topics discussed in a subreddit with opposing ideology from r/TheRedPill. We also decided to use r/technews as a good "control" subreddit. We believe that r/technews should contain mostly neutral sentiment, as opposed to the sentiment of either r/TheRedPill or r/Feminism.
In order to retrieve month-by-month comment data for each subreddit in 2016, we downloaded the 12 files containing all Reddit comments for each month of the year. The 12 monthly files (located at https://files.pushshift.io/reddit/comments/) were downloaded to our local machines, where we pre-processed the data to keep only the comment text and timestamp for comments in our target subreddits. Each monthly raw data file for 2016 ranged between 6GB and 8GB of data.
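The filtering step can be sketched as follows. The field names (`subreddit`, `created_utc`, `body`) follow the pushshift comment schema; the function names, and the assumption that the monthly dumps (e.g. RC_2016-01.bz2) are bz2-compressed newline-delimited JSON streamed line by line, are ours:

```python
import bz2
import json

TARGET_SUBREDDITS = {"TheRedPill", "Feminism", "technews"}

def filter_comments(lines, targets=TARGET_SUBREDDITS):
    """Yield (subreddit, created_utc, body) for comments in the target subreddits.

    Each line is one JSON object following the pushshift comment schema.
    """
    for line in lines:
        comment = json.loads(line)
        if comment.get("subreddit") in targets:
            yield (comment["subreddit"], comment["created_utc"], comment["body"])

def filter_monthly_dump(path):
    """Stream one monthly dump (e.g. RC_2016-01.bz2) without loading it into memory."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return list(filter_comments(f))
```

Streaming the file line by line is what makes the 6-8GB dumps tractable on a local machine: only one comment is ever held in memory at a time.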
In the data_collection.ipynb file, we include information about how to parse data files from pushshift.io for selected subreddits using PyWren. Due to the large size of these files and hardware limitations, we were unable to use PyWren successfully, and consequently used our local computers to pickle the necessary data. The data_collection.ipynb file also includes code for running each of the raw data files locally. After successfully pickling each of the individual monthly files, we store all the comments in one corpus, which can be seen at the end of the notebook. After combining all the individual pickle files, the final file (comments_corpus_final.pickle) was 182.6 MB in size. The joined pickle file is located on Google Drive at this link:
https://drive.google.com/drive/u/0/folders/1kgnMtWss9kZBtJvI6wyEao8vdLCnv48W
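The per-month pickling and the final join can be sketched as below; the function names are our own, and the comment rows are assumed to be the (subreddit, created_utc, body) tuples produced by the filtering step:

```python
import pickle

def pickle_month(comments, path):
    """Save one month's filtered comments to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(comments, f)

def join_pickles(monthly_paths, out_path):
    """Concatenate the monthly pickles into a single corpus pickle."""
    corpus = []
    for path in monthly_paths:
        with open(path, "rb") as f:
            corpus.extend(pickle.load(f))
    with open(out_path, "wb") as f:
        pickle.dump(corpus, f)
    return corpus
```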
In order to run both topic analysis and sentiment analysis, the data had to be cleaned appropriately. We used Dask to parallelize the data cleaning process. To use Dask appropriately, we had to create an EMR cluster with 6 m5.xlarge instances to handle the data size.
In the reddit_data_cleaning_dask.ipynb notebook, we first installed the packages necessary to clean our files. The installation for these packages is as follows:
! pip install nltk
! pip install spacy
! pip install dask
! pip install graphviz
! pip install dask[complete]
We then imported all of the necessary packages for data cleaning. The imported packages are as follows:
import dask
import pickle
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import nltk
import pandas as pd
import spacy
import time
After importing the necessary packages, we also downloaded 'stopwords', 'wordnet', and 'punkt' for the data cleaning. When starting the cluster, we requested 8 workers with 1 core each and 4 GiB of memory.
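The worker request can be reproduced with a sketch like the following; here `Client` starts a local scheduler as a stand-in for the EMR cluster (on EMR, `Client` would instead be pointed at the cluster's scheduler address), with the same worker shape we requested:

```python
from dask.distributed import Client

# Local stand-in for the EMR cluster: 8 workers, 1 thread each, 4 GiB memory limit.
client = Client(n_workers=8, threads_per_worker=1, memory_limit="4GiB")
```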
The first function, clean_comments, cleans the comments by removing unwanted characters and unnecessary whitespace (\n). Our second function, remove_stopwords, removes from the comments all the stop words provided in the nltk package. Our third function, lemmatize, reduces each word to its dictionary base form (lemma), so that related word forms are grouped together. Our fourth function, clean_text, runs all of these functions and stores the result in a dataframe.
After the fourth function, Dask runs the clean_text function to parallelize the cleaning process. The final product, a dataframe, can be seen within the notebook. The parallelized workflow partitioned the dataframe into twelve partitions and ran the aforementioned functions on each portion of the Dask dataframe.
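The four functions can be sketched as below. This is a simplified, self-contained version: the stop-word set and lemmatizer are passed in as parameters (in our notebook they come from nltk's stopwords corpus and WordNetLemmatizer), so the exact bodies differ from the notebook:

```python
import re
import string

def clean_comments(text):
    """Lowercase, strip newlines and extra whitespace, and drop punctuation."""
    text = text.lower().replace("\n", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def remove_stopwords(text, stop_words):
    """Drop stop words, e.g. set(nltk.corpus.stopwords.words('english'))."""
    return " ".join(w for w in text.split() if w not in stop_words)

def lemmatize(text, lemmatizer):
    """Reduce each word to its base form, e.g. WordNetLemmatizer().lemmatize."""
    return " ".join(lemmatizer(w) for w in text.split())

def clean_text(text, stop_words, lemmatizer):
    """Run the full cleaning pipeline on one comment."""
    return lemmatize(remove_stopwords(clean_comments(text), stop_words), lemmatizer)
```

In the notebook, clean_text is then mapped over the partitions of the Dask dataframe, which is what yields the twelve parallel tasks.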
Due to the size of the files, it is possible that the system may run out of memory. Therefore, we created a separate notebook containing code for running the data cleaning on a local machine. This code uses a similar approach to Dask, without using AWS.
Word2Vec is an algorithm used in natural language processing. Specifically, the algorithm uses neural networks to learn word associations from a large amount of text. This algorithm fits our corpus well, given that we are working with a large amount of data (84GB). We also chose Word2Vec because it converts each distinct word into a vector and allows easy access to word synonyms, letting us analyze relationships between different words in the vocabulary. In this notebook, we tokenized our lemmatized data in PySpark and fed it to our Word2Vec model, which transformed our text into numeric vectors for analysis. We then took a list of salient keywords used frequently within the red pill forum (girl, woman, feminist, stacy, chad, becky, beta, love, wish, deserve) and searched for their synonyms, i.e. the words with the highest similarity scores to each keyword. We derived the ten most similar words for each keyword in our list.
In this project, we use sentiment analysis on the comments from our target subreddits in order to better understand the valence of each comment. We also want to track sentiment over a 12-month period, gauging whether sentiment has gotten more or less negative over time. By understanding change in sentiment, we can better understand dangerous groupthink and the threat it poses to marginalized communities.
To perform our sentiment analysis on Spark, we used the VADER library, which is especially suited to social media data. We calculated a "compound" score for each comment in our data; this score is computed by summing the valence scores of each word in the comment and normalizing the result to lie between -1 (extremely negative) and +1 (extremely positive). This score is a unidimensional measure of sentiment, which we then plotted in Dask as averages across each week of the year for each subreddit of interest.
Topic analysis is similar to sentiment analysis, but focuses on the most frequent topics mentioned rather than the overall sentiment of a text. It is a useful and widely used machine learning technique that lets users assign topics directly to text data, and it can take any unstructured text and process it to surface the most common topics. In our project, we use topic analysis to better understand the main focus of our target subreddits. For example, we want not only to understand the sentiment behind the comments, but also what the comments are most frequently about. For r/TheRedPill, we want to see whether most of the negative sentiment is directed towards women and femininity.
In order to understand the top words used in each target subreddit, we decided to use the "wordcloud" package from PyPI. To install the package, we used the following code:
! pip install wordcloud
The wordcloud package takes the comment data and creates a nice visual representation of the data. In the reddit_word_cloud.ipynb notebook, we separated the overall dataframe into three dataframes by subreddit (r/TheRedPill, r/Feminism, r/technews). After separating by subreddit, we imported the wordcloud library and combined all of the text data in each subreddit into one long string. After, we were able to use the package to generate wordcloud images and save them as png files.
Here, you can view the word cloud images for each subreddit:
r/TheRedPill:
r/Feminism:
r/technews:
For r/TheRedPill, we can see that the most frequent words are "one" and "man". For r/Feminism, the most frequent words are "women" and "people". For r/technews, the most frequent words are "people" and "make".
Overall, these word clouds allow us to understand more about the major topics discussed in each subreddit.
The implications of our method results are as follows:
- Word2vec: The word2vec output tracks with the idea that r/TheRedPill uses vitriolic language to describe women (e.g. the word most linked with “deserve” is “homemaker”; the words most associated with “stacy” are “obedience” and “whorish”).
- Sentiment analysis: When viewed in the context of average compound sentiment over time, we observed that r/technews had the most drastic fluctuations in compound sentiment over time, followed by r/feminism, and r/TheRedPill actually had fairly neutral sentiment over the year. This makes sense because while users on the red pill speak very negatively about women, they also use very positive language to try and embolden and encourage other users.
- Topic analysis: Our wordcloud outputs show that the r/feminism and r/technews subreddits use words like “people” more often than r/TheRedPill, which uses more gender-polarized language.
Our work contributes to the social sciences by shedding light on how users on r/TheRedPill describe women, and how sentiment within this subreddit compares to other subreddits. This topic is often avoided given its contentious nature, and we contribute to the field by using large scale computing frameworks to better parse patterns in the comments that can inform what the movement stands for, and where it’s headed.
- Data Collection (PyWren/Local) (Kaylah, Angelica)
- Text Cleaning (Dask) (Kaylah, Angelica, Sruti)
- Comment Tokenizing and Lemmatizing (Local) (Sruti, Kaylah, Angelica)
- Sentiment Analysis (PySpark) (Sruti)
- Temporal Trends (Dask) (Kaylah)
- Topic Analysis and Word Clouds (PySpark/Local) (Angelica)
Raw Reddit Data Retrieved from Pushshift.io:
https://files.pushshift.io/reddit/comments/
How to extract subreddit data from raw Reddit data (pushshift.io):
Information on Word2Vec:
https://en.wikipedia.org/wiki/Word2vec
Understanding Sentiment Analysis and Word2Vec on Spark:
https://monkeylearn.com/sentiment-analysis/
https://sofiadutta.github.io/datascience-ipynbs/big-data-analytics/Using_MyClassifier_Twitter_Data_Sentiment_Classification_and_Big_Data_Analytics_on_Spark_Dataframe.html