Pulse of the Land tracks geographic areas (states and cities) throughout the United States. It combines sentiment analysis and topic modeling of posts from location-based subreddits on Reddit with demographic characteristics from the census, such as income and population.
Everything in this project is scripted using Python.
- GeoPandas
- Jupyter Notebooks
- PRAW API
- PSAW API
- Google Maps API
- AWS
- EC2
- S3
- Route 53
Data for the sentiment analysis and topic modeling is obtained from city and state subreddits (location-based Reddit forums) throughout the United States via PSAW, a Python wrapper for the Pushshift.io API. Only locations with populations over 50,000 and more than 1,000 subreddit subscribers are included. Subscriber counts and other metadata used for the initial subreddit selection are accessed via the Python Reddit API Wrapper (PRAW).
- 50 states plus the District of Columbia (51 in total)
- 235 cities
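
The selection rule above can be sketched as a simple filter. This is only an illustration: the field names (`population`, `subscribers`) and the sample records are hypothetical and not taken from the project.

```python
# Thresholds as stated in the text; constant names are illustrative.
MIN_POPULATION = 50_000
MIN_SUBSCRIBERS = 1_000


def is_eligible(location):
    """Return True when a location exceeds both thresholds."""
    return (
        location["population"] > MIN_POPULATION
        and location["subscribers"] > MIN_SUBSCRIBERS
    )


# Made-up sample records, for illustration only.
candidates = [
    {"subreddit": "example_city_a", "population": 120_000, "subscribers": 4_500},
    {"subreddit": "example_city_b", "population": 30_000, "subscribers": 9_000},
    {"subreddit": "example_city_c", "population": 75_000, "subscribers": 800},
]

selected = [c["subreddit"] for c in candidates if is_eligible(c)]
print(selected)
```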
The demographic data comes from:
- Census - population (notebook)
- American Community Survey - median income (notebook)
Coordinates are retrieved using Google Maps API via the googlemaps Python client library. (notebook)
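
A minimal sketch of the coordinate lookup, assuming the `googlemaps` client's `geocode` method, which returns a list of results whose `geometry.location` entry holds `lat` and `lng`. The helper name `lookup_coordinates` and the `StubClient` class are hypothetical; the stub only mimics the response shape so the example runs without an API key.

```python
def lookup_coordinates(client, place_name):
    """Return (lat, lng) for the first geocoding result, or None if there is none."""
    results = client.geocode(place_name)
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


class StubClient:
    """Stand-in for googlemaps.Client that returns a canned response."""

    def geocode(self, address):
        if address == "Austin, TX":
            return [{"geometry": {"location": {"lat": 30.2672, "lng": -97.7431}}}]
        return []


print(lookup_coordinates(StubClient(), "Austin, TX"))
```

With the real library, `client` would be created with `googlemaps.Client(key=...)` and passed to the same helper.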
Maps are generated using the GeoPandas library. (notebook)
Sentiment analysis is performed using CountVectorizer with VADER.
Topic modeling is performed using TextBlob.
The rating system uses a proprietary score based on the following characteristics:
- Sentiment
- Income
- Population
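
The actual scoring formula is not given here. Purely to illustrate how three characteristics on different scales could be combined, the sketch below min-max normalizes each one and takes a weighted average. The equal weights, the normalization choice, and all function names are assumptions, not the project's method.

```python
def min_max(values):
    """Scale values to the range [0, 1]; a constant list maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def composite_scores(sentiment, income, population, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three per-location lists into one score per location.

    Hypothetical formula: weighted average of min-max normalized inputs.
    """
    columns = [min_max(sentiment), min_max(income), min_max(population)]
    return [
        sum(w * column[i] for w, column in zip(weights, columns))
        for i in range(len(sentiment))
    ]


# Made-up inputs for three locations.
scores = composite_scores(
    sentiment=[0.10, 0.25, 0.40],
    income=[50_000, 65_000, 80_000],
    population=[60_000, 300_000, 540_000],
)
print(scores)
```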
The Reddit JSON files are loaded into MongoDB.
The aggregated data, including sentiment, topic modeling results, and scores, is stored in PostgreSQL.
A total of ten tables are used in the PostgreSQL schema. (notebook)
- states
- cities
- topics
- keywords
- topics_keywords
- topics_geo
- models
- states_archive
- cities_archive
- topics_archive
pulseoftheland.com is built with the Flask web application framework. The running app is then scraped internally using wget, and the resulting static files are uploaded to a public AWS S3 bucket.
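
As a rough sketch of the scraping step, the snippet below builds a wget command that mirrors a locally running app into a folder of static files. The local URL, the output directory, and the exact set of options are assumptions; the project's real invocation is not shown here.

```python
import shlex


def build_mirror_command(base_url, output_dir):
    """Build a wget command that saves a site as static files under output_dir."""
    return [
        "wget",
        "--mirror",               # recursive download with timestamping
        "--page-requisites",      # also fetch CSS, JS, and images
        "--convert-links",        # rewrite links so they work as static files
        "--adjust-extension",     # save HTML pages with an .html suffix
        "--no-host-directories",  # do not create a directory named after the host
        f"--directory-prefix={output_dir}",
        base_url,
    ]


# Hypothetical values: Flask's default development address and a local folder.
command = build_mirror_command("http://localhost:5000/", "static_site")
print(shlex.join(command))
```

The command list can be passed to `subprocess.run(command, check=True)` while the app is running; uploading the resulting folder to S3 is a separate step.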
The process above is scheduled to run daily on the previous three months of data:
- Retrieve latest Reddit data
- Load into MongoDB
- Run sentiment analysis
- Run topic modeling
- Generate maps
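
The three-month window used by the daily run could be computed as in the sketch below. It assumes "three months" means three calendar months back from the run date, clamped to the last day of shorter months; the function names are hypothetical.

```python
import calendar
from datetime import date, datetime, timezone


def months_ago(day, months):
    """Return the date `months` calendar months before `day`, clamping the day of month."""
    month_index = day.year * 12 + (day.month - 1) - months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def data_window(run_date, months=3):
    """Return (start, end) dates covering the previous `months` months."""
    return months_ago(run_date, months), run_date


def to_epoch(day):
    """Convert a date to a UTC epoch timestamp in seconds (midnight)."""
    return int(datetime(day.year, day.month, day.day, tzinfo=timezone.utc).timestamp())


start, end = data_window(date(2021, 5, 31))
print(start, end, to_epoch(start))
```

Epoch timestamps are included because PSAW's `search_submissions` takes an epoch value for its `after` parameter.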
The web app runs on AWS.
- Private S3 bucket stores the Reddit JSON files
- Public S3 bucket hosts static HTML files
- Python
- scikit-learn
- textblob
- MongoDB
- PostgreSQL