
Pulse of the Land

Introduction

Pulse of the Land tracks geographic areas (states and cities) throughout the United States, combining sentiment analysis and topic modeling of posts from location-based subreddits on Reddit with demographic characteristics such as income and population from the U.S. Census.

(screenshot)

Tools

Python

Everything in this project is scripted using Python.

  • GeoPandas
  • Jupyter Notebooks

APIs

  • PRAW (Python Reddit API Wrapper)
  • PSAW (Pushshift API Wrapper)
  • Google Maps API

Architecture

  • AWS
    • EC2
    • S3
    • Route 53

Data

Reddit

Data for the sentiment analysis and topic modeling is obtained from city- and state-level location-based Reddit forums (subreddits) throughout the United States via the Pushshift.io API wrapper (PSAW). Only locations with populations over 50,000 and more than 1,000 subreddit subscribers are included. Metadata for the initial subreddit subscriber counts and selection is accessed via the Python Reddit API Wrapper (PRAW).
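The selection rule above can be sketched as a simple filter. The field names and sample metadata here are illustrative assumptions; the thresholds match the description:

```python
# Hypothetical subreddit metadata: population from the census,
# subscriber counts from PRAW. Field names are illustrative.
candidates = [
    {"location": "nyc", "population": 8_400_000, "subscribers": 250_000},
    {"location": "smalltown", "population": 12_000, "subscribers": 5_000},
    {"location": "quietcity", "population": 90_000, "subscribers": 400},
]

def select_subreddits(metadata):
    """Keep locations with >50,000 residents and >1,000 subscribers."""
    return [
        m for m in metadata
        if m["population"] > 50_000 and m["subscribers"] > 1_000
    ]

selected = select_subreddits(metadata=candidates)
# Only "nyc" passes both thresholds in this sample.
```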

Census

The demographic data comes from:

Mapping

Coordinates are retrieved using Google Maps API via the googlemaps Python client library. (notebook)

Maps are generated using the GeoPandas library. (notebook)
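A sketch of the geocoding step: the Google Maps Geocoding API returns a list of results whose coordinates sit under `geometry.location`. The trimmed response below stands in for what `googlemaps.Client(key=...).geocode(address)` would return:

```python
def extract_coords(geocode_results):
    """Return (lat, lng) from the first geocoding result, or None."""
    if not geocode_results:
        return None
    loc = geocode_results[0]["geometry"]["location"]
    return (loc["lat"], loc["lng"])

# Trimmed example of the Geocoding API's JSON shape.
sample = [{"geometry": {"location": {"lat": 40.7128, "lng": -74.0060}}}]
extract_coords(sample)  # (40.7128, -74.006)
```

The resulting coordinates can then be turned into GeoPandas point geometries for plotting.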

Analysis

Sentiment Analysis

Sentiment analysis is performed using CountVectorizer with VADER.
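VADER assigns each post a compound score in [-1, 1]; per-location sentiment can then be aggregated, for example by averaging. This is a minimal sketch with made-up scores — the project's actual aggregation is not specified:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (location, compound_score) pairs as VADER would produce.
scores = [
    ("nyc", 0.42), ("nyc", -0.10), ("nyc", 0.65),
    ("chicago", 0.05), ("chicago", 0.30),
]

def aggregate_sentiment(scored_posts):
    """Average VADER compound scores per location."""
    by_location = defaultdict(list)
    for location, compound in scored_posts:
        by_location[location].append(compound)
    return {loc: round(mean(vals), 4) for loc, vals in by_location.items()}

aggregate_sentiment(scores)
# {'nyc': 0.3233, 'chicago': 0.175}
```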

Topic Modeling

Topic modeling is performed using TextBlob.
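The README doesn't show the TextBlob pipeline; as a simplified stand-in, candidate topic keywords can be surfaced by counting frequent terms. The stopword list and tokenization below are illustrative, not the project's method:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "in", "is", "and", "of", "to"}  # illustrative

def top_keywords(posts, n=3):
    """Count non-stopword tokens across posts; return the n most common."""
    tokens = []
    for post in posts:
        tokens += [t for t in re.findall(r"[a-z']+", post.lower())
                   if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(n)]

posts = [
    "Traffic in the city is terrible",
    "New park opening downtown",
    "Downtown traffic updates",
]
top_keywords(posts)  # ['traffic', 'downtown', 'city']
```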

Rating System

The rating system uses a proprietary score based on the following characteristics:

  • Sentiment
  • Income
  • Population
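Since the score is proprietary, the exact formula isn't public; a hedged sketch of combining the three characteristics, with illustrative weights and pre-normalized inputs:

```python
def rate_location(sentiment, income, population,
                  weights=(0.5, 0.3, 0.2)):
    """Combine normalized inputs into a 0-100 score.

    `sentiment` is a VADER-style compound score in [-1, 1];
    `income` and `population` are assumed pre-normalized to [0, 1].
    The weights are illustrative, not the project's actual formula.
    """
    w_sent, w_inc, w_pop = weights
    sentiment_01 = (sentiment + 1) / 2  # rescale [-1, 1] -> [0, 1]
    score = w_sent * sentiment_01 + w_inc * income + w_pop * population
    return round(score * 100, 1)

rate_location(sentiment=0.4, income=0.6, population=0.5)  # 63.0
```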

Databases

MongoDB

The JSON files from the Reddit retrieval step are loaded into MongoDB.
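A minimal sketch of shaping one JSON line into a MongoDB document; the field names are illustrative, and the pymongo call (with an assumed database and collection name) is shown in a comment rather than run:

```python
import json

# One JSON line per post, as dumped from the Reddit retrieval step.
raw = '{"subreddit": "nyc", "title": "Best pizza?", "score": 120}'

def to_document(json_line):
    """Parse a JSON line into a dict ready for MongoDB insertion."""
    return json.loads(json_line)

doc = to_document(raw)
# With pymongo (not run here; names are assumptions):
#   from pymongo import MongoClient
#   MongoClient()["pulse"]["posts"].insert_one(doc)
```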

PostgreSQL

The aggregated data, including sentiment, topic-modeling results, and scores, is stored in PostgreSQL.

Tables

A total of ten tables are used in the PostgreSQL schema. (notebook)

  • states
  • cities
  • topics
  • keywords
  • topics_keywords
  • topics_geo
  • models
  • states_archive
  • cities_archive
  • topics_archive
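The column definitions aren't listed in the README; below is a sketch of two of the ten tables with hypothetical columns, run against in-memory SQLite purely for illustration (the project uses PostgreSQL):

```python
import sqlite3

# Hypothetical columns; the actual PostgreSQL schema is not shown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE states (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    sentiment REAL,
    score     REAL
);
CREATE TABLE cities (
    id        INTEGER PRIMARY KEY,
    state_id  INTEGER REFERENCES states(id),
    name      TEXT NOT NULL,
    sentiment REAL,
    score     REAL
);
""")
conn.execute("INSERT INTO states (name, sentiment) VALUES ('New York', 0.32)")
row = conn.execute("SELECT name, sentiment FROM states").fetchone()
# row == ('New York', 0.32)
```

The `*_archive` tables presumably hold prior runs' rows, which the daily workflow would rotate into before each refresh.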

Web App

pulseoftheland.com is published using the Flask web application framework. The site is then scraped internally using wget, and the resulting static files are uploaded to a public AWS S3 bucket.
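wget's mirroring maps each Flask route to a file on disk; that mapping can be sketched as follows (the `site` output directory is an assumption):

```python
from pathlib import PurePosixPath

def route_to_static_path(route, root="site"):
    """Map a Flask route to the file a wget mirror would write.

    Directory-style routes become index.html files, matching wget's
    default behavior for URLs without a file extension.
    """
    parts = [p for p in route.split("/") if p]
    path = PurePosixPath(root, *parts)
    return str(path if path.suffix else path / "index.html")

route_to_static_path("/")            # 'site/index.html'
route_to_static_path("/cities/nyc")  # 'site/cities/nyc/index.html'
```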

Workflow

The process above, using the previous three months' data, is scheduled to run daily:

  1. Retrieve latest reddit data
  2. Load into MongoDB
  3. Run sentiment analysis
  4. Run topic modeling
  5. Generate maps
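The daily run can be sketched as a driver that executes each step in order; the step functions below are placeholders standing in for the project's actual scripts:

```python
def run_pipeline(steps):
    """Execute pipeline steps in order, returning the names run."""
    completed = []
    for name, step in steps:
        step()
        completed.append(name)
    return completed

# Placeholder steps; in the project these would be real functions.
steps = [
    ("retrieve_reddit_data", lambda: None),
    ("load_into_mongodb",    lambda: None),
    ("run_sentiment",        lambda: None),
    ("run_topic_modeling",   lambda: None),
    ("generate_maps",        lambda: None),
]
run_pipeline(steps)
```

In practice a scheduler such as cron on the EC2 instance would invoke a driver like this once a day.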

Architecture

The web app runs on AWS.

S3

  • Private S3 bucket stores the Reddit json files
  • Public S3 bucket hosts static HTML files

EC2

  • Python
    • scikit-learn
    • textblob
  • MongoDB
  • PostgreSQL