The purpose of this project is to support Sparkify by building a database that enables in-depth songplay analysis. The source data resides in S3 as JSON files in two formats: log files describing user activity and song metadata files.

A log (event) entry looks like this:
{ "artist": "N.E.R.D. FEATURING MALICE",
"auth": "Logged In",
"firstName": "Jayden",
"gender": "M",
"itemInSession": 0,
"lastName": "Fox",
"length": 288.9922,
"level": "free",
"location": "New Orleans-Metairie, LA",
"method": "PUT",
"page":"NextSong",
"registration": 1541033612796.0,
"sessionId": 184,
"song": "Am I High (Feat. Malice)",
"status": 200,
"ts": 1541121934796,
"userAgent": "\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
"userId": 101
}
{ "num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0
}
We first load the JSON files from S3 into staging tables and then use these staging tables to fill the final fact and dimension tables.
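As an illustration, the staging load can be expressed with Redshift `COPY` commands along the following lines. This is a sketch only: the bucket paths, JSONPaths file, region, and IAM role are placeholders (configured in `dwh.cfg`), and the staging table names `staging_events` and `staging_songs` are assumptions.

```sql
-- Sketch of the staging load; paths, region, and role ARN are placeholders.
-- The log data uses a JSONPaths file to map its camelCase keys (e.g. userId)
-- onto snake_case columns; the song data maps onto column names directly.
COPY staging_events
FROM 's3://<bucket>/log_data'
IAM_ROLE '<arn>'
FORMAT AS JSON 's3://<bucket>/log_json_path.json'
REGION '<region>';

COPY staging_songs
FROM 's3://<bucket>/song_data'
IAM_ROLE '<arn>'
FORMAT AS JSON 'auto'
REGION '<region>';
```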
We follow a star schema with `songplays` as the fact table. This keeps analytical queries performant and makes meaningful queries easy to write; an example query is shown below, after the usage instructions.
**songplays** (fact table)

Column | Type | Property
--- | --- | ---
songplay_id | integer | Primary key
start_time | timestamp |
user_id | integer | Not Null
level | varchar |
song_id | varchar | Not Null
artist_id | varchar | Not Null
session_id | integer | Not Null
location | varchar |
user_agent | varchar |
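For reference, a DDL sketch matching this definition might look as follows. The `IDENTITY` surrogate key is an assumed implementation detail; distribution and sort keys are left at their defaults.

```sql
-- Sketch only: the IDENTITY surrogate key is an assumption, not prescribed above.
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP,
    user_id     INTEGER NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR NOT NULL,
    artist_id   VARCHAR NOT NULL,
    session_id  INTEGER NOT NULL,
    location    VARCHAR,
    user_agent  VARCHAR
);
```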
**users** (dimension table)

Column | Type | Property
--- | --- | ---
user_id | integer | Primary key
first_name | varchar |
last_name | varchar |
gender | varchar |
level | varchar |
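A simplified sketch of how a dimension table like this could be filled from the staging data (assuming a `staging_events` table mirroring the log format; deduplicating users whose `level` changed over time is ignored here):

```sql
-- Simplified sketch: takes distinct users from the staged log events.
INSERT INTO users (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userid, firstname, lastname, gender, level
FROM staging_events
WHERE page = 'NextSong' AND userid IS NOT NULL;
```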
**songs** (dimension table)

Column | Type | Property
--- | --- | ---
song_id | varchar | Primary key
title | varchar |
artist_id | varchar | Not Null
year | integer |
duration | numeric |
**artists** (dimension table)

Column | Type | Property
--- | --- | ---
artist_id | varchar | Primary key
name | varchar |
location | varchar |
latitude | numeric |
longitude | numeric |
**time** (dimension table)

Column | Type | Property
--- | --- | ---
start_time | timestamp | Primary key
hour | integer |
day | integer |
week | integer |
month | integer |
year | integer |
weekday | integer |
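The `ts` field in the log data is an epoch timestamp in milliseconds (e.g. `1541121934796` above), so the time dimension can be derived roughly as follows (a sketch; `staging_events` is an assumed staging table name):

```sql
-- Sketch: converts epoch milliseconds to timestamps and extracts the parts.
INSERT INTO time (start_time, hour, day, week, month, year, weekday)
SELECT DISTINCT
    start_time,
    EXTRACT(hour  FROM start_time),
    EXTRACT(day   FROM start_time),
    EXTRACT(week  FROM start_time),
    EXTRACT(month FROM start_time),
    EXTRACT(year  FROM start_time),
    EXTRACT(dow   FROM start_time)
FROM (
    SELECT TIMESTAMP 'epoch' + ts / 1000 * INTERVAL '1 second' AS start_time
    FROM staging_events
    WHERE page = 'NextSong'
) t;
```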
Set HOST and ARN in `dwh.cfg`. Then, from a terminal, create the database tables:

`python create_tables.py`

After that, populate the tables:

`python etl.py`
Have a look at `run_select_queries.ipynb` to see the first 20 entries of each fact and dimension table.
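As an example of the kind of analysis the star schema makes straightforward, a hypothetical query for the most-played songs among paying users could look like this (illustrative only; not part of the notebook):

```sql
-- Hypothetical analysis query: top 10 songs played by paid users.
SELECT s.title, a.name AS artist, COUNT(*) AS plays
FROM songplays sp
JOIN songs   s ON s.song_id   = sp.song_id
JOIN artists a ON a.artist_id = sp.artist_id
WHERE sp.level = 'paid'
GROUP BY s.title, a.name
ORDER BY plays DESC
LIMIT 10;
```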
The repository contains the following files:

File | Description
--- | ---
create_tables.py | Functions to initially create (and drop) the tables
dwh.cfg | Configuration file to access Redshift, the database, and S3
etl.py | ETL processes to transform the data residing in S3 into the DB schema
README.md | This file
sql_queries.py | SQL statements to drop and create tables, insert data, and run SELECT queries
run_select_queries.ipynb | Runs some SELECT queries for testing purposes