Purpose: to simplify analytics of Sparkify ("app") user song-play activity using a Postgres database, an ETL pipeline, and the Million Song Dataset ("music database").
Disclaimer: Sparkify is a fictional music streaming app. The Million Song Dataset however is real: http://millionsongdataset.com/ (freely-available collection of audio features and metadata for a million contemporary popular music tracks.)
- Problem statement and proposal
- How to run the Python Scripts
- Explanation of the files in the repository
- Rationale of database schema design and ETL pipeline
- Example queries and results for the song play analysis
- Appendix
Problem statement: a startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
Proposed solution: create a Postgres database with tables designed to optimize queries on song play analysis, along with a database schema and ETL pipeline. The database and ETL pipeline can be tested by running queries provided by the analytics team at Sparkify and comparing the results against expected output.
The following processes are repeatable and re-runnable.
- Setup (or re-setup) the Postgres environment. This step creates the `sparkifydb` (STAR schema) database and the initial empty skeleton tables inside it: one fact table, `songplays`, and 4 dimension tables: `users`, `songs`, `artists`, and `time`.

`python create_tables.py`
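For orientation, the kind of logic `create_tables.py` implements looks roughly like the sketch below. This is a minimal sketch, not the exact source: the connection credentials and the `create_table_queries` / `drop_table_queries` list names in `sql_queries.py` are assumptions.

```python
# Minimal sketch of create_tables.py-style logic (assumed, not the exact source).
import psycopg2
from sql_queries import create_table_queries, drop_table_queries  # assumed list names


def main():
    # Connect to a default database first so sparkifydb can be dropped/recreated.
    conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
    conn.close()

    # Reconnect to sparkifydb and (re)build the empty fact and dimension tables.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()
    conn.close()


if __name__ == "__main__":
    main()
```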
- Run the ETL pipeline to populate the `sparkifydb` tables. This step extracts the music app data (`data/song_data`) and the Sparkify event log data (`data/log_data`), applies transformations, and loads the results into the `sparkifydb` database (our ultimate output).

`python etl.py`
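The overall shape of the ETL script is roughly as below. This is a hedged sketch only: the insert statement names imported from `sql_queries.py` and the connection credentials are assumptions, and log-file processing follows the same pattern as the song files shown here.

```python
# Hedged sketch of the etl.py flow (not the exact implementation).
import glob
import os

import pandas as pd
import psycopg2

from sql_queries import song_table_insert, artist_table_insert  # assumed names


def get_json_files(filepath):
    """Collect every .json file under a data directory (song_data or log_data)."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files


def main():
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Extract: each song file holds a single JSON record.
    for path in get_json_files("data/song_data"):
        df = pd.read_json(path, lines=True)

        # Transform + load: select the columns documented in the mapping section below.
        song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
        cur.execute(song_table_insert, song_data)

        artist_data = df[["artist_id", "artist_name", "artist_location",
                          "artist_latitude", "artist_longitude"]].values[0].tolist()
        cur.execute(artist_table_insert, artist_data)
        conn.commit()

    # Log files (data/log_data) are processed the same way to fill users, time and songplays.
    conn.close()


if __name__ == "__main__":
    main()
```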
- Run basic tests against `sparkifydb` by running the Jupyter Notebook `test.ipynb` from top to bottom. Any errors and warnings should be fixed; in the end we should no longer get any warnings or errors. (Ensure to restart the kernel at the end when done to free up the session.) If we have done our job correctly, we should expect to see one record with non-null `song_id` or `artist_id` within the `songplays` table:
songplay_id: NextSong
start_time: 2018-11-21 21:56:47
user_id: 15
level: paid
song_id: SOZCTXZ12AB0182364
artist_id: AR5KOSW1187FB35FF4
session_id: 818
location: Chicago-Naperville-Elgin, IL-IN-WI
user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"
- Run the example query notebook `example_queries.ipynb` to perform analysis (it contains sample SQL queries). At the bottom of that notebook you will see a SQL query that joins across all the normalised tables and generates the eventual denormalised view. Make sure to restart the kernel when you are done.
Location: `data/song_data`.
This is a subset of real data from the Million Song Dataset (http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are file paths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
I've also developed a JSON schema (at `data_schema/song_schema.json`) to help visualise the structure of a JSON record in `data/song_data`, along with potential JSON validation improvement ideas (e.g. data quality checks, bad-data quarantine, etc.); see the sketch below.
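One way those schema files could be used for validation, as a hedged sketch: it assumes `data_schema/song_schema.json` is a standard JSON Schema document and uses the `jsonschema` library (not currently a project dependency).

```python
# Hedged sketch: validate one raw song record against data_schema/song_schema.json.
import json

from jsonschema import ValidationError, validate

with open("data_schema/song_schema.json") as f:
    song_schema = json.load(f)

with open("data/song_data/A/B/C/TRABCEI128F424C983.json") as f:
    record = json.load(f)

try:
    validate(instance=record, schema=song_schema)
    print("record OK")
except ValidationError as err:
    # A real pipeline could write the bad record to a quarantine folder instead.
    print(f"quarantine record: {err.message}")
```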
Location: `data/log_data`.
Sparkify app log files in JSON format, generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The (daily event) log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
Each daily event log file contains multiple events within that day. Each event is represented by one JSON record (compacted onto one line). Below is what a line of JSON may look like (displayed in a pretty-printed format for ease of visualisation):
{
"artist": null,
"auth":"LoggedIn",
"firstName":"Walter",
"gender":"M",
"itemInSession":0,
"lastName":"Frye",
"length":null,
"level":"free",
"location":"San Francisco-Oakland-Hayward, CA",
"method":"GET",
"page":"Home",
"registration":1540919166796.0,
"sessionId":38,
"song":null,
"status":200,
"ts":1541105830796,
"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
"userId":"39"}
I've also developed a JSON schema (at `data_schema/log_schema.json`) to help visualise the structure of a JSON record in `data/log_data`, along with potential JSON validation improvement ideas (e.g. data quality checks, bad-data quarantine, etc.); see the sketch below.
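For a feel of how these log events are consumed downstream, here is a hedged sketch of loading one daily file with pandas (the file path is taken from the examples above):

```python
# Hedged sketch: load one daily log file and keep only song-play events.
import pandas as pd

# Each line in a log file is one JSON event, hence lines=True.
log_df = pd.read_json("data/log_data/2018/11/2018-11-12-events.json", lines=True)

# Only "NextSong" page events represent actual song plays (the songplays fact table).
plays = log_df[log_df["page"] == "NextSong"].copy()

# ts is a UNIX epoch in milliseconds; convert it to a proper timestamp.
plays["start_time"] = pd.to_datetime(plays["ts"], unit="ms")
print(plays[["start_time", "userId", "level", "sessionId", "location"]].head())
```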
At a high level, we use a STAR schema to model our data. A STAR schema tends to promote data integrity and minimise data duplication.
Fact table: `songplays` - this table is at event level, i.e. one row per song-play log event.
Dimension tables: these tables are in general more static. They provide additional descriptive attributes to enrich the fact table. We may query these tables individually, as well as join them with other tables for more joined-up analysis.
- `users`: one row per unique Sparkify user, for us to build user-centric queries, e.g. what artists or songs a particular user listens to.
- `artists`: one row per unique artist, for us to build artist-centric queries, e.g. finding the most popular artists of all time, or for a particular period (see the sample query after this list).
- `time`: one row per unique timestamp, for us to build time-centric queries, e.g. user volume per weekday, or per month, etc.
- `songs`: one row per unique song, for us to build song-centric queries, e.g. how long a song is, who created it, which year it was released.
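As an illustration of the fact/dimension split, an artist-centric query of the kind mentioned above might look like this (a sketch against the schema documented below, not a query taken from the notebooks):

```sql
-- Hedged example: most-played artists of all time, joining the fact table to a dimension.
SELECT artists.name, COUNT(*) AS play_count
FROM songplays
JOIN artists ON artists.artist_id = songplays.artist_id
GROUP BY artists.name
ORDER BY play_count DESC
LIMIT 10;
```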
This section describes how the raw upstream JSON files are mapped to the eventual Postgres `sparkifydb` tables at field level. This part has helped me develop the initial ETL prototype.
`<postgres-table-field-name>`: `<postgres-table-field-type>` <-- `<raw-json-file>.<json-field-name>`: `<raw-json-field-type>`
Notes: we use the following aliases for documentation purposes below.

- `log`: means this comes from the log JSON files (from `data/log_data`). e.g. `log.userId` refers to the `userId` field of the log JSON file. And so on.
- `song`: means this comes from the song JSON files (from `data/song_data`). e.g. `song.song_id` refers to the `song_id` field of the song JSON file. And so on.
- Including psql constraint types and primary key indications (following iteration using `test.ipynb`).
- Note to self: we ought to document the Postgres-JSON mapping via configuration files instead (rather than the unwieldy markdown syntax below), and dynamically generate SQL create-table queries that way (Infrastructure as Code), probably in a more JSON-schema fashion. For the future! A rough sketch of the idea follows.
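A rough illustration of that Infrastructure-as-Code idea; the config format below is hypothetical and nothing like it exists in the repo yet:

```python
# Hypothetical sketch: generate CREATE TABLE statements from a configuration dict
# instead of hand-writing SQL (the config format below is made up for illustration).
TABLE_CONFIG = {
    "users": {
        "user_id":    "BIGINT PRIMARY KEY",
        "first_name": "VARCHAR",
        "last_name":  "VARCHAR",
        "gender":     "VARCHAR",
        "level":      "VARCHAR",
    },
}


def build_create_table(table_name: str, columns: dict) -> str:
    cols = ",\n    ".join(f"{name} {ctype}" for name, ctype in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n    {cols}\n);"


print(build_create_table("users", TABLE_CONFIG["users"]))
```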
Fact Table

- `songplays` - records in log data associated with song plays, i.e. records with page `NextSong` (a DDL sketch follows this listing). Fields:
  - `songplay_id`: `SERIAL` (`PRIMARY KEY`) <-- auto-generated (no upstream JSON field)
  - `start_time`: `TIMESTAMP` (`NOT NULL`) <-- `log.ts`: `int` (need to convert UNIX epoch time into the PSQL `TIMESTAMP` type; transform as string "yyyy-mm-dd HH:MM:SS" so we can parse into the PSQL table smoothly)
  - `user_id`: `BIGINT` <-- `log.userId`: `integer`
  - `level`: `VARCHAR` <-- `log.level`: `string`
  - `song_id`: `VARCHAR` <-- `song.song_id`: `string`
  - `artist_id`: `VARCHAR` <-- `song.artist_id`: `string`
  - `session_id`: `INT` <-- `log.sessionId`: `integer`
  - `location`: `VARCHAR` <-- `log.location`: `string`
  - `user_agent`: `VARCHAR` <-- `log.userAgent`: `string`
  - ...
  - `ON CONFLICT (songplay_id) DO NOTHING`
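Put together, the DDL implied by this mapping looks roughly as follows. This is a sketch only; the exact statement lives in `sql_queries.py` and may differ.

```sql
-- Hedged sketch of the songplays DDL implied by the mapping above.
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     BIGINT,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
```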
Dimension Tables

- `users` - users in the app. Fields:
  - `user_id`: `BIGINT` (`PRIMARY KEY`) <-- `log.userId`: `integer`
  - `first_name`: `VARCHAR` <-- `log.firstName`: `string`
  - `last_name`: `VARCHAR` <-- `log.lastName`: `string`
  - `gender`: `VARCHAR` <-- `log.gender`: `string`
  - `level`: `VARCHAR` <-- `log.level`: `string`
  - ...
  - `ON CONFLICT (user_id) DO` <replace old values with new values> (see the upsert sketch after this listing)
- `songs` - songs in the music database. Fields:
  - `song_id`: `VARCHAR` (`PRIMARY KEY`) <-- `song.song_id`: `string`
  - `title`: `VARCHAR` (`NOT NULL`) <-- `song.title`: `string`
  - `artist_id`: `VARCHAR` <-- `song.artist_id`: `string`
  - `year`: `INT` (`NOT NULL`) <-- `song.year`: `integer`
  - `duration`: `DECIMAL` <-- `song.duration`: `number`
  - ...
  - `ON CONFLICT (song_id) DO NOTHING`
- `artists` - artists in the music database. Fields:
  - `artist_id`: `VARCHAR` (`PRIMARY KEY`) <-- `song.artist_id`: `string`
  - `name`: `VARCHAR` (`NOT NULL`) <-- `song.artist_name`: `string`
  - `location`: `VARCHAR` <-- `song.artist_location`: `string`
  - `latitude`: `DOUBLE PRECISION` <-- `song.artist_latitude`: `number`
  - `longitude`: `DOUBLE PRECISION` <-- `song.artist_longitude`: `number`
  - ...
  - `ON CONFLICT (artist_id) DO NOTHING`
- `time` - timestamps of records in `songplays` broken down into specific units (see the sketches after this listing). Fields:
  - `start_time`: `TIMESTAMP` (`PRIMARY KEY`) <-- `log.ts`: `int` (need to convert UNIX epoch time into the PSQL `TIMESTAMP` type; transform as string "yyyy-mm-dd HH:MM:SS")
  - `hour`: `INT` <-- derived from `start_time`. Get hour (0-23).
  - `day`: `INT` <-- derived from `start_time`. Get day of month.
  - `week`: `INT` <-- derived from `start_time`. Get week number (0-53).
  - `month`: `INT` <-- derived from `start_time`. Get month number (1-12).
  - `year`: `INT` <-- derived from `start_time`. Get year (yyyy).
  - `weekday`: `INT` <-- derived from `start_time`. Get weekday (1-7).
  - ...
  - `ON CONFLICT (start_time) DO NOTHING`
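The "replace old values with new values" rule on the `users` table translates to an upsert along these lines. This is a sketch using psycopg2-style `%s` placeholders; the exact statement in `sql_queries.py` may differ.

```sql
-- Hedged sketch: upsert a user, overwriting the old attributes (notably level) on conflict.
INSERT INTO users (user_id, first_name, last_name, gender, level)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (user_id)
DO UPDATE SET
    first_name = EXCLUDED.first_name,
    last_name  = EXCLUDED.last_name,
    gender     = EXCLUDED.gender,
    level      = EXCLUDED.level;
```

And the `time` columns can be derived from `log.ts` with pandas, roughly as below (a sketch; note that pandas `dt.weekday` is 0-6 with Monday = 0, so a `+1` may be needed for the 1-7 convention above):

```python
# Hedged sketch: derive the time-dimension columns from the log "ts" epoch (milliseconds).
import pandas as pd

ts = pd.Series([1541105830796])             # example log.ts value from the log sample above
start_time = pd.to_datetime(ts, unit="ms")  # UNIX epoch (ms) -> TIMESTAMP

time_df = pd.DataFrame({
    "start_time": start_time,
    "hour":       start_time.dt.hour,
    "day":        start_time.dt.day,
    "week":       start_time.dt.isocalendar().week,
    "month":      start_time.dt.month,
    "year":       start_time.dt.year,
    "weekday":    start_time.dt.weekday,    # 0-6, Monday = 0
})
print(time_df)
```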
Run the example query notebook `example_queries.ipynb` to perform analysis. At the bottom of that notebook you will see a SQL query that joins across all the normalised tables and generates the eventual denormalised view. Make sure to restart the kernel when you are done.
The following SQL query will create the denormalised view:
SELECT \
songplays.start_time AS event_start_time \
,songplays.songplay_id AS songplay_id \
,time.year AS event_year \
,time.month AS event_month \
,time.day AS event_day \
,time.hour AS event_hour \
,time.week AS event_week \
,time.weekday AS event_weekday \
,users.user_id AS user_id \
,users.first_name AS user_first_name \
,users.last_name AS user_last_name \
,users.gender AS user_gender \
,users.level AS user_level \
,songs.song_id AS song_id \
,songs.title AS song_title \
,songs.year AS song_release_year \
,songs.duration AS song_duration \
,artists.artist_id AS artist_id \
,artists.location AS artist_location \
,artists.latitude AS artist_latitude \
,artists.longitude AS artist_longitude \
FROM songplays \
INNER JOIN users \
ON users.user_id = songplays.user_id \
INNER JOIN songs \
ON songs.song_id = songplays.song_id \
INNER JOIN artists \
ON songplays.artist_id = artists.artist_id \
INNER JOIN time \
ON songplays.start_time = time.start_time \
;
Additional notes are included below. These helped me develop the initial prototypes.
Review and revise the following components (listed in an order that more or less represents the pipeline workflow):
- `psql_sandpit.ipynb`: got a bit rusty using Postgres via terminal commands? Use this notebook (in conjunction with a terminal) to have a play, e.g. invoke the psql command line interface (CLI); list databases/tables; create/drop databases and tables; insert values into tables, etc.
- `sql_queries.py`: a helper module imported by the main `create_tables.py` and `etl.py` modules; see the sketch of its structure after this list.
- `create_tables.py`: this step (re-)generates the `sparkifydb` database, with the blank skeleton tables created (with zero rows populated): `songplays`, `users`, `songs`, `artists`, `time`.
- `etl.py`: this step loads raw data (from `data`) into the `sparkifydb` Postgres database. During the development phase, the developer may use the Jupyter Notebook `etl.ipynb` for experimentation and to iterate on the `etl.py` module.
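For orientation, `sql_queries.py` is organised roughly as a module of query strings plus the lists the other scripts import. This is a sketch only, not the exact file contents.

```python
# Hedged sketch of the sql_queries.py layout (query strings + lists imported elsewhere).
user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    BIGINT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
"""

user_table_drop = "DROP TABLE IF EXISTS users;"

# create_tables.py loops over these lists to (re)build the empty schema;
# etl.py imports the insert statements (not shown here).
create_table_queries = [user_table_create]   # plus songplays, songs, artists, time
drop_table_queries = [user_table_drop]
```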
To test out the eventual ETL pipeline, perform the following steps (all re-runnable):
Setup (or re-setup) the Postgres environment:

`python create_tables.py`
Run the ETL script to populate all the Postgres tables:

`python etl.py`
Run `test.ipynb` from top to bottom to ensure we are not getting any warnings or errors (ensure to restart the kernel at the end when done to free up the session). Note the bottom part of the notebook contains a final sanity check:

`%sql SELECT * FROM songplays WHERE song_id IS NOT NULL OR artist_id IS NOT NULL;`
The project rubric suggests that there should be one record with a non-null `song_id` and a non-null `artist_id` in the `songplays` table (i.e. based on the song and log samples, only one Sparkify user session overlaps with the Million Song Dataset - the join is via song title, artist name, and song duration; a sketch of that lookup follows the record below). If we have done our job correctly, we should expect to see one record as per the following:
songplay_id: NextSong
start_time: 2018-11-21 21:56:47
user_id: 15
level: paid
song_id: SOZCTXZ12AB0182364
artist_id: AR5KOSW1187FB35FF4
session_id: 818
location: Chicago-Naperville-Elgin, IL-IN-WI
user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"
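The song/artist matching described above boils down to a lookup of this shape. This is a sketch with psycopg2-style `%s` placeholders; the exact query in `sql_queries.py` may differ.

```sql
-- Hedged sketch: find the song_id/artist_id for a log event by matching on
-- song title, artist name and duration.
SELECT songs.song_id, artists.artist_id
FROM songs
JOIN artists ON songs.artist_id = artists.artist_id
WHERE songs.title = %s
  AND artists.name = %s
  AND songs.duration = %s;
```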
Opportunity 1: test upstream raw JSON records.
Run a JSON schema check to ensure each raw JSON record (in `data`) aligns to the JSON schema files (in `data_schema`). Ingest the good records and quarantine the bad records.
Opportunity 2: test the ETL process.
Test on a record-by-record basis and confirm that each raw JSON record is parsed into the `sparkifydb` tables correctly. Compare expected vs actual values. Log all the bad ETL outcomes.
Opportunity 3: mini-batch insert multiple rows rather than one row at a time (e.g. with SQL `COPY`); see the sketch below.
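A hedged sketch of what that could look like with psycopg2's `copy_from`. The connection credentials and the example row are assumptions, and note that `COPY` has no `ON CONFLICT` handling, so a staging table may be needed where upserts matter.

```python
# Hedged sketch: batch-load rows into the time table with COPY instead of per-row INSERTs.
import io

import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Build a tab-separated in-memory buffer (one example row shown).
buffer = io.StringIO()
buffer.write("2018-11-21 21:56:47\t21\t21\t47\t11\t2018\t2\n")
buffer.seek(0)

# COPY ... FROM STDIN is far faster than row-by-row INSERTs for large batches.
cur.copy_from(buffer, "time",
              columns=("start_time", "hour", "day", "week", "month", "year", "weekday"))
conn.commit()
conn.close()
```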
I have saved a copy of the original Udacity `etl.py` as `etl_template.py` for future reference. Maybe our `etl.py` can be made better in future.
Useful links: