The purpose of this project is to build a database and design an ETL pipeline workflow that powers the analytical needs of Sparkify.
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. In order to provide a better service to their users, Sparkify would like to bring their data into a database so analysts can run queries faster and analyse user patterns in an optimized manner.
This section covers the tools used in the project:
- Redshift to store the data and S3 for backup data
- Airflow for workflow orchestration, running the ETL and data quality checks (a sketch of such a DAG follows this list)
- Python to create and delete the infrastructure
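As a minimal, hedged sketch of what an ETL-plus-quality-check DAG could look like (the actual DAGs live in the `airflow` directory described below; the DAG id, task id, connection id `redshift`, and the Airflow 2.x imports here are assumptions, not the project's code):

```python
# Sketch of a data quality check wired into an Airflow DAG.
# DAG id, task id and the "redshift" connection id are assumed names.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_table_has_rows(table: str) -> None:
    """Fail the task if the target table returned no rows."""
    hook = PostgresHook(postgres_conn_id="redshift")
    records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} has no rows")


with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    quality_check = PythonOperator(
        task_id="check_songplays_not_empty",
        python_callable=check_table_has_rows,
        op_kwargs={"table": "songs.songplays"},
    )
```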
Using the song and log datasets, a star schema optimized for queries on song play analysis was created; it consists of the tables listed below. With this schema the majority of the data required for analysis lives in a few tables, so only a few joins are needed. This is a common industry design that balances normalisation with query performance. Since the goal is a database that analysts can use efficiently, the focus was on optimisation rather than on reducing storage and redundancy. The fact table is distributed by key across all 4 nodes of the cluster, whereas the dimension tables are duplicated on every node; because these tables are small, this speeds up analysis. A sketch of table definitions reflecting this distribution strategy follows the list below.
- songplays - records in log data associated with song plays i.e. records with page NextSong
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
- user_id, first_name, last_name, gender, level
- songs - songs in music database
- song_id, title, artist_id, year, duration
- artists - artists in music database
- artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
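The definitions below are a hedged sketch of the distribution strategy described above, written as SQL strings the way a Python helper module might hold them; the column types, sort key, and the `songs` schema for the `users` table are assumptions rather than the project's actual DDL:

```python
# Hypothetical DDL illustrating DISTSTYLE KEY for the fact table and
# DISTSTYLE ALL for a dimension table; types and keys are assumptions.
CREATE_SONGPLAYS = """
CREATE TABLE IF NOT EXISTS songs.songplays (
    songplay_id BIGINT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
)
DISTSTYLE KEY
DISTKEY (song_id)       -- fact table spread across the 4 nodes by key
SORTKEY (start_time);
"""

CREATE_USERS = """
CREATE TABLE IF NOT EXISTS songs.users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
)
DISTSTYLE ALL;          -- small dimension table copied to every node
"""
```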
- airflow - Contains the DAGs and Airflow orchestration code
- creatingInfrastructure.py - Contains the pipeline that automates the infrastructure requirements: it creates the Redshift cluster, sets up the role with access to read data, and loads data from S3 into the Redshift cluster (a rough sketch follows this list)
- deletingInfrastructure.py - Contains the pipeline to destroy the infrastructure associated with the project
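A rough sketch of how creatingInfrastructure.py might provision the cluster with boto3; the cluster identifier, node type, database name, and credentials shown here are placeholders and assumptions, not the script's actual values:

```python
# Hypothetical provisioning sketch with boto3; identifiers and credentials
# are assumptions -- the real values come from the generated config file.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

response = redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,                       # matches the 4-node cluster described above
    ClusterIdentifier="sparkify-cluster",  # assumed identifier
    DBName="sparkify",                     # assumed database name
    MasterUsername="awsuser",              # assumed; real values come from dwh.cfg
    MasterUserPassword="********",
    IamRoles=["<redshift_iam_arn>"],       # role that lets Redshift read from S3
)
print(response["Cluster"]["ClusterStatus"])
```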
- Create the editor role in AWS IAM.
- Open a Jupyter notebook and run the following code twice, entering your aws_access_key on the first run and your secret_key on the second:

      import base64
      sample_string = input()  # paste the AWS access key or secret key
      sample_string_bytes = sample_string.encode("ascii")
      base64_bytes = base64.b64encode(sample_string_bytes)
      base64_string = base64_bytes.decode("ascii")
      print(f"Encoded string: {base64_string}")
- Create an aws_editor.json file with the following keys:

      {"access_key": "<encoded aws_access_key>", "secret_key": "<encoded aws_secret_key>"}
- Save the aws_editor.json file in the root user path:

      /Users/<systemUser>/aws_editor.json
- Open a terminal.
- Run the createConfig.py script to create the config file that is used to spin up the infrastructure and holds the details needed to connect to Redshift and insert data into it (a hedged sketch of what this script might do follows):

      python3 createConfig.py
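A minimal sketch of what createConfig.py could do, assuming it decodes the base64 credentials stored in aws_editor.json and writes them into dwh.cfg with configparser; the section and option names here are assumptions:

```python
# Hypothetical sketch: decode the stored credentials and write dwh.cfg.
# Section/option names and the exact config layout are assumptions.
import base64
import configparser
import json
from pathlib import Path

creds = json.loads((Path.home() / "aws_editor.json").read_text())

config = configparser.ConfigParser()
config["AWS"] = {
    "KEY": base64.b64decode(creds["access_key"]).decode("ascii"),
    "SECRET": base64.b64decode(creds["secret_key"]).decode("ascii"),
}

with open("dwh.cfg", "w") as f:
    config.write(f)
```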
- Run creatingInfrastructure.py to get the infrastructure ready:

      python3 creatingInfrastructure.py
- Launch Airflow locally, in Docker, or deploy it on a VM (instructions upcoming).
- Run airflowVariablesAndConnections.sh to register the Redshift connection details and variables in Airflow (a sketch of how the DAGs might consume them follows):

      bash airflowVariablesAndConnections.sh <redshift_iam_arn> <redshift_cluster_endpoint>
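As a hedged sketch of how a DAG task could consume what this script registers (the variable name `iam_role_arn`, the connection id `redshift`, the staging table, and the S3 bucket are assumptions, not necessarily the names the script uses):

```python
# Hypothetical usage inside a DAG task; the variable name, connection id,
# staging table, and S3 bucket are assumptions about what the shell script
# registers, not the project's actual names.
from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook

iam_role_arn = Variable.get("iam_role_arn")            # the <redshift_iam_arn> argument
redshift = PostgresHook(postgres_conn_id="redshift")   # backed by <redshift_cluster_endpoint>

# Load raw song data from S3 into a staging table using the IAM role.
copy_songs = f"""
    COPY staging_songs
    FROM 's3://<bucket>/song_data'
    IAM_ROLE '{iam_role_arn}'
    FORMAT AS JSON 'auto';
"""
redshift.run(copy_songs)
```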
- Run Airflow (instructions upcoming).
- Log in to the AWS console and go to the Redshift cluster: [AWS Redshift us-west-2 homepage](https://us-west-2.console.aws.amazon.com/redshiftv2/home?region=us-west-2#landing)
- Click on "Query Editor" and enter the login details as in the dwh.cfg file.
- Run the query below:
      SELECT user_id, start_time, song_id, session_id, level FROM songs.songplays
      WHERE session_id = 221
        AND song_id = 'SODTRKW12A6D4F9A51'
      LIMIT 1;
| user_id | start_time          | song_id            | session_id | level |
|---------|---------------------|--------------------|------------|-------|
| 15      | 2018-11-07 18:11:11 | SODTRKW12A6D4F9A51 | 221        | paid  |
- Run deletingInfrastructure.py to destroy the cluster and remove the role and the policy attached to the role (a rough teardown sketch follows the command):

      python3 deletingInfrastructure.py
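A rough sketch of how deletingInfrastructure.py might tear everything down with boto3; the cluster identifier, role name, and attached policy are assumptions mirroring the create sketch above, not the script's actual values:

```python
# Hypothetical teardown sketch; identifiers, role and policy names are assumptions.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")
iam = boto3.client("iam")

# Delete the cluster without keeping a final snapshot.
redshift.delete_cluster(
    ClusterIdentifier="sparkify-cluster",
    SkipFinalClusterSnapshot=True,
)

# Detach the S3 read policy from the role, then remove the role itself.
iam.detach_role_policy(
    RoleName="sparkify-redshift-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
iam.delete_role(RoleName="sparkify-redshift-role")
```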
Nakul Bajaj