The purpose of this project is to build a database and design an ETL pipeline workflow that powers the analytical needs of Sparkify.
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. In order to provide a better service to their users, Sparkify would like to bring their data into a database so analysts can run queries faster and analyse user patterns in an optimized manner.
This section covers the tools used in the project:
- Redshift to store the data and S3 for backup data
- Airflow for workflow orchestration, running the ETL and data quality checks (a sketch of such a DAG follows this list)
- Python to create and delete the infrastructure
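As a minimal, hedged sketch of what an ETL-plus-quality-check DAG could look like (the actual DAGs live in the `airflow` directory described below; the DAG id, task id, connection id `redshift`, and the Airflow 2.x imports here are assumptions, not the project's code):

```python
# Sketch of a data quality check wired into an Airflow DAG.
# DAG id, task id and the "redshift" connection id are assumed names.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_table_has_rows(table: str) -> None:
    """Fail the task if the target table returned no rows."""
    hook = PostgresHook(postgres_conn_id="redshift")
    records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} has no rows")


with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    quality_check = PythonOperator(
        task_id="check_songplays_not_empty",
        python_callable=check_table_has_rows,
        op_kwargs={"table": "songs.songplays"},
    )
```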
Using the song and log datasets, a star schema optimized for queries on song play analysis was created; it consists of the tables listed below. With this schema the majority of the data required for analysis lives in a few tables, so only a few joins are needed. This is a common industry design that balances normalisation with query performance. Since the goal is a database that analysts can use efficiently, the focus was on optimisation rather than on reducing storage and redundancy. The fact table is distributed by key across all 4 nodes of the cluster, whereas the dimension tables are duplicated on every node; because these tables are small, this speeds up analysis. A sketch of table definitions reflecting this distribution strategy follows the list below.
- songplays - records in log data associated with song plays i.e. records with page NextSong
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
- user_id, first_name, last_name, gender, level
- songs - songs in music database
- song_id, title, artist_id, year, duration
- artists - artists in music database
- artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
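The definitions below are a hedged sketch of the distribution strategy described above, written as SQL strings the way a Python helper module might hold them; the column types, sort key, and the `songs` schema for the `users` table are assumptions rather than the project's actual DDL:

```python
# Hypothetical DDL illustrating DISTSTYLE KEY for the fact table and
# DISTSTYLE ALL for a dimension table; types and keys are assumptions.
CREATE_SONGPLAYS = """
CREATE TABLE IF NOT EXISTS songs.songplays (
    songplay_id BIGINT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
)
DISTSTYLE KEY
DISTKEY (song_id)       -- fact table spread across the 4 nodes by key
SORTKEY (start_time);
"""

CREATE_USERS = """
CREATE TABLE IF NOT EXISTS songs.users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
)
DISTSTYLE ALL;          -- small dimension table copied to every node
"""
```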
- airflow - Contains the DAGs and Airflow orchestration code
- creatingInfrastructure.py - Contains the pipeline that automates the infrastructure requirements: it creates the Redshift cluster, sets up the role with access to read data, and loads data from S3 into the Redshift cluster (a rough sketch follows this list)
- deletingInfrastructure.py - Contains the pipeline to destroy the infrastructure associated with the project
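A rough sketch of how creatingInfrastructure.py might provision the cluster with boto3; the cluster identifier, node type, database name, and credentials shown here are placeholders and assumptions, not the script's actual values:

```python
# Hypothetical provisioning sketch with boto3; identifiers and credentials
# are assumptions -- the real values come from the generated config file.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

response = redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,                       # matches the 4-node cluster described above
    ClusterIdentifier="sparkify-cluster",  # assumed identifier
    DBName="sparkify",                     # assumed database name
    MasterUsername="awsuser",              # assumed; real values come from dwh.cfg
    MasterUserPassword="********",
    IamRoles=["<redshift_iam_arn>"],       # role that lets Redshift read from S3
)
print(response["Cluster"]["ClusterStatus"])
```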
- Create the editor role in AWS IAM.
- Open a Jupyter notebook and run the following code twice, entering your aws_access_key on the first run and your secret_key on the second:

      import base64
      sample_string = input()  # paste the AWS access key or secret key
      sample_string_bytes = sample_string.encode("ascii")
      base64_bytes = base64.b64encode(sample_string_bytes)
      base64_string = base64_bytes.decode("ascii")
      print(f"Encoded string: {base64_string}")
- Create an aws_editor.json file with the following keys:

      {"access_key": "<encoded aws_access_key>", "secret_key": "<encoded aws_secret_key>"}
- Save the aws_editor.json file in the root user path:

      /Users/<systemUser>/aws_editor.json
- Open a terminal.
- Run the createConfig.py script to create the config file that is used to spin up the infrastructure and holds the details needed to connect to Redshift and insert data into it (a hedged sketch of what this script might do follows):

      python3 createConfig.py
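A minimal sketch of what createConfig.py could do, assuming it decodes the base64 credentials stored in aws_editor.json and writes them into dwh.cfg with configparser; the section and option names here are assumptions:

```python
# Hypothetical sketch: decode the stored credentials and write dwh.cfg.
# Section/option names and the exact config layout are assumptions.
import base64
import configparser
import json
from pathlib import Path

creds = json.loads((Path.home() / "aws_editor.json").read_text())

config = configparser.ConfigParser()
config["AWS"] = {
    "KEY": base64.b64decode(creds["access_key"]).decode("ascii"),
    "SECRET": base64.b64decode(creds["secret_key"]).decode("ascii"),
}

with open("dwh.cfg", "w") as f:
    config.write(f)
```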
- Run creatingInfrastructure.py to get the infrastructure ready:

      python3 creatingInfrastructure.py
- Launch Airflow locally, in Docker, or deploy it on a VM (instructions upcoming).
- Run airflowVariablesAndConnections.sh to register the Redshift connection details and variables in Airflow (a sketch of how the DAGs might consume them follows):

      bash airflowVariablesAndConnections.sh <redshift_iam_arn> <redshift_cluster_endpoint>
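As a hedged sketch of how a DAG task could consume what this script registers (the variable name `iam_role_arn`, the connection id `redshift`, the staging table, and the S3 bucket are assumptions, not necessarily the names the script uses):

```python
# Hypothetical usage inside a DAG task; the variable name, connection id,
# staging table, and S3 bucket are assumptions about what the shell script
# registers, not the project's actual names.
from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook

iam_role_arn = Variable.get("iam_role_arn")            # the <redshift_iam_arn> argument
redshift = PostgresHook(postgres_conn_id="redshift")   # backed by <redshift_cluster_endpoint>

# Load raw song data from S3 into a staging table using the IAM role.
copy_songs = f"""
    COPY staging_songs
    FROM 's3://<bucket>/song_data'
    IAM_ROLE '{iam_role_arn}'
    FORMAT AS JSON 'auto';
"""
redshift.run(copy_songs)
```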
- Run Airflow (instructions upcoming).
- Log in to the AWS console and go to the Redshift cluster: [AWS Redshift us-west-2 homepage](https://us-west-2.console.aws.amazon.com/redshiftv2/home?region=us-west-2#landing)
- Click on "Query Editor" and enter the login details as in the dwh.cfg file.
- Run the query below:
      SELECT user_id, start_time, song_id, session_id, level FROM songs.songplays
      WHERE session_id = 221
        AND song_id = 'SODTRKW12A6D4F9A51'
      LIMIT 1;
| user_id | start_time          | song_id            | session_id | level |
|---------|---------------------|--------------------|------------|-------|
| 15      | 2018-11-07 18:11:11 | SODTRKW12A6D4F9A51 | 221        | paid  |
- Run deletingInfrastructure.py to destroy the cluster and remove the role and the policy attached to the role (a rough teardown sketch follows the command):

      python3 deletingInfrastructure.py
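A rough sketch of how deletingInfrastructure.py might tear everything down with boto3; the cluster identifier, role name, and attached policy are assumptions mirroring the create sketch above, not the script's actual values:

```python
# Hypothetical teardown sketch; identifiers, role and policy names are assumptions.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")
iam = boto3.client("iam")

# Delete the cluster without keeping a final snapshot.
redshift.delete_cluster(
    ClusterIdentifier="sparkify-cluster",
    SkipFinalClusterSnapshot=True,
)

# Detach the S3 read policy from the role, then remove the role itself.
iam.detach_role_policy(
    RoleName="sparkify-redshift-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
iam.delete_role(RoleName="sparkify-redshift-role")
```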
Nakul Bajaj