A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory with JSON metadata on the songs in the app.
As its data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs users are listening to.
The first step is to download or clone this Project repo.
You need to have git installed to clone this repo. Please follow this link for installation instructions depending on your OS.
Cloning via HTTPS
$ git clone https://github.com/marcusreaiche/de-udacity-nanodegree-project-03.git
Cloning via SSH
$ git clone [email protected]:marcusreaiche/de-udacity-nanodegree-project-03.git
We use miniconda3 to define the Python environment. Please visit this link for installation instructions.
The conda.yml file lists the Project dependencies:
name: data-warehouse
channels:
- conda-forge
- defaults
dependencies:
- python=3.9.18
- pandas=2.1.1
- awswrangler=3.4.1
- boto3=1.28.70
- botocore=1.31.70
- jupyter=1.0.0
- sqlalchemy=1.4.49
- psycopg2=2.9.7
- psycopg2-binary=2.9.7
- ipython-sql=0.3.9
- pip
- pip:
- python-dotenv==1.0.0
Run the following command to create a conda environment.
$ conda env create -f conda.yml
Then, activate it by running:
$ conda activate data-warehouse
You need an AWS account to run this Project.
When using boto3, the AWS credentials and config files are automatically looked up in the directory:
~/.aws
You can define environment variables in the .env file to change the default location of these files. See an example in the example.env file.
Note that a .env file will be automatically created in case you do not create one in the project root directory. In this case, the environment variables will be automatically set to their default values:
AWS_SHARED_CREDENTIALS_FILE=~/.aws/credentials
AWS_CONFIG_FILE=~/.aws/config
It is better to explicitly create a .env file in the project root, though. For performance reasons, it is also recommended to set the region value in the AWS config file to 'us-west-2', since this is the same region as the S3 resources used in this Project.
Here is an example of an AWS credentials file:
[default]
aws_access_key_id = <your AWS access key ID>
aws_secret_access_key = <your AWS secret access key>
and of an AWS config file:
[default]
region = us-west-2
output = json
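For illustration, the sketch below shows how a script could load the .env file with python-dotenv so that boto3 picks up these locations when resolving credentials. This is a minimal sketch, assuming the packages listed in conda.yml; it is not the project's exact code.

# minimal sketch: load .env so boto3 reads the configured credential/config files
# (assumes python-dotenv and boto3 from conda.yml; not the project's exact code)
import os
from dotenv import load_dotenv
import boto3

load_dotenv()  # reads a .env file from the working directory, if present

# boto3/botocore honor these environment variables when locating the files
print(os.environ.get("AWS_SHARED_CREDENTIALS_FILE", "~/.aws/credentials"))
print(os.environ.get("AWS_CONFIG_FILE", "~/.aws/config"))

session = boto3.Session()    # uses the [default] profile
print(session.region_name)   # us-west-2 if set in the AWS config file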
Project directory tree
├── LICENSE
├── README.md
├── analytics_queries.py
├── conda.yml
├── config.py
├── create_redshift_cluster.py
├── create_tables.py
├── delete_redshift_cluster.py
├── dwh.cfg
├── etl.py
├── example.env
├── imgs
│ ├── number_of_rows.png
│ ├── sparkify.png
│ └── star_schema.png
├── run_pipeline.py
├── sql_queries.py
└── utils
├── __init__.py
├── helpers.py
└── manage_aws_resources.py
3 directories, 19 files
The Project has 5 scripts:
- create_redshift_cluster.py
  Creates a Redshift cluster as specified in the dwh.cfg file. It also creates an IAM role used by the Redshift cluster. This script dynamically fills in the db_host and arn keys of the CLUSTER_CREDENTIALS and IAM_ROLE sections, respectively, in the dwh.cfg file.
- create_tables.py
  Creates the staging tables, i.e., staging_events and staging_songs. Creates a star schema with the fact table songplays and the dimension tables users, songs, artists, and time. Before creating these tables, the script drops any of them that already exist.
- etl.py
  Performs the ETL by loading the staging tables and inserting data into the star schema tables (a sketch of this pattern follows this list).
- analytics_queries.py
  Executes examples of analytics queries that one would like to run in the Data Warehouse.
- delete_redshift_cluster.py
  Deletes the Redshift cluster and the IAM role.
Please check the scripts' doc strings for more information.
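The staging-and-insert pattern used by etl.py looks roughly like the sketch below. This is a minimal sketch, assuming psycopg2 and query lists similar to those defined in sql_queries.py; apart from db_host, the configuration key names are assumptions.

# minimal sketch of the staging + insert pattern (not the project's exact code)
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

# hypothetical query lists; in the project they are defined in sql_queries.py
copy_table_queries = []    # COPY ... FROM 's3://...' statements for the staging tables
insert_table_queries = []  # INSERT INTO ... SELECT statements for the star schema tables

# key names other than db_host are assumptions about dwh.cfg
conn = psycopg2.connect(
    host=config["CLUSTER_CREDENTIALS"]["db_host"],
    dbname=config["CLUSTER_CREDENTIALS"]["db_name"],
    user=config["CLUSTER_CREDENTIALS"]["db_user"],
    password=config["CLUSTER_CREDENTIALS"]["db_password"],
    port=config["CLUSTER_CREDENTIALS"]["db_port"],
)
cur = conn.cursor()

for query in copy_table_queries + insert_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()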
The sql_queries.py module defines all the SQL queries of the Project.
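For illustration, the query strings in sql_queries.py typically follow patterns like the ones below. This is a hedged sketch: the column names, join condition, and format placeholders are assumptions based on the standard Sparkify datasets, not necessarily the project's exact statements.

# illustrative query strings (assumptions; the real statements live in sql_queries.py)
staging_events_copy = """
COPY staging_events
FROM '{log_data}'
IAM_ROLE '{arn}'
FORMAT AS JSON '{log_json_path}'
REGION 'us-west-2';
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
       e.userId, e.level, s.song_id, s.artist_id,
       e.sessionId, e.location, e.userAgent
FROM staging_events e
JOIN staging_songs s
  ON e.song = s.title AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
"""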
The dwh.cfg file sets the Project configuration. It has four sections:
- CLUSTER_CREDENTIALS
  Defines the necessary variables to create and connect to the Redshift cluster.
- CLUSTER_INFRA
  Defines the cluster type (multi-node or single-node), the node type, and the number of nodes.
- IAM_ROLE
  Defines the name of the IAM role.
- S3
  Defines the S3 paths to the logs and songs data. It also defines the log JSON path.
Have a look at dwh.cfg.
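As a hedged example, dwh.cfg could be laid out roughly as follows. Only db_host and arn are named elsewhere in this README; every other key shown here is an assumption about the actual file.

[CLUSTER_CREDENTIALS]
# db_host is filled in by create_redshift_cluster.py
db_host =
db_name = <database name>
db_user = <database user>
db_password = <database password>
db_port = 5439

[CLUSTER_INFRA]
cluster_type = multi-node
node_type = dc2.large
number_of_nodes = 4

[IAM_ROLE]
iam_role_name = <IAM role name>
# arn is filled in by create_redshift_cluster.py
arn =

[S3]
log_data = <S3 path to the log data>
song_data = <S3 path to the song data>
log_json_path = <S3 path to the log JSON path file>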
We also implemented the Config class in the config module (see config.py) to make it easier to interact with the Project configuration.
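A minimal sketch of what such a wrapper could look like, assuming it is backed by configparser; the attribute and method names here are illustrative, not the project's actual API.

# illustrative sketch of a configuration wrapper (not the project's actual API)
import configparser

class Config:
    def __init__(self, path="dwh.cfg"):
        self._path = path
        self._parser = configparser.ConfigParser()
        self._parser.read(path)

    def get(self, section, key):
        """Return a single configuration value."""
        return self._parser[section][key]

    def set(self, section, key, value):
        """Update a value and persist it back to the file."""
        self._parser[section][key] = value
        with open(self._path, "w") as f:
            self._parser.write(f)

# usage example (hypothetical key names)
# cfg = Config()
# print(cfg.get("CLUSTER_INFRA", "node_type"))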
utils
├── __init__.py
├── helpers.py
└── manage_aws_resources.py
- utils/helpers.py
  Helper functions used in the Project.
- utils/manage_aws_resources.py
  Functions that help manage, create, and delete AWS resources.
Please check the modules' doc strings for more information.
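As an illustration of the kind of helper that could live in utils/manage_aws_resources.py, the sketch below creates and deletes a Redshift cluster with boto3. It is a minimal sketch under stated assumptions, not the project's actual functions.

# illustrative Redshift helpers (not the project's exact code)
import boto3

def create_cluster(cluster_identifier, node_type, number_of_nodes,
                   db_name, db_user, db_password, iam_role_arn,
                   region="us-west-2"):
    """Create a multi-node Redshift cluster and return the API response."""
    redshift = boto3.client("redshift", region_name=region)
    return redshift.create_cluster(
        ClusterIdentifier=cluster_identifier,
        ClusterType="multi-node",
        NodeType=node_type,
        NumberOfNodes=number_of_nodes,
        DBName=db_name,
        MasterUsername=db_user,
        MasterUserPassword=db_password,
        IamRoles=[iam_role_arn],
    )

def delete_cluster(cluster_identifier, region="us-west-2"):
    """Delete the cluster without taking a final snapshot."""
    redshift = boto3.client("redshift", region_name=region)
    return redshift.delete_cluster(
        ClusterIdentifier=cluster_identifier,
        SkipFinalClusterSnapshot=True,
    )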
Other files that compose the Project are:
- example.env
- .gitignore
- LICENSE
- .editorconfig
- imgs directory
Contains the PNG files displayed in the README
Any script can be run on its own:
$ python <script_name>.py
However, one only needs to execute the run_pipeline script to run any combination of scripts. It has the optional argument --steps that allows specifying which steps will be run.
Steps
- create_redshift_cluster
- create_tables
- etl
- analytics_queries
- delete_redshift_cluster
By default all steps are run when executing:
$ python run_pipeline.py
To execute steps 1 to 4, just run:
$ python run_pipeline.py --steps 1,2,3,4
This will not delete the Redshift cluster.
If one wants to create and delete the Redshift cluster, just run:
$ python run_pipeline.py --steps 1,5
The order of the steps matters, and the comma-separated steps are run one after the other.
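A minimal sketch of how the --steps argument could be parsed is shown below; the actual run_pipeline.py may differ, and the number-to-step mapping simply follows the list above.

# illustrative sketch of the --steps parsing (not necessarily the project's exact code)
import argparse

STEPS = {
    "1": "create_redshift_cluster",
    "2": "create_tables",
    "3": "etl",
    "4": "analytics_queries",
    "5": "delete_redshift_cluster",
}

parser = argparse.ArgumentParser(description="Run the Sparkify pipeline")
parser.add_argument("--steps", default="1,2,3,4,5",
                    help="comma-separated step numbers, run in the given order")
args = parser.parse_args()

for number in args.steps.split(","):
    step_name = STEPS[number.strip()]
    print(f"Running step {number}: {step_name}")
    # run_pipeline.py would execute the corresponding script here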
The purpose of the data warehouse is to clean, preprocess, and prepare Sparkify's log data so that the company's analytics team can analyze it efficiently. The warehouse is optimized for analytics queries rather than transactional ones. Accordingly, a star schema was designed, with a fact table surrounded by dimension tables. Since the data is not kept in 3NF, deep and complicated joins are avoided when fetching it. Another advantage is that large analytics queries are not run against the OLTP database, which avoids overloading the transactional system.
The songs database follows a star schema with a fact table, songplays, surrounded by dimension tables.
Fact table
- songplays
  Records in the event data associated with song plays, i.e., records with page NextSong.
  - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension tables
- users
  Users in the app.
  - user_id, first_name, last_name, gender, level
- songs
  Songs in the music database.
  - song_id, title, artist_id, year, duration
- artists
  Artists in the music database.
  - artist_id, name, location, latitude, longitude
- time
  Timestamps of records in songplays broken down into specific units.
  - start_time, hour, day, week, month, year, weekday
In this way, the analytics team can intuitively and easily create and run efficient queries against the database.
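For illustration, the fact table DDL in sql_queries.py could look roughly like the string below. This is a hedged sketch; the data types and the DISTKEY/SORTKEY choices are assumptions rather than the project's actual definition.

# illustrative DDL string in the style of sql_queries.py (assumptions, not the real DDL)
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL SORTKEY,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR DISTKEY,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""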
In the script analytics_queries.py, a few examples of analytics queries are presented. Run the script to answer the questions listed below.
This is done mostly for checking purposes. The expected result is shown in imgs/number_of_rows.png.
We want to answer the following questions:
- What is the user distribution by gender?
- What is the user distribution by level (free/paid)?
The last question is particularly interesting since we discover that there are many more free users than paid users in the app.
We want to answer the following questions:
- What is the distribution of song plays by user level?
- What is the distribution of song plays by hour?
Even though there are almost four times as many free users as paid ones, the paid users play more than four times as many songs.
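For illustration, the song-plays-by-level query could look roughly like the string below, in the style of analytics_queries.py; it is an assumption, not necessarily the exact query used.

# illustrative analytics query (an assumption, not the project's exact query)
songplays_by_level = """
SELECT level,
       COUNT(*) AS num_songplays
FROM songplays
GROUP BY level
ORDER BY num_songplays DESC;
"""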