# Data Warehouse

This project is part of Udacity's Nanodegree program "Data Engineering".

## Purpose

The purpose of this project is to support Sparkify by creating a database that enables in-depth songplay analysis. The data sources are various JSON files residing in S3.

## Data sources

The source data resides in S3 and comes as JSON files in the following formats:

**log_data** (user attributes and activities such as song plays):

{ "artist": "N.E.R.D. FEATURING MALICE",
  "auth": "Logged In",
  "firstName": "Jayden",
  "gender": "M",
  "itemInSession": 0,
  "lastName": "Fox",
  "length": 288.9922,
  "level": "free",
  "location": "New Orleans-Metairie, LA",
  "method": "PUT",
  "page":"NextSong",
  "registration": 1541033612796.0,
  "sessionId": 184,
  "song": "Am I High (Feat. Malice)",
  "status": 200,
  "ts": 1541121934796,
  "userAgent": "\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
  "userId": 101
}

**song_data** (known songs and their artists):

{ "num_songs": 1, 
  "artist_id": "ARJIE2Y1187B994AB7", 
  "artist_latitude": null, 
  "artist_longitude": null, 
  "artist_location": "", 
  "artist_name": "Line Renaud", 
  "song_id": "SOUPIRU12A6D4FA1E1", 
  "title": "Der Kleine Dompfaff", 
  "duration": 152.92036, 
  "year": 0
}

## Strategy

We first load the JSON files residing in S3 into staging tables and then use these staging tables to fill our final fact and dimension tables.
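As a rough illustration of this two-step approach, the sketch below shows a Redshift `COPY` into a staging table followed by an `INSERT INTO ... SELECT` into the fact table. The staging table names (`staging_events`, `staging_songs`) and the S3 paths are assumptions made for illustration; the actual statements live in `sql_queries.py`, and the real paths and IAM role come from `dwh.cfg`.

```sql
-- Minimal sketch of the two-step strategy. Staging table names and
-- S3 paths are assumed placeholders, not the repository's exact values.

-- Step 1: bulk-load the raw JSON logs from S3 into a staging table.
COPY staging_events
FROM 's3://<log-data-bucket>/log_data'
IAM_ROLE '<ARN from dwh.cfg>'
FORMAT AS JSON 's3://<log-data-bucket>/log_json_path.json';

-- Step 2: fill the fact table from the staging tables, converting the
-- millisecond epoch field ts into a timestamp and keeping song plays only.
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
       e.userId,
       e.level,
       s.song_id,
       s.artist_id,
       e.sessionId,
       e.location,
       e.userAgent
FROM staging_events e
JOIN staging_songs s
  ON e.song = s.title
 AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
```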

## Schema (final tables)

We follow a star schema with the songplays table as the fact table. This enables performant queries and makes it easy to set up meaningful analytical queries; see the example query after the table definitions below.

### Fact table

**songplays** (extracted from log files and linked to artists and songs)

| Column | Type | Property |
| --- | --- | --- |
| songplay_id | integer | Primary key |
| start_time | timestamp | |
| user_id | integer | Not null |
| level | varchar | |
| song_id | varchar | Not null |
| artist_id | varchar | Not null |
| session_id | integer | Not null |
| location | varchar | |
| user_agent | varchar | |

### Dimension tables

**users** (users in the database)

| Column | Type | Property |
| --- | --- | --- |
| user_id | integer | Primary key |
| first_name | varchar | |
| last_name | varchar | |
| gender | varchar | |
| level | varchar | |

**songs** (known songs)

| Column | Type | Property |
| --- | --- | --- |
| song_id | varchar | Primary key |
| title | varchar | |
| artist_id | varchar | Not null |
| year | integer | |
| duration | numeric | |

**artists** (known artists)

| Column | Type | Property |
| --- | --- | --- |
| artist_id | varchar | Primary key |
| name | varchar | |
| location | varchar | |
| latitude | numeric | |
| longitude | numeric | |

**time** (timestamps of song plays split into different units)

| Column | Type | Property |
| --- | --- | --- |
| start_time | timestamp | Primary key |
| hour | integer | |
| day | integer | |
| week | integer | |
| month | integer | |
| year | integer | |
| weekday | integer | |
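To illustrate the kind of analysis this star schema supports, here is a hypothetical query (not part of the repository) that lists the five most-played songs by joining the fact table to two dimension tables:

```sql
-- Hypothetical example: the five most-played songs.
-- Table and column names follow the schema above.
SELECT s.title,
       a.name AS artist,
       COUNT(*) AS plays
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
JOIN artists a ON sp.artist_id = a.artist_id
GROUP BY s.title, a.name
ORDER BY plays DESC
LIMIT 5;
```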

## How to run

Set HOST and ARN in `dwh.cfg` (a sketch of the expected file layout is shown at the end of this section). Then open a Python console and call

```
run create_tables.py
```

to create the database tables. After that, populate the tables:

```
run etl.py
```

Have a look at `run_select_queries.ipynb` to see the first 20 entries of each fact and dimension table.
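For reference, `dwh.cfg` might look roughly like the sketch below. Only HOST and ARN are named in this README; the section names, remaining keys, and placeholder values are assumptions for illustration.

```ini
; Hypothetical layout of dwh.cfg -- only HOST and ARN are mentioned above;
; everything else here is an assumed placeholder.
[CLUSTER]
HOST=<redshift-cluster-endpoint>

[IAM_ROLE]
ARN=<iam-role-arn>

[S3]
LOG_DATA=s3://<bucket>/log_data
SONG_DATA=s3://<bucket>/song_data
```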

## Files in repository

| File | Description |
| --- | --- |
| create_tables.py | Functions to initially create (and drop) the tables |
| dwh.cfg | Configuration file to access Redshift, the database, and S3 |
| etl.py | ETL processes to transform the data residing in S3 into the database schema |
| README.md | This file |
| sql_queries.py | SQL statements to drop and create tables, insert data, and run SELECTs |
| run_select_queries.ipynb | Runs some SELECT queries for testing purposes |
