3. Data Lakes with Spark

Introduction

A startup called Sparkify has grown its user base and song database and now wants to move its data warehouse to a data lake. Their data resides in S3: a directory of JSON logs of user activity in the app, as well as a directory of JSON metadata on the songs in their application.

Goal

The goal of this project is to create a data lake by building an ETL pipeline that extracts data from S3, processes it using Spark, and loads the data back into S3 as a set of modeled tables.

Data Sources

The datasets used in this project are divided into two parts:

  1. Song Data: This dataset is a subset of real data from the Million Song Dataset. The files are in JSON format and contain metadata about a song and its artist. The data resides in S3 and can be accessed at s3://udacity-dend/song_data
  2. Log Data: The second dataset consists of log data in JSON format generated by an event simulator. The data resides in S3 and can be accessed at s3://udacity-dend/log_data (a Spark read sketch follows below)
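
The snippet below is a minimal sketch of how these two datasets could be read with Spark. The wildcard patterns under the bucket prefixes are assumptions about the usual layout of this dataset (song files nested in letter-based subfolders, log files organized by year and month) and may need adjusting.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Song metadata: one JSON object per file, nested under letter-based subfolders
# (the wildcard depth is an assumption about the dataset layout).
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# User activity logs: JSON files organized by year and month (assumed layout).
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

song_df.printSchema()
log_df.printSchema()
```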

Modeling Tables

  • For analysis purposes we will create a star schema and load data into fact and dimension tables built from the datasets described above (see the sketch after this list)
    • The fact table will be:
      • songplays: records in the event data associated with song plays
    • The dimension tables will be:
      • users: users of the Sparkify app
      • songs: songs in the application's music database
      • artists: artists in the application's music database
      • time: timestamps of records in the songplays table, broken down into specific units
  • For the detailed design of these tables' data types and primary keys, please check [here](Link to read me for project 1)
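
Continuing from the read sketch above, the following is a rough illustration of how the time dimension and the songplays fact table could be derived from the log data. The column names and join keys (song title and artist name) are assumptions based on the table descriptions here; the authoritative schema is in the linked design document.

```python
from pyspark.sql import functions as F

# Keep only actual song plays from the event log (assumed filter on the `page` field).
events = log_df.filter(F.col("page") == "NextSong")

# time dimension: the log timestamp (assumed to be in milliseconds) broken into units.
time_table = (
    events
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    .select(
        "start_time",
        F.hour("start_time").alias("hour"),
        F.dayofmonth("start_time").alias("day"),
        F.weekofyear("start_time").alias("week"),
        F.month("start_time").alias("month"),
        F.year("start_time").alias("year"),
        F.dayofweek("start_time").alias("weekday"),
    )
    .dropDuplicates(["start_time"])
)

# songplays fact table: events joined to song metadata on title and artist name (assumed keys).
songplays_table = (
    events
    .join(
        song_df,
        (events.song == song_df.title) & (events.artist == song_df.artist_name),
        "left",
    )
    .select(
        (events.ts / 1000).cast("timestamp").alias("start_time"),
        events.userId.alias("user_id"),
        events.level,
        song_df.song_id,
        song_df.artist_id,
        events.sessionId.alias("session_id"),
        events.location,
        events.userAgent.alias("user_agent"),
    )
    .withColumn("songplay_id", F.monotonically_increasing_id())
)
```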

Project Guidelines

  • dl.cfg - The configuration file that stores the AWS credentials read by the pipeline.
  • etl.py - The pipeline script that reads the source data from S3, transforms it with Spark, and writes the modeled tables back to S3 as Parquet files (see the sketch below).
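
A minimal sketch of how dl.cfg might be read and how one of the modeled tables could be written back to S3 as partitioned Parquet. The config section and key names, the output bucket, and the partition columns are assumptions rather than the project's exact layout.

```python
import configparser
import os

# Load AWS credentials from dl.cfg (section and key names are assumptions);
# in etl.py this would typically happen before the SparkSession is created.
config = configparser.ConfigParser()
config.read("dl.cfg")
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

# Build the songs dimension from the song metadata read earlier and write it back
# to S3 as Parquet, partitioned by year and artist (the output bucket is a placeholder).
songs_table = (
    song_df
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://your-output-bucket/songs/")
```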

Steps to run this Project

  • There are two ways to run this project:
    • Create an EMR cluster on AWS and run etl.py on the cluster. The AWS credentials can be updated in dl.cfg.
    • Run etl.py on a local machine, which will take longer than running it on an EMR cluster in AWS (see the configuration sketch below).
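
For the local option, here is a minimal sketch of how the SparkSession could be configured so that s3a:// paths are readable outside EMR; the hadoop-aws package and its version are assumptions and must match the local Spark/Hadoop installation.

```python
from pyspark.sql import SparkSession

# Local SparkSession with the S3A connector pulled in at startup
# (the package version is an assumption; match it to the local Hadoop build).
spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .getOrCreate()
)
```

On an EMR cluster this extra configuration is typically unnecessary, since the S3 connector and instance-profile credentials are already available.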