3. Data Lakes with Spark

Introduction

A startup called Sparkify has grown its user base and song database and now wants to move its data warehouse to a data lake. Their data resides in S3: a directory of JSON logs of user activity in the app, as well as a directory of JSON metadata on the songs in their application.

Goal

The goal of this project is to create a data lake by building an ETL pipeline that extracts data from S3, processes it using Spark, and loads the data back into S3 as a set of modeled tables.

Data Sources

The datasets used in this project are divided into two parts:

  1. Song Data: This dataset is a subset of real data from the Million Song Dataset. The files are in JSON format and contain metadata about a song and its artist. The data resides in S3 and can be accessed at s3://udacity-dend/song_data
  2. Log Data: The second dataset consists of log data in JSON format generated by an event simulator. The data resides in S3 and can be accessed at s3://udacity-dend/log_data (a Spark read sketch follows below)
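
The snippet below is a minimal sketch of how these two datasets could be read with Spark. The wildcard patterns under the bucket prefixes are assumptions about the usual layout of this dataset (song files nested in letter-based subfolders, log files organized by year and month) and may need adjusting.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Song metadata: one JSON object per file, nested under letter-based subfolders
# (the wildcard depth is an assumption about the dataset layout).
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

# User activity logs: JSON files organized by year and month (assumed layout).
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

song_df.printSchema()
log_df.printSchema()
```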

Modeling Tables

  • For analysis purposes we will create a star schema and load data into fact and dimension tables built from the datasets described above (see the sketch after this list)
    • The fact table will be:
      • songplays: records in the event data associated with song plays
    • The dimension tables will be:
      • users: users of the Sparkify app
      • songs: songs in the application's music database
      • artists: artists in the application's music database
      • time: timestamps of records in the songplays table, broken down into specific units
  • For the detailed design of these tables' data types and primary keys, please check [here](Link to read me for project 1)
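
Continuing from the read sketch above, the following is a rough illustration of how the time dimension and the songplays fact table could be derived from the log data. The column names and join keys (song title and artist name) are assumptions based on the table descriptions here; the authoritative schema is in the linked design document.

```python
from pyspark.sql import functions as F

# Keep only actual song plays from the event log (assumed filter on the `page` field).
events = log_df.filter(F.col("page") == "NextSong")

# time dimension: the log timestamp (assumed to be in milliseconds) broken into units.
time_table = (
    events
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
    .select(
        "start_time",
        F.hour("start_time").alias("hour"),
        F.dayofmonth("start_time").alias("day"),
        F.weekofyear("start_time").alias("week"),
        F.month("start_time").alias("month"),
        F.year("start_time").alias("year"),
        F.dayofweek("start_time").alias("weekday"),
    )
    .dropDuplicates(["start_time"])
)

# songplays fact table: events joined to song metadata on title and artist name (assumed keys).
songplays_table = (
    events
    .join(
        song_df,
        (events.song == song_df.title) & (events.artist == song_df.artist_name),
        "left",
    )
    .select(
        (events.ts / 1000).cast("timestamp").alias("start_time"),
        events.userId.alias("user_id"),
        events.level,
        song_df.song_id,
        song_df.artist_id,
        events.sessionId.alias("session_id"),
        events.location,
        events.userAgent.alias("user_agent"),
    )
    .withColumn("songplay_id", F.monotonically_increasing_id())
)
```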

Project Guidelines

  • dl.cfg - The configuration file that stores the AWS credentials read by the pipeline.
  • etl.py - The pipeline script that reads the source data from S3, transforms it with Spark, and writes the modeled tables back to S3 as Parquet files (see the sketch below).
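
A minimal sketch of how dl.cfg might be read and how one of the modeled tables could be written back to S3 as partitioned Parquet. The config section and key names, the output bucket, and the partition columns are assumptions rather than the project's exact layout.

```python
import configparser
import os

# Load AWS credentials from dl.cfg (section and key names are assumptions);
# in etl.py this would typically happen before the SparkSession is created.
config = configparser.ConfigParser()
config.read("dl.cfg")
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

# Build the songs dimension from the song metadata read earlier and write it back
# to S3 as Parquet, partitioned by year and artist (the output bucket is a placeholder).
songs_table = (
    song_df
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://your-output-bucket/songs/")
```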

Steps to run this Project

  • There are two ways to run this project:
    • Create an EMR cluster on AWS and run etl.py on the cluster. The AWS credentials can be updated in dl.cfg.
    • Run etl.py on a local machine, which will take longer than running it on an EMR cluster in AWS (see the configuration sketch below).
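
For the local option, here is a minimal sketch of how the SparkSession could be configured so that s3a:// paths are readable outside EMR; the hadoop-aws package and its version are assumptions and must match the local Spark/Hadoop installation.

```python
from pyspark.sql import SparkSession

# Local SparkSession with the S3A connector pulled in at startup
# (the package version is an assumption; match it to the local Hadoop build).
spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .getOrCreate()
)
```

On an EMR cluster this extra configuration is typically unnecessary, since the S3 connector and instance-profile credentials are already available.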