The purpose of this project is to create an open analytical database for consumption by the US Government, so that it can understand immigration patterns.
The analytics team wants to help stakeholders make and monitor policy decisions around immigration, so that industries dependent on immigration, such as tourism, business, and international education, can thrive. Immigration patterns can also inform infrastructure decisions. The analytical database should serve both daily and long-term decision-making needs. The team is typically interested in how people travel to the USA, for what purpose, from where, and in which season. They also want to know whether people coming from overseas have a preference for where they live, and whether that preference depends on the demographic profile of the region.
- I94 Immigration Data: This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace.
- World Temperature Data: This dataset came from Kaggle. You can read more about it here.
- U.S. City Demographic Data: This data comes from OpenSoft. You can read more about it here.
- Airport Code Table: This is a simple table of airport codes and corresponding cities. It comes from here.
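As a rough, hedged illustration of how these sources might be loaded into Spark, assuming the spark-sas7bdat package is available on the cluster and using placeholder file paths that may differ from the actual workspace layout:

```python
from pyspark.sql import SparkSession

# Build a Spark session; the spark-sas7bdat package is assumed to be available
# so the I94 SAS files can be read directly.
spark = (
    SparkSession.builder
    .appName("i94-capstone-exploration")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .getOrCreate()
)

# Placeholder input paths -- adjust to the actual workspace/S3 locations.
i94_df = spark.read.format("com.github.saurfang.sas.spark") \
    .load("i94_apr16_sub.sas7bdat")
temperature_df = spark.read.csv(
    "GlobalLandTemperaturesByCity.csv", header=True, inferSchema=True
)
# The demographics CSV is assumed to be semicolon-delimited.
demographics_df = spark.read.csv(
    "us-cities-demographics.csv", header=True, sep=";", inferSchema=True
)
airport_df = spark.read.csv("airport-codes_csv.csv", header=True, inferSchema=True)
```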
This section covers the tools used to deliver the project.
- Amazon EMR to process the data and write the output to S3
- PySpark to process the data and carry out the ETL
- Bash to create and tear down the infrastructure
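Continuing the read sketch above, the EMR/PySpark/S3 combination boils down to processing dataframes on the cluster and writing the results back to S3 as parquet; the bucket name and partition columns below are placeholders:

```python
# i94_df is the dataframe loaded in the earlier sketch; the output bucket is hypothetical.
output_path = "s3a://my-capstone-output-bucket/immigration_fact/"

(
    i94_df
    .write
    .mode("overwrite")
    .partitionBy("i94yr", "i94mon")   # e.g. partition by arrival year and month
    .parquet(output_path)
)
```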
The data dictionary is located in dataDictionary.txt here.
- etl.py - Contains all the logic to extract the data from S3, process it with Spark, and load the results as parquet files into the S3 bucket and region specified by the user (see the sketch after this list).
- sparkSubmit.sh - Shell script that is executed on the EMR cluster; it calls spark-submit to run etl.py on the cluster.
- createCluster.sh - Contains the pipeline that automates the infrastructure requirements: it creates the EMR cluster, loads etl.py onto the cluster, and copies the data from the Udacity S3 bucket into the user-specified S3 bucket.
- terminateCluster.sh - Contains the pipeline to destroy the infrastructure associated with the project.
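For orientation, a minimal, hypothetical sketch of how etl.py could be structured is shown below; the function names, CLI arguments, and paths are assumptions, not the actual implementation.

```python
import argparse
from pyspark.sql import SparkSession


def create_spark_session():
    """Create (or reuse) a Spark session on the EMR cluster."""
    return SparkSession.builder.appName("i94-etl").getOrCreate()


def process_immigration_data(spark, input_path, output_path):
    """Read the raw I94 data, clean it, and write it to S3 as parquet."""
    df = spark.read.parquet(input_path)  # placeholder: the real source may be SAS/CSV
    df = df.dropDuplicates().na.drop(subset=["cicid"])
    df.write.mode("overwrite").parquet(output_path)


def main():
    # The output-bucket argument is an assumption about the script's interface.
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-bucket", required=True)
    args = parser.parse_args()

    spark = create_spark_session()
    process_immigration_data(
        spark,
        input_path="s3a://source-bucket/i94/",
        output_path=f"s3a://{args.output_bucket}/immigration/",
    )


if __name__ == "__main__":
    main()
```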
Capstone Project Template.ipynb contains all the data exploration and pipeline-building steps, including the quality checks, and answers all the project-related questions in more detail.
helperFunctions.py contains helper functions used by the Spark jobs.
i94MetadataMappings.py contains mappings of the fields and codes used in the I94 immigration data. These mappings are leveraged in the Spark job to perform the ETL and build some of the dimension tables.
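Purely as an illustration, the mappings might look like the excerpt below (the code-to-label pairs follow the spirit of the I94 data dictionary but should be checked against it), and a dimension table can be built directly from one of them:

```python
from pyspark.sql import SparkSession

# Hypothetical excerpts of the code-to-label mappings described in the I94 data dictionary.
i94_visa_mapping = {1: "Business", 2: "Pleasure", 3: "Student"}
i94_mode_mapping = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}

spark = SparkSession.builder.appName("i94-dimensions").getOrCreate()

# Turn a mapping into a small dimension table that the fact table can join to.
visa_dim = spark.createDataFrame(
    [(code, label) for code, label in i94_visa_mapping.items()],
    schema=["visa_code", "visa_type"],
)
visa_dim.write.mode("overwrite").parquet("s3a://my-capstone-output-bucket/visa_dim/")
```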
qualityTests.py contains quality tests used by the ETL to assess the data, so that no poor-quality data is loaded.
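As a sketch of what such a test can look like (the function names here are hypothetical, not the actual contents of qualityTests.py):

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def check_row_count(df: DataFrame, table_name: str) -> None:
    """Fail the pipeline if a table produced by the ETL is empty."""
    count = df.count()
    if count == 0:
        raise ValueError(f"Quality check failed: {table_name} contains no rows")
    print(f"Quality check passed: {table_name} has {count} rows")


def check_no_nulls(df: DataFrame, table_name: str, key_columns: list) -> None:
    """Fail the pipeline if any key column contains null values."""
    for column in key_columns:
        nulls = df.filter(F.col(column).isNull()).count()
        if nulls > 0:
            raise ValueError(
                f"Quality check failed: {table_name}.{column} has {nulls} null values"
            )
    print(f"Quality check passed: no nulls in the key columns of {table_name}")
```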
- Create the editor role in AWS IAM.
- Configure the AWS CLI with the editor role's access key and secret.
- Create the SSH key pair for the EC2 instances using the AWS CLI and give it a name such as my-key-pair. Make sure the key is stored in the root directory and is created in the same region in which the EMR cluster/EC2 instances will be created.
  aws ec2 create-key-pair --key-name my-key-pair --query "KeyMaterial" --output text > my-key-pair.pem
- If you're connecting to your instance from a Linux computer, it is recommended that you use the following command to set the permissions of your private key file so that only you can read it.
  chmod 400 my-key-pair.pem
- Open a terminal.
- Run the createCluster.sh script to create the EMR cluster and execute the Spark job on it. Pass the cluster name as the first argument and the name of the key pair associated with the EC2 instances as the second (see the sketch after these steps).
  bash createCluster.sh <cluster_name> <keyName>
- Run terminateCluster.sh to terminate the cluster.
  bash terminateCluster.sh
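The two shell scripts wrap the AWS CLI; the boto3 sketch below shows roughly the same cluster lifecycle in Python. The release label, instance types, and role names are assumptions, not the project's actual configuration.

```python
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # region is a placeholder

# Roughly what createCluster.sh does: spin up an EMR cluster with Spark installed.
response = emr.run_job_flow(
    Name="i94-capstone-cluster",
    ReleaseLabel="emr-5.28.0",                      # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]

# Roughly what terminateCluster.sh does: tear the cluster down once the job has finished.
emr.terminate_job_flows(JobFlowIds=[cluster_id])
```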
Enhance the project by adding another step to the pipeline that pushes the data from S3 into a Redshift cluster (see the sketch below).
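One possible shape for that enhancement, assuming the parquet output already sits in S3 and an IAM role with S3 read access is attached to the Redshift cluster (the connection details, bucket, role ARN, and table name are placeholders):

```python
import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-redshift-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="replace-me",
)
conn.autocommit = True

# COPY the parquet output produced by the Spark job straight into a Redshift table.
copy_sql = """
    COPY immigration_fact
    FROM 's3://my-capstone-output-bucket/immigration_fact/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS PARQUET;
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)

conn.close()
```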
Nakul Bajaj