
Data Engineer Code Challenge

General notes

Since the task leaves a lot of room for assumptions and for the choice of technologies, I decided to go 100% AWS for the architecture. Although Google BigQuery seems to be the more popular choice for big data storage, I was curious to explore the possibilities of Redshift.

The other AWS technologies used in this prototype can be replaced with similar tools from other vendors with relative ease.

Configuration for connections to AWS resources is placed in settings.ini. To connect to AWS resources I use the boto3 library, which by default reads the access key and secret from the $HOME/.aws/ folder or from environment variables. For consistency, however, the scripts read credentials from the same settings file.
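As an illustration, a minimal sketch of how a script might read those settings and build a boto3 session (the section and option names below are assumptions, not necessarily the ones used in settings.ini):

```python
# Minimal sketch: read credentials from settings.ini and build a boto3 session.
# The [aws] section and option names are assumptions for illustration only.
import configparser

import boto3

config = configparser.ConfigParser()
config.read("settings.ini")

session = boto3.session.Session(
    aws_access_key_id=config["aws"]["access_key_id"],
    aws_secret_access_key=config["aws"]["secret_access_key"],
    region_name=config["aws"]["region"],
)
sqs = session.client("sqs")  # the same session can create S3 and other clients
```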

Events simulation

To simulate the flow of events I use the generate_events.py script, which has simple settings (in settings.ini) that control the frequency of event generation. Amazon SQS is used here as the event bus. The generation script assumes the queue already exists.
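A rough sketch of what such a generator could look like (the queue name, event fields and timing below are assumptions for illustration):

```python
# Hypothetical sketch of an event generator publishing JSON events to SQS.
import json
import random
import time
import uuid
from datetime import datetime, timezone

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="events")["QueueUrl"]  # assumed queue name

while True:
    event = {
        "id": str(uuid.uuid4()),
        "type": random.choice(["registration", "submission"]),  # assumed event types
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(event))
    time.sleep(random.uniform(0.1, 2.0))  # frequency would come from settings.ini
```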

Events ingestion a.k.a. consumption

consume_events.py, the script that consumes messages and stores them on a DFS, can run either:

  • non-stop on a hosted machine (self-hosted, EC2, Elastic Beanstalk, Google App Engine, etc.) -- in this case long polling is possible. Such scripts can also run in parallel on several hosts or as several processes on the same machine.
  • as a serverless solution such as Lambda

The decision depends mainly on the expected load. In this implementation I use a self-hosted script with configurable long polling.
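Under those assumptions, the consumption loop could look roughly like this (queue and bucket names, key layout and batch size are illustrative):

```python
# Hypothetical sketch of a long-polling SQS consumer that stores raw events on S3.
import uuid
from datetime import datetime, timezone

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
queue_url = sqs.get_queue_url(QueueName="events")["QueueUrl"]  # assumed queue name
bucket = "raw-events"                                          # assumed bucket name

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling; in the real script this is configurable
    )
    for msg in resp.get("Messages", []):
        key = "events/{:%Y/%m/%d}/{}.json".format(datetime.now(timezone.utc), uuid.uuid4())
        s3.put_object(Bucket=bucket, Key=key, Body=msg["Body"].encode("utf-8"))
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```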

Tasks scheduling

For a hosted solution the easiest way to schedule tasks is cron, as it is a natural part of nearly all *nix operating systems. There are plenty of alternatives, including cloud solutions such as Google Cloud Scheduler, AWS Batch or CloudWatch Events. I use cron here because it is easy to replicate with any other solution.

Upload into the data warehouse

Disclaimer: I could not fully test this part of the task as I don't have access to a real Redshift cluster.

As Redshift is based on PostgreSQL and is compatible with its connection drivers, I'm using psycopg2 here.

The assumed table structure can be found in create_table.sql. Columns marked NOT NULL are either necessary to fulfil the needs of the task or a logical assumption. VARCHAR lengths are also assumptions based on the example data.

To keep scheduling flexible, the upload_data.py script fetches all available files from S3, loads them into Redshift and then removes them from S3 so that the next run will not upload the same data and cause unnecessary duplication. In a real project it might be beneficial not to delete the data from S3, but to move uploaded files to a separate bucket instead.
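The core of that load could look roughly like this (cluster endpoint, table name, S3 path, IAM role and file format are all assumptions; the real SQL lives in the repository):

```python
# Hypothetical sketch of loading raw files from S3 into Redshift via COPY.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.eu-central-1.redshift.amazonaws.com",  # assumed endpoint
    port=5439,
    dbname="events",
    user="loader",
    password="...",
)
with conn, conn.cursor() as cur:
    # COPY pulls the JSON files directly from S3 into the target table;
    # the connection context manager commits on successful exit.
    cur.execute("""
        COPY events
        FROM 's3://raw-events/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto';
    """)
```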

This script is scheduled to run every 5 minutes.
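For illustration, the crontab entries for the periodic jobs (this upload and the visualization job described below) might look like this; the paths and log redirection are assumptions:

```
# Hypothetical crontab; actual paths depend on where the project is deployed.
*/5 * * * *  cd /opt/taxfix-challenge && python upload_data.py >> upload.log 2>&1
0 */2 * * *  cd /opt/taxfix-challenge && python visualize.py >> visualize.log 2>&1
```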

Visualization

Data visualization depends heavily on the requirements and on the tools and processes already in use at the company. Tools like Tableau are often the choice of bigger companies, while libraries like D3 allow in-house development of interactive graphics.

Since no specific information was given for this part of the challenge, I decided to provide a simple visualization via matplotlib. Each time the visualize.py script runs, it saves the resulting graphs as files in the same directory for later reference. It is scheduled to run every 2 hours.
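A rough sketch of such a script (the metric, query, column names and output file are assumptions, not necessarily what visualize.py does):

```python
# Hypothetical sketch: plot hourly event counts from Redshift and save the figure.
import matplotlib
matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt
import psycopg2

conn = psycopg2.connect(host="...", port=5439, dbname="events", user="...", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT date_trunc('hour', created_at) AS hour, count(*)
        FROM events
        GROUP BY 1
        ORDER BY 1;
    """)
    rows = cur.fetchall()

hours = [row[0] for row in rows]
counts = [row[1] for row in rows]

plt.figure(figsize=(10, 4))
plt.plot(hours, counts)
plt.title("Events per hour")
plt.tight_layout()
plt.savefig("events_per_hour.png")  # saved next to the script, as described above
```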

How to run the scripts locally

All scripts can be run locally for testing purposes using Docker. I use a single Docker container that runs several processes; this approach is generally not recommended and is definitely not a production solution.

  1. Edit the settings.ini file to set real credentials and other settings
  2. Build the Docker image with docker build --rm -t taxfix/challenge .
  3. Run it with docker run -t -i taxfix/challenge

All scripts' output will be printed to the console.

Please note: all AWS resources (the S3 bucket, SQS queue and Redshift cluster) must be created manually beforehand.

Potential improvements / next steps

  • add unit tests
  • improve error handling
  • move malformed events to a dead-letter queue for later analysis
  • better organisation of file handling for raw events (described above)
  • choose a better option for visualization or at least move graph images to a DFS
  • improve log collection for periodic tasks
