This is a cloud-based data streaming solution for multi-region sales data.
- Access to Amazon Web Services (AWS) Cloud Computing Services.
- Access to Amazon Simple Storage Service (S3).
- Access to AWS Glue Service.
- Create Glue Database called sales_db that points to the S3 bucket prefix Streaming/processed
- Create Glue Table called sales, using the column names of the sales data CSV files to define the table schema (see the boto3 sketch after this Glue section)
- Create Glue ETL Job called sales_etl_job and copy the script in /PopulateKinesisFirehose/push_data.py into the job's script editor
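A minimal boto3 sketch of the Glue steps above, assuming the EU (Ireland) region used later for Firehose; the column list is hypothetical and should be replaced with the actual sales CSV headers:

```python
# Minimal sketch of the Glue setup above, using boto3.
import boto3

glue = boto3.client("glue", region_name="eu-west-1")  # EU (Ireland)

# Database pointing at the processed-data prefix.
glue.create_database(DatabaseInput={"Name": "sales_db"})

# External table over s3://Streaming/processed, one column per CSV header.
glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "sales",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [  # hypothetical columns; use the real CSV headers
                {"Name": "region", "Type": "string"},
                {"Name": "product", "Type": "string"},
                {"Name": "quantity", "Type": "int"},
                {"Name": "price", "Type": "double"},
            ],
            "Location": "s3://Streaming/processed/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```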
- Access to Amazon Kinesis Firehose Service.
- Create a Delivery stream using Kinesis Firehose with the settings below (mirrored in the boto3 sketch after this settings list)
- Configure Source to be Direct PUT or other sources
- Set Record transformation to Disabled
- Set Record format conversion to Enabled
- Set Output format to Apache Parquet
- Set AWS Glue region to EU (Ireland)
- Set AWS Glue database to sales_db
- Set AWS Glue table to sales
- Set AWS Glue table version to Latest
- Set Destination to Amazon S3
- Set S3 bucket to Streaming
- Set S3 prefix to processed
- Set S3 error prefix to backup
- Set Buffer size to 128 MB
- Set Buffer interval to 300 seconds
- Set S3 encryption to Disabled
- Set Error logging to Enabled
- Set Tags (Optional)
- Create an IAM role with access to the S3 bucket where the streamed data will be stored, along with AWS Glue and Kinesis Firehose service access
- Assign the IAM role to the delivery stream
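The delivery-stream settings above, expressed as a boto3 sketch. The stream name and role ARN are placeholders, and because Firehose record format conversion deserializes JSON input, the producer is assumed to push each CSV row serialized as JSON:

```python
# Sketch only: mirrors the console settings listed above.
# sales_delivery_stream and the role ARN are placeholders.
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")
ROLE_ARN = "arn:aws:iam::123456789012:role/firehose_delivery_role"  # placeholder

firehose.create_delivery_stream(
    DeliveryStreamName="sales_delivery_stream",
    DeliveryStreamType="DirectPut",  # Source: Direct PUT or other sources
    ExtendedS3DestinationConfiguration={
        "RoleARN": ROLE_ARN,
        "BucketARN": "arn:aws:s3:::Streaming",
        "Prefix": "processed/",
        "ErrorOutputPrefix": "backup/",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "EncryptionConfiguration": {"NoEncryptionConfig": "NoEncryption"},
        "CloudWatchLoggingOptions": {  # error logging enabled
            "Enabled": True,
            "LogGroupName": "/aws/kinesisfirehose/sales_delivery_stream",
            "LogStreamName": "S3Delivery",
        },
        # Record format conversion: JSON records in, Apache Parquet out,
        # using the sales_db/sales schema registered in Glue (EU Ireland).
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": ROLE_ARN,
                "Region": "eu-west-1",
                "DatabaseName": "sales_db",
                "TableName": "sales",
                "VersionId": "LATEST",
            },
        },
    },
)
```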
- Access to AWS Lambda Service.
- Create Lambda function called invoke_glue_etl
- Set Runtime to Python 3.7
- Assign an IAM role with Lambda invocation and AWS Glue service access
- Add an S3 trigger on the Streaming bucket's PUT (object-created) event
- Copy the code in /Lambda_Function/lambda_function.py into the Lambda function's script editor (a minimal sketch follows this Lambda section)
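The authoritative handler lives in /Lambda_Function/lambda_function.py; the sketch below only illustrates the assumed behaviour, which is to start sales_etl_job whenever a new object lands in the Streaming bucket:

```python
# Minimal sketch of an S3-triggered handler that starts the Glue ETL job.
# The actual code is in /Lambda_Function/lambda_function.py.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each S3 PUT event carries one or more records identifying the new object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object s3://{bucket}/{key}; starting sales_etl_job")

    # Kick off the Glue ETL job created earlier.
    response = glue.start_job_run(JobName="sales_etl_job")
    return {"JobRunId": response["JobRunId"]}
```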
- Access to Amazon Elastic Compute Cloud (EC2).
- Start an EC2 instance using an Amazon Linux AMI
- Assign an IAM role with access to the Kinesis Firehose service
- Install or update Python 3
- Install pandas using the pip package manager
- Create a Python file called push_data.py
- Copy the code in /PopulateKinesisFirehose/push_data.py into the file (a minimal sketch follows this EC2 section)
- The Python program used to preprocess data before pushing it to Kinesis Firehose runs on a single EC2 instance, which demands a large instance type once data volumes grow. Big-data processing on a single-instance architecture has cost and throughput implications; alternatively, auto scaling or a distributed computing framework such as Apache Spark on the EMR platform can be used for data preprocessing (a hypothetical PySpark sketch appears at the end of this document).
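A minimal sketch of the single-instance producer, assuming /PopulateKinesisFirehose/push_data.py reads the sales CSVs with pandas and pushes rows to the delivery stream; the file name and stream name are placeholders:

```python
# Sketch of the preprocess-and-push loop; the real logic lives in
# /PopulateKinesisFirehose/push_data.py. File and stream names are placeholders.
import json

import boto3
import pandas as pd

firehose = boto3.client("firehose", region_name="eu-west-1")

# Load a sales CSV and apply light preprocessing (example step only).
df = pd.read_csv("sales.csv")
df = df.dropna()

# Firehose record format conversion expects JSON input, so each CSV row
# is serialized as a JSON object before being pushed.
for _, row in df.iterrows():
    firehose.put_record(
        DeliveryStreamName="sales_delivery_stream",
        Record={"Data": (json.dumps(row.to_dict()) + "\n").encode("utf-8")},
    )
```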
- Region data is static and is stored in the S3 location Data/Region
- All sales data is stored on, or pushed to, the EC2 instance as CSV files
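For reference, a hypothetical sketch of the distributed alternative mentioned above, with the same preprocessing expressed as a PySpark job on EMR; the input and output prefixes are assumptions, not part of this setup:

```python
# Hypothetical PySpark job on EMR replacing the single-instance preprocessing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales_preprocessing").getOrCreate()

# Read all sales CSVs in parallel and apply the same light preprocessing.
df = spark.read.csv("s3://Streaming/raw/", header=True, inferSchema=True)
df = df.dropna()

# Write the cleaned data back to S3 for downstream ingestion; a job like this
# could also write Parquet straight to Streaming/processed, bypassing Firehose.
df.write.mode("overwrite").parquet("s3://Streaming/preprocessed/")
```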