Multi-Region Sales Data Streaming Solution

Description

This is a cloud-based data streaming solution for multi-region sales data.

Architecture Diagram

Image of Architecture design

Prerequisites

  1. Access to Amazon Web Services (AWS) Cloud Computing Services.
  2. Access to Amazon Simple Storage Service (S3).
    • Create an S3 bucket called Data with the following folder structure:
      Image of S3 processed data bucket
    • Create an S3 bucket called Streaming to store data streams
    • Create sub-folders called processed and backup in the Streaming bucket
  3. Access to AWS Glue Service.
    • Create a Glue database called sales_db that points to the S3 prefix Streaming/processed
    • Create a Glue table called sales, using the column names of the sales data CSV files to define the table schema (a scripted sketch of this step appears after this list)
    • Create a Glue ETL job called sales_etl_job and paste the script in /PopulateKinesisFirehose/push_data.py into the job's script editor
  4. Access to Amazon Kinesis Firehose Service.
    • Create a delivery stream using Kinesis Firehose
    • Configure Source to be Direct PUT or other sources
    • Set Record transformation to Disabled
    • Set Record format conversion to Enabled
    • Set Output format to Apache Parquet
    • Set AWS Glue region to EU (Ireland)
    • Set AWS Glue database to sales_db
    • Set AWS Glue table to sales
    • Set AWS Glue table version to Latest
    • Set Destination to Amazon S3
    • Set S3 bucket to Streaming
    • Set S3 prefix to processed
    • Set S3 error prefix to backup
    • Set Buffer size to 128MB
    • Set Buffer interval to 300s
    • Set S3 encryption to Disabled
    • Set Error logging to Enabled
    • Set Tags (Optional)
    • Create an IAM role with access to the S3 bucket where data streams will be stored, plus Glue and Kinesis service access
    • Assign the IAM role to the delivery stream
  5. Access to AWS Lambda Service.
    • Create a Lambda function called invoke_glue_etl
    • Set Runtime to Python 3.7
    • Assign an IAM role with Lambda invocation and AWS Glue access
    • Add an S3 trigger on the Streaming bucket's put (object created) event
    • Copy and paste the code in /Lambda_Function/lambda_function.py into the Lambda function's script (a sketch of this handler is shown after this list)
  6. Access to Amazon Elastic Compute Cloud (EC2).
    • Launch an EC2 instance using an Amazon Linux AMI
    • Assign an IAM role with access to the Kinesis Firehose service
    • Install or update Python 3
    • Install pandas using the pip package manager
    • Create a Python file called push_data.py
    • Copy and paste the code in /PopulateKinesisFirehose/push_data.py into the file (a sketch of this producer script is shown after this list)
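
For readers who prefer to script the Glue setup in step 3 rather than use the console, the following is a minimal boto3 sketch. The column list is a placeholder assumption; the real schema should mirror the headers of the sales data CSV files.

```python
import boto3

# Hypothetical sketch of scripting step 3: create the sales_db database and
# the sales table pointing at the Streaming/processed prefix.
glue = boto3.client("glue", region_name="eu-west-1")

glue.create_database(DatabaseInput={"Name": "sales_db"})

glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "sales",
        "StorageDescriptor": {
            # Placeholder columns; use the actual headers of the sales CSV files
            "Columns": [
                {"Name": "region_id", "Type": "string"},
                {"Name": "sale_amount", "Type": "double"},
                {"Name": "sale_date", "Type": "string"},
            ],
            "Location": "s3://Streaming/processed/",
        },
    },
)
```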
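
The handler in /Lambda_Function/lambda_function.py is not reproduced here; the sketch below shows what such a handler could look like, assuming it simply starts the sales_etl_job Glue job when the S3 put trigger fires.

```python
import boto3

# Hypothetical sketch of the invoke_glue_etl handler: on an S3 put event
# from the Streaming bucket, start the sales_etl_job Glue ETL job.
glue = boto3.client("glue")

def lambda_handler(event, context):
    # Log which object triggered the run (bucket/key from the S3 event record)
    record = event["Records"][0]["s3"]
    print(f"New object: s3://{record['bucket']['name']}/{record['object']['key']}")

    # Start the Glue ETL job; the job name is an assumption based on the prerequisites
    response = glue.start_job_run(JobName="sales_etl_job")
    return {"JobRunId": response["JobRunId"]}
```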
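
Similarly, a minimal sketch of the kind of producer /PopulateKinesisFirehose/push_data.py runs on the EC2 instance is shown below; the delivery stream name and the local CSV file name are assumptions for illustration, not values taken from the repository.

```python
import boto3
import pandas as pd

# Hypothetical sketch: read a sales CSV with pandas and push each row to the
# Kinesis Firehose delivery stream as a newline-terminated JSON record.
firehose = boto3.client("firehose", region_name="eu-west-1")

STREAM_NAME = "sales-delivery-stream"  # assumed delivery stream name

def push_csv(path: str) -> None:
    df = pd.read_csv(path)
    for _, row in df.iterrows():
        # Firehose converts each record to Parquet using the Glue sales table schema
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={"Data": (row.to_json() + "\n").encode("utf-8")},
        )

if __name__ == "__main__":
    push_csv("sales_data.csv")  # assumed local CSV file name
```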

Limitations and Scaling Options

  • The Python program used to preprocess data before pushing it to Kinesis Firehose runs on a single EC2 instance, which requires a large instance once data volumes grow. Big data processing on a single-instance architecture has cost and processing implications; alternatively, auto-scaling or a distributed computing framework such as Apache Spark on the EMR platform can be used for data preprocessing (see the sketch below).
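
As a rough illustration of that alternative, a PySpark job running on EMR could perform equivalent preprocessing in a distributed fashion; the S3 paths, column names, and join key below are assumptions for illustration, not part of this solution.

```python
from pyspark.sql import SparkSession

# Hypothetical PySpark sketch of distributed preprocessing on EMR.
spark = SparkSession.builder.appName("sales-preprocessing").getOrCreate()

# Read raw sales CSVs and the static region data from S3 (paths are assumptions)
sales = spark.read.csv("s3://Streaming/raw/", header=True, inferSchema=True)
regions = spark.read.csv("s3://Data/Region/", header=True, inferSchema=True)

# Example preprocessing step: enrich sales records with region attributes
enriched = sales.join(regions, on="region_id", how="left")

# Write the preprocessed output back to S3 for downstream delivery
enriched.write.mode("overwrite").parquet("s3://Streaming/preprocessed/")
```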

Assumptions

  1. Region data is static and is stored in the S3 location Data/Region
  2. All sales data is stored on, or pushed to, the EC2 instance as CSV files
