This is a cloud-based data streaming solution for multi-region sales data.
- Access to Amazon Web Services (AWS) Cloud Computing Services.
- Access to Amazon Simple Storage Service (S3).
- Access to AWS Glue Service.
- Create Glue Database called sales_db that points to the S3 bucket prefix Streaming/processed
- Create Glue Table called sales, using the column names of the sales data CSV files to define the table schema (see the boto3 sketch after this Glue section)
- Create Glue ETL Job called sales_etl_job and copy the script in /PopulateKinesisFirehose/push_data.py into the job's script editor
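A minimal boto3 sketch of the Glue steps above, assuming the EU (Ireland) region used later for Firehose; the column list is hypothetical and should be replaced with the actual sales CSV headers:

```python
# Minimal sketch of the Glue setup above, using boto3.
import boto3

glue = boto3.client("glue", region_name="eu-west-1")  # EU (Ireland)

# Database pointing at the processed-data prefix.
glue.create_database(DatabaseInput={"Name": "sales_db"})

# External table over s3://Streaming/processed, one column per CSV header.
glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "sales",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [  # hypothetical columns; use the real CSV headers
                {"Name": "region", "Type": "string"},
                {"Name": "product", "Type": "string"},
                {"Name": "quantity", "Type": "int"},
                {"Name": "price", "Type": "double"},
            ],
            "Location": "s3://Streaming/processed/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```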
- Access to Amazon Kinesis Firehose Service.
- Create a Delivery stream using Kinesis Firehose with the settings below (mirrored in the boto3 sketch after this settings list)
- Configure Source to be Direct PUT or other sources
- Set Record transformation to Disabled
- Set Record format conversion to Enabled
- Set Output format to Apache Parquet
- Set AWS Glue region to EU (Ireland)
- Set AWS Glue database to sales_db
- Set AWS Glue table to sales
- Set AWS Glue table version to Latest
- Set Destination to Amazon S3
- Set S3 bucket to Streaming
- Set S3 prefix to processed
- Set S3 error prefix to backup
- Set Buffer size to 128 MB
- Set Buffer interval to 300 seconds
- Set S3 encryption to Disabled
- Set Error logging to Enabled
- Set Tags (Optional)
- Create an IAM role with access to the S3 bucket where the streamed data will be stored, along with AWS Glue and Kinesis Firehose service access
- Assign the IAM role to the delivery stream
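The delivery-stream settings above, expressed as a boto3 sketch. The stream name and role ARN are placeholders, and because Firehose record format conversion deserializes JSON input, the producer is assumed to push each CSV row serialized as JSON:

```python
# Sketch only: mirrors the console settings listed above.
# sales_delivery_stream and the role ARN are placeholders.
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")
ROLE_ARN = "arn:aws:iam::123456789012:role/firehose_delivery_role"  # placeholder

firehose.create_delivery_stream(
    DeliveryStreamName="sales_delivery_stream",
    DeliveryStreamType="DirectPut",  # Source: Direct PUT or other sources
    ExtendedS3DestinationConfiguration={
        "RoleARN": ROLE_ARN,
        "BucketARN": "arn:aws:s3:::Streaming",
        "Prefix": "processed/",
        "ErrorOutputPrefix": "backup/",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "EncryptionConfiguration": {"NoEncryptionConfig": "NoEncryption"},
        "CloudWatchLoggingOptions": {  # error logging enabled
            "Enabled": True,
            "LogGroupName": "/aws/kinesisfirehose/sales_delivery_stream",
            "LogStreamName": "S3Delivery",
        },
        # Record format conversion: JSON records in, Apache Parquet out,
        # using the sales_db/sales schema registered in Glue (EU Ireland).
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": ROLE_ARN,
                "Region": "eu-west-1",
                "DatabaseName": "sales_db",
                "TableName": "sales",
                "VersionId": "LATEST",
            },
        },
    },
)
```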
- Access to AWS Lambda Service.
- Create Lambda function called invoke_glue_etl
- Set Runtime to Python 3.7
- Assign an IAM role with Lambda invocation and AWS Glue service access
- Add an S3 trigger on the Streaming bucket's PUT (object-created) event
- Copy the code in /Lambda_Function/lambda_function.py into the Lambda function's script editor (a minimal sketch follows this Lambda section)
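The authoritative handler lives in /Lambda_Function/lambda_function.py; the sketch below only illustrates the assumed behaviour, which is to start sales_etl_job whenever a new object lands in the Streaming bucket:

```python
# Minimal sketch of an S3-triggered handler that starts the Glue ETL job.
# The actual code is in /Lambda_Function/lambda_function.py.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each S3 PUT event carries one or more records identifying the new object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object s3://{bucket}/{key}; starting sales_etl_job")

    # Kick off the Glue ETL job created earlier.
    response = glue.start_job_run(JobName="sales_etl_job")
    return {"JobRunId": response["JobRunId"]}
```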
- Access to Amazon Elastic Compute Cloud (EC2).
- Start an EC2 instance using an Amazon Linux AMI
- Assign an IAM role with access to the Kinesis Firehose service
- Install or update Python 3
- Install pandas using the pip package manager
- Create a Python file called push_data.py
- Copy the code in /PopulateKinesisFirehose/push_data.py into the file (a minimal sketch follows this EC2 section)
- The Python program used to preprocess data before pushing it to Kinesis Firehose runs on a single EC2 instance, which demands a large instance type once data volumes grow. Big-data processing on a single-instance architecture has cost and throughput implications; alternatively, auto scaling or a distributed computing framework such as Apache Spark on the EMR platform can be used for data preprocessing (a hypothetical PySpark sketch appears at the end of this document).
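A minimal sketch of the single-instance producer, assuming /PopulateKinesisFirehose/push_data.py reads the sales CSVs with pandas and pushes rows to the delivery stream; the file name and stream name are placeholders:

```python
# Sketch of the preprocess-and-push loop; the real logic lives in
# /PopulateKinesisFirehose/push_data.py. File and stream names are placeholders.
import json

import boto3
import pandas as pd

firehose = boto3.client("firehose", region_name="eu-west-1")

# Load a sales CSV and apply light preprocessing (example step only).
df = pd.read_csv("sales.csv")
df = df.dropna()

# Firehose record format conversion expects JSON input, so each CSV row
# is serialized as a JSON object before being pushed.
for _, row in df.iterrows():
    firehose.put_record(
        DeliveryStreamName="sales_delivery_stream",
        Record={"Data": (json.dumps(row.to_dict()) + "\n").encode("utf-8")},
    )
```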
- Region data is static and is stored in the S3 location Data/Region
- All sales data is stored on, or pushed to, the EC2 instance as CSV files
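For reference, a hypothetical sketch of the distributed alternative mentioned above, with the same preprocessing expressed as a PySpark job on EMR; the input and output prefixes are assumptions, not part of this setup:

```python
# Hypothetical PySpark job on EMR replacing the single-instance preprocessing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales_preprocessing").getOrCreate()

# Read all sales CSVs in parallel and apply the same light preprocessing.
df = spark.read.csv("s3://Streaming/raw/", header=True, inferSchema=True)
df = df.dropna()

# Write the cleaned data back to S3 for downstream ingestion; a job like this
# could also write Parquet straight to Streaming/processed, bypassing Firehose.
df.write.mode("overwrite").parquet("s3://Streaming/preprocessed/")
```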