- Ramon Almeida
- Derek Tak
- Ghazala Rehman
- Subha Vivekanandan
- May Alavi
The client's business is experiencing issues with collating and analysing the data produced at each branch, as their technical setup is limited. The software currently in use only generates reports for single branches, which makes collating data on all branches time-consuming. Because of these limitations, gathering meaningful data on the company as a whole is difficult. The company currently has no way of identifying trends, meaning it is potentially losing out on major revenue streams.
A fully scalable ETL (Extract, Transform, Load) pipeline that handles large volumes of transaction data for the business. The pipeline collects all the transaction data generated by each individual café and places it in a single location.
The client will have three CSV files per branch uploaded to AWS at 8pm every day, which are stored in the Cafe Data S3 bucket. The transformation Lambda is triggered by the S3 event raised when a café file is uploaded to the bucket. The Lambda extracts the CSV files and transforms them, then sends the transformed data to the Transformed Data S3 bucket. As soon as the transformed data is uploaded to the Transformed Data bucket, this constitutes another S3 event, which triggers the load Lambda that loads the data into Redshift. The client can then visualise, query and monitor the data using Grafana and Metabase.
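To make the trigger flow concrete, here is a minimal sketch of the transformation Lambda handler. It is not the project's actual code: the bucket name `cafe-transformed-data` and the cleaning rules are placeholder assumptions, and only the S3-event wiring reflects the design described above.

```python
import csv
import io
import urllib.parse

import boto3  # provided in the AWS Lambda Python runtime

s3 = boto3.client("s3")

TRANSFORMED_BUCKET = "cafe-transformed-data"  # placeholder bucket name


def handler(event, context):
    """Triggered by an S3 event when a café CSV lands in the raw bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 events are URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Extract: read the raw CSV from the Cafe Data bucket.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.reader(io.StringIO(body)))

        # Transform: illustrative cleaning only -- strip whitespace and
        # drop empty rows (the real rules depend on the café file format).
        cleaned = [[field.strip() for field in row] for row in rows if any(row)]

        out = io.StringIO()
        csv.writer(out).writerows(cleaned)

        # Writing to the Transformed Data bucket raises the second S3
        # event, which in turn triggers the load Lambda.
        s3.put_object(Bucket=TRANSFORMED_BUCKET, Key=key, Body=out.getvalue())
```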
- Lambda: enabled us to apply custom logic to the pipeline and works in real time.
- S3 bucket: a durable and easy-to-navigate storage service which enabled us to keep the entire pipeline on AWS.
- Redshift: a scalable and cost-effective data warehouse, appropriate for this pipeline as the data was continuously growing (see the load sketch after this list).
- EC2: enabled containers running external technologies to be hosted and accessed by all group members.
- Grafana: allows multiple data sources to be brought into one place, which is why we used it for Lambda metrics; however, it was less effective at connecting to Redshift.
- Metabase: an easy-to-navigate, user-friendly tool for visualising the Redshift database using integrated SQL queries.
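The sketch below shows one plausible shape for the load Lambda referenced above. It assumes Redshift is reached over a standard PostgreSQL connection via `psycopg2` (packaged with the function, e.g. as a layer), that connection details live in environment variables, and that the `transactions` table and the IAM role are placeholders; the project's real table layout may differ. It uses Redshift's `COPY` command, the idiomatic bulk-load path from S3.

```python
import os
import urllib.parse

import psycopg2  # must be bundled with the Lambda, e.g. via a layer


def handler(event, context):
    """Triggered by an S3 event when transformed data lands in the bucket."""
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        # `with conn` commits on success and rolls back on error.
        with conn, conn.cursor() as cur:
            for record in event["Records"]:
                bucket = record["s3"]["bucket"]["name"]
                key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
                # COPY reads the CSV straight from S3 using an attached
                # IAM role, rather than streaming rows through the Lambda.
                cur.execute(
                    f"""
                    COPY transactions
                    FROM 's3://{bucket}/{key}'
                    IAM_ROLE '{os.environ["REDSHIFT_IAM_ROLE"]}'
                    FORMAT AS CSV;
                    """
                )
    finally:
        conn.close()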
- Having to rewrite our structure and code entirely at the end of Sprint 1.
- Not having enough time for consistent and regular code reviews.
- Ensuring all group members got equal exposure to, and practice with, the key elements of the pipeline.
- Using GitHub consistently and as intended for the project.
- Implementing a queue system.
- Using Grafana for data visualisation successfully.
- Optimising the Lambda code.
- Using a test-driven approach.