The goal of this project was to securely ingest raw data from Twitter via the Twitter API, then transform and analyze it through an ETL process.
This data pipeline is designed to run daily. The Airflow DAG triggers the Python script (twitter_etl.py), ensuring that the latest data from the Twitter API is fetched regularly.
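A minimal sketch of what the daily DAG could look like is shown below. The DAG id, `default_args`, and the `run_twitter_etl` entry point imported from twitter_etl.py are assumptions for illustration, not the project's exact definitions.

```python
# Hypothetical daily DAG wrapping twitter_etl.py (names are illustrative).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from twitter_etl import run_twitter_etl  # assumed entry point in twitter_etl.py

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="twitter_etl_dag",             # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",           # run once per day
    catchup=False,
) as dag:
    run_etl = PythonOperator(
        task_id="run_twitter_etl",
        python_callable=run_twitter_etl,  # fetches tweets and runs the ETL
    )
```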
Upon successful extraction, the ETL process is triggered for the specified Twitter user. The data is transformed from JSON to CSV using a Pandas DataFrame, and the resulting CSV object is uploaded to the data lake (AWS S3).
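The sketch below illustrates this transform-and-load step under a few assumptions: the tweet fields, the helper name `transform_and_load`, and the bucket name `my-twitter-data-lake` are hypothetical, and writing to an `s3://` path with Pandas requires the `s3fs` package.

```python
# Illustrative transform/load step: flatten tweet JSON, write CSV to S3.
import pandas as pd

def transform_and_load(tweets: list[dict], bucket: str = "my-twitter-data-lake") -> None:
    # Flatten the raw JSON into the columns used for analysis (field names assumed).
    records = [
        {
            "user": t["user"]["screen_name"],
            "text": t["text"],
            "favorite_count": t["favorite_count"],
            "retweet_count": t["retweet_count"],
            "created_at": t["created_at"],
        }
        for t in tweets
    ]
    df = pd.DataFrame(records)
    # Upload the CSV object to the S3 data lake (requires s3fs).
    df.to_csv(f"s3://{bucket}/refined_tweets.csv", index=False)
```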
The transformed data can then be accessed with a visualization tool such as Tableau, QuickSight, Power BI, or Superset to build dashboards for various types of analysis.
The purpose of this project was to use the Twitter API, AWS services, and Airflow to create an automated data pipeline that efficiently processes user/tweet data and makes it available for analysis. It showcases the versatility of AWS services in building robust, automated data engineering solutions.