Data Ingestion Project

Introduction

This simple ETL skeleton is designed to ingest updated data from Excel files, whether the updates arrive in the same file or in different ones. PySpark serves as the bridge that reads the Excel data and ingests it into a PostgreSQL database, and the pipeline is automated with Apache Airflow as the scheduler.

Requirements

  • Ubuntu Server (WSL)
  • Java 8
  • Python3
  • PostgreSQL or any other open-source database
  • PostgreSQL JDBC driver 42.7.3
  • Libraries listed in requirements.txt
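
The authoritative list lives in requirements.txt; purely as an illustrative sketch for this stack (the package set and version pins here are assumptions, not the project's actual file), it might look like:

```text
# Illustrative pins only -- see the repository's requirements.txt for the real list.
pyspark==3.5.1
pandas==2.2.2
openpyxl==3.1.2
# Airflow is installed separately (see step 2 under Usage).
```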

Usage

  1. Create a Virtual Environment: Set up a virtual environment and install the necessary packages from requirements.txt.

  2. Install Airflow: Install Airflow from PyPI by following the official installation guide. In this project, Airflow logs are stored in the working directory for easier debugging.

  3. Ensure Data Source Availability: Prepare your data source. In this project, dummy data was created in Excel with created_date and updated_date columns so that updated rows can be identified on later runs (see the data-generation sketch after this list).

  4. Create ETL Script: Write a Python script that reads the Excel data, transforms it, and ingests it into PostgreSQL (a sketch follows this list).

  5. Add Debugging Column: Add a new ingestion_timestamp column to your table for debugging purposes; the ETL sketch below shows one way to populate it.

  6. Create DAG: Set up an Airflow DAG to automate the entire ETL process (see the DAG sketch at the end of this section).
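
For step 3, a minimal sketch of generating the dummy Excel data; the file name, sheet name, and the columns other than created_date and updated_date are illustrative assumptions, not the project's actual source:

```python
# generate_dummy_data.py -- hypothetical helper, assumed names throughout
from datetime import datetime

import pandas as pd

# created_date/updated_date let the ETL pick out rows changed since the last run.
rows = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["alpha", "beta", "gamma"],
        "created_date": [datetime(2024, 1, 1)] * 3,
        "updated_date": [
            datetime(2024, 1, 5),
            datetime(2024, 1, 1),
            datetime(2024, 1, 7),
        ],
    }
)

# Requires openpyxl; writes the sheet the ETL script will read.
rows.to_excel("source_data.xlsx", sheet_name="Sheet1", index=False)
```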
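For steps 4 and 5, a hedged sketch of the ETL script. Plain PySpark has no built-in Excel reader, so this version loads the file with pandas and converts it to a Spark DataFrame (the actual project may use a different bridge, such as the spark-excel package). The jar path, connection details, table name, and watermark handling are all assumptions:

```python
# etl_excel_to_postgres.py -- hedged sketch, not the project's actual script
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The JDBC jar path and connection details below are assumptions.
spark = (
    SparkSession.builder.appName("excel_ingestion")
    .config("spark.jars", "/opt/jars/postgresql-42.7.3.jar")
    .getOrCreate()
)

# PySpark cannot read .xlsx natively, so load via pandas first.
pdf = pd.read_excel("source_data.xlsx", sheet_name="Sheet1")
df = spark.createDataFrame(pdf)

# Keep only rows updated since the last run (watermark handling is simplified
# here; a real pipeline would persist the last ingested timestamp somewhere).
last_run = "2024-01-01 00:00:00"
updated = df.filter(F.col("updated_date") > F.to_timestamp(F.lit(last_run)))

# Step 5: add the ingestion_timestamp column for debugging.
updated = updated.withColumn("ingestion_timestamp", F.current_timestamp())

# Append into PostgreSQL over JDBC.
(
    updated.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/ingestion_db")
    .option("dbtable", "public.source_data")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)
```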
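For step 6, a minimal sketch of the Airflow DAG; the dag_id, schedule, and paths are assumptions, and the schedule argument assumes Airflow 2.4+ (older versions use schedule_interval):

```python
# dags/excel_ingestion_dag.py -- hedged sketch of the Airflow DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="excel_to_postgres_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # assumption: one ingestion run per day
    catchup=False,
) as dag:
    # Submit the ETL script with the PostgreSQL JDBC driver on the classpath.
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command=(
            "spark-submit --jars /opt/jars/postgresql-42.7.3.jar "
            "/opt/etl/etl_excel_to_postgres.py"
        ),
    )
```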