A pipeline to pull data from S3 and process it using Polars, Delta-RS and DuckDB
With the new release of Polars 0.19.x, I decided to update the ETL pipeline from DuckDB - Polars - PyArrow - Deltalake to solely PyArrow - Polars - Deltalake. Importing and exporting Parquet files with `pyarrow.parquet` instead of DuckDB reduces the runtime.
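For illustration, the import side looks roughly like the sketch below. This is not the exact code from `polars-ingestion`; the bucket, key and region are placeholders, and credentials are assumed to be available in the environment.

```python
import pyarrow.parquet as pq
from pyarrow import fs

# credentials and region come from the environment if not passed explicitly;
# the region value here is a placeholder
s3 = fs.S3FileSystem(region="ap-southeast-1")

# read the Parquet dataset from S3 straight into an Arrow table
# (bucket and key are placeholders)
table = pq.read_table("my-bucket/retail/data.parquet", filesystem=s3)
print(table.num_rows)
```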
Two new files were added to the `src` folder: `polars-ingestion` and `polars-run`. The old ETL pipeline was renamed to `duckdb-ingestion` and `duckdb-run`.
In this DuckDB pipeline, we use `polars`, `duckdb` and `pyarrow` as the main data stack, with `delta-rs` providing the table format. AWS S3 serves as object storage for our dataset. The combined stack is a performant option for low-latency ETL on small to medium-sized datasets. The pipeline performs the following steps (a condensed code sketch follows the list):
- Reading the Parquet data from S3 using `duckdb`.
- Transforming the data to a DeltaTable using `pyarrow`.
- Performing `compact` and `z-order` optimization using `delta-rs`.
- Using `polars` to scan the Delta table and run the analysis.
- Writing the result file (in Parquet format) back to S3.
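Here is a condensed sketch of the DuckDB pipeline's steps. It is not the repository's exact code: the local Delta path, the S3 URI and the column names are placeholders, and S3 credentials are assumed to be configured already.

```python
import duckdb
import polars as pl
from deltalake import DeltaTable, write_deltalake

# 1. read the Parquet data from S3 with DuckDB (httpfs handles s3:// paths)
conn = duckdb.connect()
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
arrow_table = conn.execute(
    "SELECT * FROM read_parquet('s3://my-bucket/retail/data.parquet')"
).arrow()

# 2. materialize the Arrow table as a Delta table on local disk
write_deltalake("data/delta/retail", arrow_table, mode="overwrite")

# 3. optimize with delta-rs: compact small files, then z-order on a key column
dt = DeltaTable("data/delta/retail")
dt.optimize.compact()
dt.optimize.z_order(["InvoiceDate"])

# 4. scan the Delta table lazily with Polars and run an aggregation
result = (
    pl.scan_delta("data/delta/retail")
    .group_by("Country")
    .agg(pl.col("Quantity").sum().alias("TotalQuantity"))
    .collect()
)

# 5. write the result out as Parquet (the real pipeline ships this file back to S3)
result.write_parquet("result.parquet")
```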
The new Polars pipeline is very similar to the DuckDB pipeline, with `pyarrow.parquet` replacing `duckdb` for importing and exporting Parquet files to and from AWS S3. This improves the runtime of the pipeline by **roughly 20-30 seconds (on my system)**.
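The export side looks roughly like the sketch below. Again this is a sketch rather than the repository's code: the bucket and key are placeholders, and the analysis result is assumed to be a Polars DataFrame.

```python
import polars as pl
import pyarrow.parquet as pq
from pyarrow import fs

# `result` stands in for the DataFrame produced by the analysis step
result = pl.DataFrame({"Country": ["United Kingdom"], "TotalQuantity": [1000]})

# credentials and region are read from the environment unless passed explicitly
s3 = fs.S3FileSystem()

# convert to an Arrow table and write it back to S3 (bucket/key are placeholders)
pq.write_table(result.to_arrow(), "my-bucket/results/summary.parquet", filesystem=s3)
```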
- Delta-RS, written in Rust with Python bindings, provides low-level access to Delta tables that can be used in data processing frameworks.
- DuckDB is an open-source, in-process SQL OLAP database management system which has attracted a lot of attention recently. DuckDB is designed to run complex SQL queries within other processes.
- At the core of Apache Arrow is an in-memory data format optimized for analytical processing. Its integration with DuckDB provides zero-copy data streaming to many formats and interchange between libraries in various languages (a short example follows this list).
- Polars is a highly performant DataFrame library for manipulating structured data. The core is written in Rust, but the library is also available in Python. Polars comes with a vectorized query engine built for efficient data processing.
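To make the zero-copy interchange and the lazy query engine concrete, here is a small self-contained example (local data only, no S3 involved):

```python
import duckdb
import polars as pl

# run a query in DuckDB and fetch the result as an Arrow table
arrow_table = duckdb.sql("SELECT range AS id FROM range(5)").arrow()

# Polars wraps the Arrow buffers directly, so the hand-off is effectively zero-copy
df = pl.from_arrow(arrow_table)

# the lazy, vectorized query engine optimizes the plan before executing it
out = df.lazy().filter(pl.col("id") >= 2).select(pl.col("id") * 10).collect()
print(out)
```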
This project uses Python 3.10.10 and `poetry` for package management.
To run the pipeline, you need to provide a `.env` file (located in the root folder) looking like this:

```
S3_BUCKET=
LOCAL_FILE_PATH=
AWS_DEFAULT_REGION=
AWS_SECRET_ACCESS_KEY=
AWS_ACCESS_KEY_ID=
```
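How these variables are consumed depends on the pipeline code; the following is a minimal sketch, assuming `python-dotenv` loads the file and the values are passed to delta-rs/Polars as `storage_options`:

```python
import os

from dotenv import load_dotenv  # python-dotenv; the repository may load the file differently

# read the .env file in the project root into the process environment
load_dotenv()

bucket = os.environ["S3_BUCKET"]

# key names follow the storage_options convention understood by delta-rs and Polars
storage_options = {
    "AWS_REGION": os.environ["AWS_DEFAULT_REGION"],
    "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
    "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
}
```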
The dataset contains retail data and can be downloaded from the link. The file is stored in CSV format; to transform it to Parquet, you can follow this snippet:
```python
import duckdb

conn = duckdb.connect()
conn.execute(
    """
    COPY (SELECT * FROM read_csv_auto(
        'link-to-download-file'))
    TO 'data.parquet' (FORMAT 'parquet');
    """
)
```
To replicate, you can clone this project and run these commands:

```bash
git clone https://github.com/andreale28/Polars-Analysis.git
cd Polars-Analysis

# install poetry
curl -sSL https://install.python-poetry.org | python3 -

# install dependencies
poetry install

# run pipeline
python3 -m duckdb_run
```
To run the Docker container, build from the Dockerfile and follow these commands:

```bash
docker build . -t polars-analysis
docker run --entrypoint /bin/bash -it polars-analysis

# inside the container
cd script
python -m main
```
Contributions are welcome. If you would like to contribute, you can help by opening a GitHub issue or putting up a PR.