An ELT framework based on DBT and DuckDB to process BGCFlow output tables

NBChub/bgcflow_dbt-duckdb

BGCFlow dbt-duckdb implementation

An ELT framework that uses dbt to build a DuckDB OLAP database from a BGCFlow run.

Publication

Matin Nuhamunada, Omkar S Mohite, Patrick V Phaneuf, Bernhard O Palsson, Tilmann Weber, BGCFlow: systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets, Nucleic Acids Research, 2024, gkae314, https://doi.org/10.1093/nar/gkae314

Usage

Clone

Clone this repository into the results directory of your BGCFlow project, bgcflow/data/processed/<my_project>:

(cd bgcflow/data/processed/<my_project> && git clone git@github.com:matinnuhamunada/dbt_bgcflow.git)

Install dependencies

Install using Python venv:

python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt

Or install using mamba:

mamba env create -f env.yml
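
After either install route, a quick check can confirm that the key packages are visible from the active environment. The sketch below assumes the environment provides modules importable as `dbt` and `duckdb`; that is a guess based on the project name, not something confirmed by requirements.txt or env.yml here. `missing_modules` is a hypothetical helper, not part of this repository.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names; adjust to whatever requirements.txt or env.yml installs.
    missing = missing_modules(["dbt", "duckdb"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("Environment looks ready.")
```

If anything is reported missing, re-run the install step inside the activated environment before continuing.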

Configure source location

Activate the virtual environment, then configure the source location by running this Python script:

project_dir="bgcflow/data/processed/<my_project>"
python $project_dir/bgcflow_dbt-duckdb/scripts/source_template.py templates/_sources.yml models/sources.yml "7.1.0" "0.30"

Run DBT

dbt debug
dbt build
dbt docs generate
dbt docs serve
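
If the steps above need to run unattended, they can be chained from Python so that a failing step stops the sequence. This is a minimal sketch, not part of this repository: `DBT_STEPS` and `run_steps` are hypothetical names, and `dbt docs serve` is left out because it starts a long-running web server.

```python
import subprocess

# The dbt commands from this section, in the order they are listed.
DBT_STEPS = [
    ["dbt", "debug"],
    ["dbt", "build"],
    ["dbt", "docs", "generate"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            break
        completed.append(cmd)
    return completed
```

Calling `run_steps(DBT_STEPS)` from the project directory, with the environment activated, runs the three commands in turn. The `runner` parameter exists only so the sequence can be exercised without invoking dbt.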

Exporting to newer version of DuckDB

Newer versions of DuckDB are currently not backward compatible with database files created by older versions. To migrate the data to a newer version, use the script export_duckdb.py:

$ python scripts/export_duckdb.py -h
usage: export_duckdb.py [-h] [--database_filename DATABASE_FILENAME] [--export_directory EXPORT_DIRECTORY]

Export a DuckDB database.

options:
  -h, --help            show this help message and exit
  --database_filename DATABASE_FILENAME
                        The filename of the DuckDB database to export.
  --export_directory EXPORT_DIRECTORY
                        The directory to save the exported database.
  --format {parquet,csv}
                        The format to export the database in.
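
For orientation, one way such an export can be done is with DuckDB's `EXPORT DATABASE` statement, which writes the schema and every table to a directory in Parquet or CSV format. The sketch below is an illustration under that assumption and is not the code of export_duckdb.py; `export_statement` and `export_database` are hypothetical names.

```python
def export_statement(export_directory, fmt="parquet"):
    """Build a DuckDB EXPORT DATABASE statement for a directory and format."""
    if fmt not in ("parquet", "csv"):
        raise ValueError(f"unsupported format: {fmt}")
    # Double any single quotes so the path is a valid SQL string literal.
    escaped = export_directory.replace("'", "''")
    return f"EXPORT DATABASE '{escaped}' (FORMAT {fmt.upper()})"


def export_database(database_filename, export_directory, fmt="parquet"):
    """Export all tables of a DuckDB file; run this with the DuckDB version that wrote it."""
    import duckdb  # imported here so export_statement works without duckdb installed

    con = duckdb.connect(database_filename)
    try:
        con.execute(export_statement(export_directory, fmt))
    finally:
        con.close()
```

After upgrading DuckDB, the exported directory can be loaded into a fresh database file with the matching `IMPORT DATABASE '<export_directory>'` statement.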

[WIP] Migrating to PostgreSQL

bash scripts/migrate_postgres_workflow.sh

Credits

This dbt template was adapted from the jaffle_shop_duckdb example.
