An ELT framework to build a DuckDB OLAP database from a BGCFlow run using dbt.
Matin Nuhamunada, Omkar S Mohite, Patrick V Phaneuf, Bernhard O Palsson, Tilmann Weber, BGCFlow: systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets, Nucleic Acids Research, 2024, gkae314, https://doi.org/10.1093/nar/gkae314
Clone this repository into the BGCFlow project result directory, bgcflow/data/processed/<my_project>:
(cd bgcflow/data/processed/<my_project> && git clone git@github.com:matinnuhamunada/dbt_bgcflow.git)
Install using Python venv:
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
Install using mamba:
mamba env create -f env.yml
Activate the virtual environment and configure the source location by running this Python script:
project_dir="bgcflow/data/processed/<my_project>"
python $project_dir/bgcflow_dbt-duckdb/scripts/source_template.py templates/_sources.yml models/sources.yml "7.1.0" "0.30"
dbt debug
dbt build
dbt docs generate
dbt docs serve
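The dbt commands above can also be chained from a small script that stops at the first failing step. This is only a minimal sketch: the names `DBT_STEPS` and `run_steps` are hypothetical and not part of this repository, and `dbt docs serve` is left out because it starts a web server and blocks until interrupted.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# The commands from the steps above, in order (hypothetical helper constant).
# `dbt docs serve` is omitted because it blocks until interrupted.
DBT_STEPS = [
    ["dbt", "debug"],
    ["dbt", "build"],
    ["dbt", "docs", "generate"],
]


def run_steps(
    steps: Sequence[Sequence[str]] = DBT_STEPS,
    runner: Optional[Callable[[Sequence[str]], int]] = None,
) -> List[str]:
    """Run each step in order and stop at the first non-zero exit code.

    Returns the steps that completed successfully, as command strings.
    """
    if runner is None:
        # Default runner: execute the command and return its exit code.
        def runner(cmd: Sequence[str]) -> int:
            return subprocess.run(list(cmd)).returncode

    completed = []
    for cmd in steps:
        if runner(cmd) != 0:
            break  # do not continue after a failed step
        completed.append(" ".join(cmd))
    return completed
```

Calling `run_steps()` from the dbt project directory, with the virtual environment active, runs the three commands in sequence. The `runner` argument exists only so the logic can be exercised without dbt installed.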
Newer versions of DuckDB are currently not backward compatible. To migrate the data to a newer version, use the script export_duckdb.py:
$ python scripts/export_duckdb.py -h
usage: export_duckdb.py [-h] [--database_filename DATABASE_FILENAME] [--export_directory EXPORT_DIRECTORY]
Export a DuckDB database.
options:
-h, --help show this help message and exit
--database_filename DATABASE_FILENAME
The filename of the DuckDB database to export.
--export_directory EXPORT_DIRECTORY
The directory to save the exported database.
--format {parquet,csv}
The format to export the database in.
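For orientation, an export of this kind can be sketched with DuckDB's `EXPORT DATABASE` statement. The code below is not the content of export_duckdb.py; it is an illustrative sketch that mirrors the options in the help text. The function names and the `parquet` default for `--format` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the options listed in the help text above."""
    parser = argparse.ArgumentParser(description="Export a DuckDB database.")
    parser.add_argument("--database_filename",
                        help="The filename of the DuckDB database to export.")
    parser.add_argument("--export_directory",
                        help="The directory to save the exported database.")
    # The default value is an assumption; the help text does not state one.
    parser.add_argument("--format", choices=["parquet", "csv"], default="parquet",
                        help="The format to export the database in.")
    return parser


def export_statement(export_directory: str, fmt: str) -> str:
    """Build the EXPORT DATABASE statement for the chosen format."""
    if fmt not in ("parquet", "csv"):
        raise ValueError(f"unsupported format: {fmt}")
    # Double any single quotes so the path stays a valid SQL string literal.
    target = export_directory.replace("'", "''")
    return f"EXPORT DATABASE '{target}' (FORMAT {fmt.upper()});"


def export_database(database_filename: str, export_directory: str,
                    fmt: str = "parquet") -> None:
    """Open the database and write its schema and data to export_directory."""
    import duckdb  # third-party; imported lazily so the helpers above work without it

    con = duckdb.connect(database_filename)
    try:
        con.execute(export_statement(export_directory, fmt))
    finally:
        con.close()
```

Run the export with the DuckDB version that created the database. The exported directory can then be loaded under the newer version with DuckDB's `IMPORT DATABASE` statement.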
bash scripts/migrate_postgres_workflow.sh
This dbt template was adapted from the jaffle_shop_duckdb example.