feat: Update training to accommodate antiSMASH and BGCFlow schema #8

Merged
merged 5 commits into from
Jul 17, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -114,3 +114,4 @@ site/
Chinook.sqlite
chroma.sqlite3
notebooks
chatbgc_env
47 changes: 42 additions & 5 deletions README.md
@@ -16,10 +16,46 @@

</p>

Ask questions about biosynthetic gene clusters in your genome dataset via LLMs using Retrieval-Augmented Generation.
Ask questions about BGCs in your genome dataset generated by [`BGCFlow`](https://github.com/NBChub/bgcflow) using Large Language Models (LLMs).

* Free software: MIT
* Documentation: <https://nbchub.github.io/chatBGC/>
This Python package uses vector-based Retrieval-Augmented Generation (RAG) to translate natural language (English) into SQL queries. It is trained to query information from `antiSMASH` and the other genome mining tools included in the [`BGCFlow`](https://github.com/NBChub/bgcflow) pipelines.

![RAG](chatbgc/assets/3_RAG.png)
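
To make the retrieval step concrete, here is a minimal, standard-library-only sketch of the general idea: find stored question/SQL pairs similar to a new question and place them in the prompt sent to the LLM. It is not chatBGC's implementation. The bag-of-words "embedding", the `TRAINING_PAIRS` list, and the table and column names (`genomes`, `regions`, `product`) are all assumptions made up for illustration; a real setup would use a proper embedding model and a vector store.

```python
import math
import re
from collections import Counter

# Hypothetical question -> SQL training pairs; the schema is invented for this sketch.
TRAINING_PAIRS = [
    ("How many genomes are in the dataset?",
     "SELECT COUNT(*) FROM genomes;"),
    ("How many BGC regions does each genome have?",
     "SELECT genome_id, COUNT(*) FROM regions GROUP BY genome_id;"),
    ("List all regions predicted as NRPS",
     "SELECT region_id FROM regions WHERE product = 'NRPS';"),
]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, k=2):
    """Return the k stored pairs whose questions are most similar to `question`."""
    query = embed(question)
    ranked = sorted(
        TRAINING_PAIRS,
        key=lambda pair: cosine(query, embed(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, k=2):
    """Assemble a few-shot prompt from the retrieved pairs plus the new question."""
    examples = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in retrieve(question, k))
    return f"{examples}\nQ: {question}\nSQL:"


if __name__ == "__main__":
    print(build_prompt("Which regions are predicted as NRPS?"))
```

The prompt produced by `build_prompt` would then be sent to the chosen LLM, which completes the final `SQL:` line.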

## Quickstart
To get started quickly with `chatBGC`, you need an [OpenAI API Key](https://platform.openai.com/api-keys) and the DuckDB database from your `BGCFlow` run (generated using `bgcflow build database`). See the [BGCFlow Wiki](https://github.com/NBChub/bgcflow/wiki/04-Building-and-Serving-OLAP-Database) for more details on creating the database.

```bash
# Setup API Key
OPENAI_API_KEY="<change this to your API Key>"

# Create new folder to set up duckdb and vector database
mkdir chatbgc
cd chatbgc

# Copy the database built using BGCFlow to this directory
BGCFLOW_DIR="<change this to your BGCFlow directory>"
PROJECT_NAME="<change this to your BGCFlow project name>"
ANTISMASH_VERSION="7.1.0" # change this to the antiSMASH version used in your BGCFlow run (only version 7.1.0 or above is supported)
# -n: do not overwrite an existing dbt_bgcflow.duckdb in this directory
cp -n "$BGCFLOW_DIR/data/processed/$PROJECT_NAME/dbt/antiSMASH_$ANTISMASH_VERSION/dbt_bgcflow.duckdb" dbt_bgcflow.duckdb

# Create python environment and install ChatBGC
python3 -m venv chatbgc_env
source chatbgc_env/bin/activate
python3 -m pip install --upgrade pip
pip install git+https://github.com/NBChub/chatBGC.git

# Set up environment variables / secrets
touch .env
echo "export OPENAI_API_KEY=$OPENAI_API_KEY" > .env
source .env

# Train ChatBGC (Do it once)
chatbgc train --llm_type openai_chat --model gpt-4o dbt_bgcflow.duckdb

# Run ChatBGC
chatbgc run --llm_type openai_chat --model gpt-4o dbt_bgcflow.duckdb
```
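
Before running `chatbgc train`, it can help to confirm that the two prerequisites from the Quickstart are in place: the API key is exported and the database file was copied. The sketch below is a hypothetical pre-flight check, not part of chatBGC; the function name `preflight` is an assumption, and it only inspects the environment and the filesystem.

```python
import os
from pathlib import Path


def preflight(db_path="dbt_bgcflow.duckdb", env=None):
    """Return a list of problems; an empty list means both checks passed."""
    env = os.environ if env is None else env
    problems = []
    # The Quickstart exports the key by sourcing .env.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set; run `source .env` first")
    # The Quickstart copies the BGCFlow database into the working directory.
    if not Path(db_path).is_file():
        problems.append(f"database file not found: {db_path}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

Run it from the same shell session and directory used for the Quickstart commands, so that it sees the same environment variables and files.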

## Configuration

@@ -106,8 +142,9 @@ chatbgc run <path_to_duckdb>
chatbgc run --llm_type openai_chat <path_to_duckdb>
```

## Development guide (TO DO)
- Create duckdb schema
## Notes
* Free software: MIT
* Documentation: <https://nbchub.github.io/chatBGC/>

## Credits

2 changes: 1 addition & 1 deletion chatbgc/__init__.py
@@ -2,4 +2,4 @@

__author__ = """Matin Nuhamunada"""
__email__ = "[email protected]"
__version__ = "0.1.3"
__version__ = "0.2.0"
Binary file added chatbgc/assets/3_RAG.png
2 changes: 1 addition & 1 deletion chatbgc/cli.py
@@ -58,7 +58,7 @@ def train(
self,
duckdb_path,
model="llama3",
training_folder=str((Path(__file__).parent / "data").resolve()),
training_folder=str((Path(__file__).parent / "data/").resolve()),
llm_type="ollama",
):
"""
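
A note on the `"data"` to `"data/"` change in this hunk: `pathlib` drops trailing slashes when joining path segments, so both spellings should produce the same default string. The sketch below illustrates this with a hypothetical helper (`default_training_folder`) and a made-up install path; it mirrors the expression in the diff but is not the project's code.

```python
from pathlib import Path, PurePosixPath


def default_training_folder(module_file, folder_name):
    """Mirror the default-argument expression from the diff for a given module path."""
    return str((Path(module_file).parent / folder_name).resolve())


# Hypothetical install location, used only for illustration.
module_file = "/opt/site-packages/chatbgc/cli.py"

old_default = default_training_folder(module_file, "data")
new_default = default_training_folder(module_file, "data/")

# pathlib normalizes away the trailing slash, so the two defaults match.
assert old_default == new_default

# The same normalization is visible on a pure path, without touching the filesystem.
assert str(PurePosixPath("/opt/site-packages/chatbgc") / "data/") == "/opt/site-packages/chatbgc/data"
```

If downstream code needs a trailing separator on the folder string, it would have to be appended after the `str(...)` conversion.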
File renamed without changes.
File renamed without changes.