The CLI tool provides functionality for running a RAG (Retrieval-Augmented Generation) API pipeline. It offers various commands to execute different stages of the pipeline, from data extraction to embedding generation.
This project uses Poetry for dependency management. To install the project and all its dependencies:
- Ensure you have Poetry installed. If not, install it by following the instructions here
- Clone the repository:
git clone https://github.com/raid-guild/gaianet-rag-api-pipeline cd gaianet-rag-api-pipeline
- Install dependencies:
- Using
pip
(production-mode):pip install -e .
- Using
Poetry
(development-mode):poetry install
- Using
- Check the
rag-api-pipeline
CLI is available:- Production mode:
rag-api-pipeline --help
- Development mode:
poetry run rag-api-pipeline --help
- Production mode:
(venv) user$ rag-api-pipeline --help
Usage: rag-api-pipeline [OPTIONS] COMMAND [ARGS]...
Command-line interface (CLI) for the RAG API pipeline.
Options:
--help Show this message and exit.
Commands:
setup Setup wizard to config the pipeline settings prior execution
run Execute RAG API pipeline tasks.
(venv) user$ rag-api-pipeline setup --help
Usage: rag-api-pipeline setup [OPTIONS]
Setup wizard to config the pipeline settings prior execution
Options:
--debug enable logging debug level
--llm-provider [gaia|other] LLM provider [default: gaia]
--help Show this message and exit.
--debug
: Enable logging at debug level. Useful for development purposes--llm-provider [gaia|other]
: Embedding model provider (default: gaia). Check the Other LLM Providers section for details on supported LLM engines.
(venv) user$ rag-api-pipeline run --help
Usage: rag-api-pipeline run [OPTIONS] COMMAND [ARGS]...
Execute RAG API pipeline tasks.
Options:
--debug enable logging debug level
--help Show this message and exit.
Commands:
all Run the complete RAG API Pipeline.
from-chunked Execute the RAG API pipeline from chunked data.
from-normalized Execute the RAG API pipeline from normalized data.
--debug
: Enable logging at debug level. Useful for development purposes
Prior executing any of the commands listed below, it validates that the the pipeline is already setup (config/.env
file exists) and
the required services (LLM Provider + QdrantDB) are up and running. Otherwise, you will receive one of the following error messages:
ERROR: config/.env file not found. rag-api-pipeline setup should be run first.
ERROR: LLM Provider openai (http://localhost:8080/v1/models) is down.
ERROR: QdrantDB (http://localhost:6333) is down.
Executes the entire RAG pipeline including API endpoint data streams, data normalization, data chunking, vector embeddings and database snapshot generation.
(venv) user$ rag-api-pipeline run all --help
Usage: rag-api-pipeline run all [OPTIONS] API_MANIFEST_FILE OPENAPI_SPEC_FILE
Run the complete RAG API Pipeline.
API_MANIFEST_FILE Path to the API manifest YAML file that defines pipeline
config settings and API endpoints.
OPENAPI_SPEC_FILE Path to the OpenAPI YAML specification file.
Options:
--api-key TEXT API Auth key
--source-manifest-file PATH Source YAML manifest
--full-refresh Clean up cache and extract API data from
scratch
--normalized-only Run pipeline until the normalized data stage
--chunked-only Run pipeline until the chunked data stage
--help Show this message and exit.
API_MANIFEST_FILE
: Pipeline YAML manifest that defines the Pipeline config settings and API endpoints to extract.OPENAPI_SPEC_FILE
: OpenAPI YAML spec file.
--api-key TEXT
: API Auth key. If specified, it overrides the pipeline settings.--source-manifest-file FILE
: Airbyte's Source Connector YAML manifest. If specified, it skips the connector manifest generation stage and will use this file to start the API data extraction.--full-refresh
: Clean up cache and extract API data from scratch--normalized-only
: Run the pipeline process until the normalized data stage--chunked-only
: Run the pipeline process until the chunked data stage
Executes the RAG data pipeline using an already normalized JSONL dataset.
(venv) user$ rag-api-pipeline run from-normalized --help
Usage: rag-api-pipeline run from-normalized [OPTIONS] API_MANIFEST_FILE
Execute the RAG API pipeline from normalized data.
API_MANIFEST_FILE Path to the API manifest YAML file that defines pipeline
config settings and API endpoints.
Options:
--normalized-data-file PATH Normalized data in JSONL format [required]
--help Show this message and exit.
API_MANIFEST_FILE
: Pipeline YAML manifest that defines the Pipeline config settings.
--normalized-data-file FILE
(required): Normalized data in JSONL format.
Executes the RAG data pipeline using an already chunked dataset in JSONL format.
(venv) user$ rag-api-pipeline run from-chunked --help
Usage: rag-api-pipeline run from-chunked [OPTIONS] API_MANIFEST_FILE
Execute the RAG API pipeline from chunked data.
API_MANIFEST_FILE Path to the API manifest YAML file that defines pipeline
config settings and API endpoints.
Options:
--chunked-data-file PATH Chunked data in JSONL format [required]
--help Show this message and exit.
API_MANIFEST_FILE
: Pipeline YAML manifest that defines the Pipeline config settings and API endpoints to extract.
--chunked-data-file FILE
(required): Chunked data in JSONL format.
- Clean up cachen and run the complete pipeline:
rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml --full-refresh
- Run the pipeline and stop executing after data normalization:
rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml --normalized-only
- Run the pipeline from normalized data:
rag-api-pipeline run from-normalized config/boardroom_api_pipeline.yaml --normalized-data-file path/to/normalized_data.jsonl
- Run the pipeline from chunked data:
rag-api-pipeline run from-chunked config/boardroom_api_pipeline.yaml --chunked-data-file path/to/chunked_data.jsonl
Cached API stream data and results produced from running any of the rag-api-pipeline run
commands are stored in the output/<api_name>
folder.
The following files and folders are created by the tool within this baseDir
folder:
{baseDir}/{api_name}_source_generated.yaml
: generated Airbyte Stream connector manifest.{baseDir}/cache/{api_name}/*
: extracted API data is cached into a local DuckDB. Database files are stored in this directory. If the--full-refresh
argument is specified on therun all
command, the cache will be cleared and API data will be extracted from scratch.{baseDir}/{api_name}_stream_{x}_preprocessed.jsonl
: data streams from each API endpoint are preprocessed and stored in JSONL format.{baseDir}/{api_name}_normalized.jsonl
: preprocessed data streams from each API endpoint are joined together and stored in JSONL format.{baseDir}/{api_name}_chunked.jsonl
: normalized data that goes through the data chunking stage is then stored in JSONL format.{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot
: vector embeddings snapshot file that was exported from Qdrant DB.{baseDir}/{api_name}_collection-xxxxxxxxxxxxxxxx-yyyy-mm-dd-hh-mm-ss.snapshot.tar.gz
: compressed knowledge base that contains the vector embeddings snapshot.