rag-api-pipeline is a Python-based data pipeline tool that lets you generate a vector knowledge base from any REST API data source. The resulting database snapshot can then be plugged into a Gaia node's LLM, together with a prompt, to provide contextual responses to user queries using RAG (Retrieval-Augmented Generation).
The following sections help you quickly set up and execute the pipeline on your REST API. If you're looking for more in-depth information about how to use this tool, its tech stack, or how it works under the hood, check the official documentation site.
- Python 3.11.x
- Poetry (Docs)
- (Optional): a Python virtual environment manager of your preference (e.g. conda, venv)
- Qdrant vector database (Docs)
- (Optional): Docker to spin up a local container
- LLM model provider (spin up your own Gaia node or pick one from the Gaia public network)
- An Embeddings model (e.g. Nomic-embed-text-v1.5)
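Before installing, it can help to confirm that the local environment matches the list above. The following is a minimal sketch, not part of this repository: the helper name `check_prerequisites` is hypothetical, and it only checks the interpreter version and whether the `poetry` and `docker` executables are on your PATH. It does not verify that a Qdrant instance or a Gaia node is reachable.

```python
import shutil
import sys


def check_prerequisites(tools=("poetry", "docker")):
    """Report whether the interpreter is Python 3.11 and which tools are on PATH.

    Hypothetical helper for illustration; it is not shipped with rag-api-pipeline.
    """
    report = {"python_3_11": sys.version_info[:2] == (3, 11)}
    for tool in tools:
        # shutil.which returns None when the executable is not found on PATH
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Docker is optional, so a "missing" result for it only matters if you plan to run Qdrant in a local container.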
Git clone or download this repository to your local machine.
git clone https://github.com/raid-guild/gaianet-rag-api-pipeline.git
It is recommended to activate your own virtual environment.
Then navigate to the directory where the repository was cloned or downloaded and run the following commands to install the rag-api-pipeline CLI:
cd gaianet-rag-api-pipeline
pip install -e .
Run the following command to start the pipeline setup wizard. You can use the default configuration settings or customize them for your specific needs:
rag-api-pipeline setup
Check the Setup CLI Reference page for more details.
A quick demo that extracts data from the Boardroom API can be executed by running the following command:
rag-api-pipeline run all config/boardroom_api_pipeline.yaml config/boardroom_openapi.yaml
You are required to specify two main arguments to the pipeline:
- The path to the OpenAPI specification file (e.g. config/boardroom_openapi.yaml): the OpenAPI spec for the REST API data source you're looking to extract data from.
- The path to the API pipeline manifest file (e.g. config/boardroom_api_pipeline.yaml): a YAML file that defines the API endpoints you're looking to extract data from, among other parameters (more details in the next section).
Once the pipeline execution is completed, you'll find the vector database snapshot and the extracted/processed datasets under the output/molochdao_boardroom_api folder.
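To see what the demo run produced, you can list the contents of the output folder. This is a small sketch using only the standard library. The helper name `list_pipeline_outputs` is hypothetical, and it makes no assumption about the names or formats of the generated files; it simply walks whatever is there.

```python
from pathlib import Path


def list_pipeline_outputs(output_dir):
    """Return (relative path, size in bytes) for every file under output_dir.

    Returns an empty list if the folder does not exist yet.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # Folder name taken from the Boardroom demo described above
    for rel_path, size in list_pipeline_outputs("output/molochdao_boardroom_api"):
        print(f"{size:>12}  {rel_path}")
```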
Now it's time to define the pipeline manifest for the REST API you're looking to extract data from. Make sure you get the OpenAPI specification for the API you're targeting. Check the Defining an API Pipeline Manifest page for details on how to get the OpenAPI spec and define an API pipeline manifest, or take a look at the in-depth review of the sample manifests available in the API Examples folder.
Once you have both the API pipeline manifest and OpenAPI spec files, you're ready to start using the rag-api-pipeline run command to execute different tasks of the RAG pipeline, from extracting data from an API source to generating vector embeddings and a database snapshot. If you need more details about the parameters available for each task, run:
rag-api-pipeline run <command> --help
Below is the list of available commands. Check the CLI Reference documentation for more details:
# run the entire pipeline
rag-api-pipeline run all <API_MANIFEST_FILE> <OPENAPI_SPEC_FILE> [--full-refresh]
# or run using an already normalized dataset
rag-api-pipeline run from-normalized <API_MANIFEST_FILE> --normalized-data-file <jsonl-file>
# or run using an already chunked dataset
rag-api-pipeline run from-chunked <API_MANIFEST_FILE> --chunked-data-file <jsonl-file>
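If you want to trigger the full pipeline from a script rather than a terminal, one option is to wrap the CLI with `subprocess`. The sketch below is an assumption about how you might do that, not an API provided by this project: `build_run_all_command` and `run_all` are hypothetical helpers, and they only use the `run all` arguments and the `--full-refresh` flag shown above.

```python
import subprocess
from pathlib import Path


def build_run_all_command(manifest, openapi_spec, full_refresh=False):
    """Build the argument list for `rag-api-pipeline run all`."""
    cmd = ["rag-api-pipeline", "run", "all", str(manifest), str(openapi_spec)]
    if full_refresh:
        cmd.append("--full-refresh")
    return cmd


def run_all(manifest, openapi_spec, full_refresh=False):
    """Check that both input files exist, run the CLI, and return its exit code."""
    for path in (manifest, openapi_spec):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Input file not found: {path}")
    cmd = build_run_all_command(manifest, openapi_spec, full_refresh)
    # The CLI's own output goes straight to the terminal
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    exit_code = run_all(
        "config/boardroom_api_pipeline.yaml",
        "config/boardroom_openapi.yaml",
    )
    print(f"rag-api-pipeline exited with code {exit_code}")
```

Checking the file paths up front gives a clearer error than letting the CLI fail partway through. Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.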
- If you hit an error while installing the missing pillow-heif module, set the following flag first:
export CFLAGS="-Wno-nullability-completeness"
- Libraries required to get libmagic working (macOS):
brew install libmagic
pip install python-magic-bin
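After applying the fixes above, a quick import check can confirm whether the modules now load. This is a hedged sketch: `can_import` is a hypothetical helper, and it assumes the usual import names `magic` (for the libmagic Python bindings) and `pillow_heif` (for pillow-heif). Importing `magic` typically fails when the underlying libmagic library cannot be found, which is why the check performs a real import.

```python
import importlib


def can_import(name):
    """Return True if the module imports cleanly, False otherwise."""
    try:
        importlib.import_module(name)
    except (ImportError, OSError):
        # ImportError covers a missing module or a missing shared library;
        # OSError covers failures while loading a native library.
        return False
    return True


if __name__ == "__main__":
    # Import names are assumptions; adjust them if your environment differs
    for module_name in ("magic", "pillow_heif"):
        status = "ok" if can_import(module_name) else "not importable"
        print(f"{module_name}: {status}")
```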
This project uses the Vocs framework to generate the documentation site. If you want to run it locally and contribute, run the following commands:
pnpm install
pnpm run dev
To publish any updates to GitHub Pages, build and deploy the updated documentation by running the following commands:
pnpm run build
pnpm run deploy
🛠️ Built 🛠️ with ❤️ by RaidGuild