Retrieval augmented generation (RAG) demos with DeepSeek, Qwen, Aya-Expanse, Mistral, Gemma, Llama, Phi
The demos use quantized models and run on CPU with acceptable inference time. They can run offline without Internet access, thus allowing deployment in an air-gapped environment.
The demos also allow users to:
- apply a propositionizer to document chunks
- perform reranking upon retrieval
- perform hypothetical document embedding (HyDE), sketched below
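As an illustration, the HyDE step boils down to asking the LLM for a hypothetical answer and embedding that instead of the raw query. A minimal sketch, assuming llama-cpp-python and sentence-transformers (the model path and names below are placeholders; the demo's actual implementation may differ):

```python
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

# Placeholder model paths/names; substitute whatever is configured in config.yaml
llm = Llama(model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf", n_ctx=2048)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def hyde_embedding(query: str):
    # 1. Ask the LLM to write a hypothetical passage that answers the query
    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
        max_tokens=256,
    )
    hypothetical_doc = out["choices"][0]["message"]["content"]
    # 2. Embed the hypothetical passage instead of the raw query; this vector
    #    is then used for similarity search against the vectorstore
    return embedder.encode(hypothetical_doc)
```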
You will need to set up your development environment using conda.
conda create --name rag python=3.11
conda activate rag
pip install -r requirements.txt
We shall use unstructured to process PDFs. Refer to its Installation Instructions for Local Development.
You would also need to download punkt_tab and averaged_perceptron_tagger_eng from nltk:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
Note that we shall only use strategy="fast" in this demo. Extraction of tables from PDFs is a work in progress.
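For reference, parsing a PDF with unstructured's fast strategy looks roughly like this (the file name is a placeholder):

```python
from unstructured.partition.pdf import partition_pdf

# "fast" pulls text straight from the PDF without OCR or layout-detection models
elements = partition_pdf(filename="sample.pdf", strategy="fast")
for el in elements[:5]:
    print(type(el).__name__, "->", el.text[:80])
```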
Activate the environment.
conda activate rag
Download and save the models in ./models and update config.yaml. The models used in this demo are:
- Embeddings
- Rerankers
- Propositionizer
- LLMs:
  - unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF
  - Qwen/Qwen2.5-3B-Instruct-GGUF
  - bartowski/aya-expanse-8b-GGUF
  - bartowski/Llama-3.2-3B-Instruct-GGUF
  - allenai/OLMoE-1B-7B-0924-Instruct-GGUF
  - bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
  - microsoft/Phi-3-mini-4k-instruct-gguf
  - QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
  - lmstudio-ai/gemma-2b-it-GGUF
  - TheBloke/zephyr-7B-beta-GGUF
  - TheBloke/Mistral-7B-Instruct-v0.2-GGUF
  - TheBloke/Llama-2-7B-Chat-GGUF
The LLMs can be loaded directly in the app, or they can be first deployed with Ollama.
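If you take the Ollama route, the app talks to the local Ollama server instead of loading the GGUF itself. A minimal sketch with the ollama Python client, assuming the server is running and a model has already been pulled (the model name is a placeholder):

```python
import ollama

# Assumes `ollama serve` is running and `ollama pull llama3.2` was done beforehand
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "What does a reranker do in a RAG pipeline?"}],
)
print(response["message"]["content"])
```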
Since each model type has its own prompt format, include the format in ./src/prompt_templates.py.
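For example, Mistral-Instruct and Llama-3-Instruct expect different chat markup, so each model family needs its own entry. A rough illustration (the actual template strings and variable names in ./src/prompt_templates.py may differ):

```python
# Rough illustration of per-family prompt formats; the {system} and {question}
# placeholders are filled in at query time.
MISTRAL_TEMPLATE = "<s>[INST] {system}\n\n{question} [/INST]"

LLAMA3_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
```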
We shall use Phoenix for LLM tracing. Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting. Before running the app, start a Phoenix server:
python3 -m phoenix.server.main serve
The traces can be viewed at http://localhost:6006.
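Alternatively, the Phoenix server can be launched from Python, which is handy inside notebooks; a minimal sketch:

```python
import phoenix as px

# Launches a local Phoenix instance and prints the URL where traces will appear
session = px.launch_app()
print(session.url)
```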
We use Streamlit as the interface for the demos. There are three demos:
- Conversational Retrieval QA
streamlit run app_conv.py
- Retrieval QA
streamlit run app_qa.py
- Conversational Retrieval QA using ReAct
Create the vectorstore first and update config.yaml:
python -m vectorize --filepaths <your-filepath>
Then run the app:
streamlit run app_react.py
To get started, upload a PDF and click on Build VectorDB. Creating the vector DB will take a while.
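Under the hood, building the vector DB (whether via Build VectorDB in the UI or python -m vectorize) amounts to chunking the documents, embedding the chunks, and persisting a vector index. A rough sketch of that kind of pipeline using LangChain with FAISS; the actual store, embedding model, and chunking settings used by the demo may differ:

```python
from unstructured.partition.pdf import partition_pdf
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder file and model names; the demo reads its settings from config.yaml
elements = partition_pdf(filename="sample.pdf", strategy="fast")
text = "\n\n".join(el.text for el in elements if el.text)

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(text)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectordb = FAISS.from_texts(chunks, embeddings)
vectordb.save_local("./vectorstore")
```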