This is a playground for using ColBERT. It does not use RAGatouille.
The problems with RAGatouille are:
- It does not expose all the ColBERT configurations.
- It is difficult to integrate directly with LangChain's Embeddings class.
The latest version, colbert-ai==0.2.19, or one of its dependencies requires pyarrow==14.0.0.
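The version pins above can be checked against the active environment before running anything. This is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the note above.
PINS = {"colbert-ai": "0.2.19", "pyarrow": "14.0.0"}


def check_pins(pins=PINS):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in check_pins().items():
        print(f"{name}: expected {PINS[name]}, found {found}")
```

An empty result means both pins are satisfied.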
Install faiss-gpu on CUDA.
The code in this folder includes:
- A ColBERT Embedding class
- Astra loader
- Astra vector based retriever, a LangChain compatible retriever
- It runs on CPU and GPU/CUDA (automatically uses all available GPUs).
- A chatbot example of RAG using ColBERT embeddings, the Astra DB vector store, and the retriever (including a default ranker).
How to run the example and prerequisites:
- Specify the directory of PDF files
- Create an Astra DB keyspace and specify the keyspace name in the example code
- Download the Secure Connect Bundle and specify its path in the example code
- Create an AstraCS token and export it as ASTRA_TOKEN
cd webserver
poetry install
poetry shell
cd webserver
python example.py
- A web server for the embedding service
- A Dockerfile for the web embedding service
- Indexing and encoding examples to test on GPU.
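The prerequisites for the chatbot example (a PDF directory, the Secure Connect Bundle, and ASTRA_TOKEN) could be verified before running example.py. The following is a minimal sketch, not part of this repository: the function name `check_prerequisites` is hypothetical, and the keyspace is not checked because that would need a live connection.

```python
import os
from pathlib import Path


def check_prerequisites(pdf_dir, bundle_path, env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    pdf_dir = Path(pdf_dir)
    if not pdf_dir.is_dir():
        problems.append(f"PDF directory not found: {pdf_dir}")
    elif not any(pdf_dir.rglob("*.pdf")):
        problems.append(f"No PDF files under: {pdf_dir}")

    if not Path(bundle_path).is_file():
        problems.append(f"Secure Connect Bundle not found: {bundle_path}")

    token = env.get("ASTRA_TOKEN", "")
    if not token:
        problems.append("ASTRA_TOKEN is not set")
    elif not token.startswith("AstraCS:"):
        problems.append("ASTRA_TOKEN does not look like an AstraCS token")

    return problems


if __name__ == "__main__":
    # Paths below mirror the ones used in the example code.
    for problem in check_prerequisites("./files", "./secure-connect-mingv1.zip"):
        print("Missing prerequisite:", problem)
```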
Load, split and prepare the documents
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyPDFLoader
import os
# pip install pypdf
loader = DirectoryLoader(
    path="./files",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    recursive=True,
)
docs = loader.load()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,  # in characters; colbert doc_maxlen is 220 (tokens)
    chunk_overlap=100,
    length_function=len,
)
splits = text_splitter.split_documents(docs)
title = docs[0].metadata['source']
# collect the plain text of every chunk for embedding
collections = []
for part in splits:
    collections.append(part.page_content)
from embedding import ColbertTokenEmbeddings
colbert = ColbertTokenEmbeddings(
    doc_maxlen=220,
    nbits=1,
    kmeans_niters=4,
    nranks=1,
)
passageEmbeddings = colbert.embed_documents(texts=collections, title=title)
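To make the splitter settings above concrete (separator "\n", chunk_size=500, chunk_overlap=100, lengths counted in characters), here is a simplified pure-Python illustration of packing lines into overlapping chunks. It is a sketch of the general idea only, not LangChain's actual implementation, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, separator="\n", chunk_size=500, chunk_overlap=100):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, repeating trailing pieces as overlap.
    A single piece longer than chunk_size is emitted as its own chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail fits in
            # chunk_overlap and still leaves room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


if __name__ == "__main__":
    sample = "\n".join(f"line{i:02d}" + "x" * 34 for i in range(10))
    for chunk in split_with_overlap(sample, chunk_size=100, chunk_overlap=45):
        print(repr(chunk[:6]), len(chunk))
```

Because chunk_size is measured in characters while doc_maxlen is measured in tokens, a 500-character chunk may still be truncated by ColBERT if it tokenizes to more than 220 tokens.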
Create tables and load embeddings
from embedding import AstraDB
import os
# astra db
astra = AstraDB(
    secure_connect_bundle="./secure-connect-mingv1.zip",
    astra_token=os.getenv("ASTRA_TOKEN"),
    keyspace="colbert128",
)
from embedding import ColbertAstraRetriever
retriever = ColbertAstraRetriever(astraDB=astra, colbertEmbeddings=colbert)
answers = retriever.retrieve("what's the toll free number to call for help?")
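For background on what a ColBERT-style ranker computes: ColBERT scores a passage by late interaction (MaxSim), summing, for each query token vector, its highest similarity to any passage token vector. The sketch below shows that scoring in plain Python on toy vectors. Whether the default ranker in ColbertAstraRetriever works exactly this way is an assumption, and the names `maxsim_score` and `rank` are hypothetical.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product with any passage token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, passages):
    """Rank passages (a dict of id -> list of token vectors) by MaxSim, best first."""
    scored = [(maxsim_score(query_vecs, vecs), pid) for pid, vecs in passages.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Toy 2-dimensional token vectors, for illustration only.
    query = [[1, 0], [0, 1]]
    passages = {"A": [[1, 0], [0, 1]], "B": [[1, 0]]}
    print(rank(query, passages))
```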
A web embedding service is implemented to provide ColBERT text embedding over HTTP.
Commands to set up the dev environment:
cd webserver
poetry install
poetry shell
cd webserver
uvicorn main:app --reload
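Once the server is running, a client could request embeddings over HTTP along the following lines. This is only a sketch: the `/embed` route and the `{"texts": [...]}` payload are hypothetical placeholders, so check main.py for the routes and request schema the service actually exposes. The address 127.0.0.1:8000 is uvicorn's default when no host or port is given.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # uvicorn's default host and port


def build_embed_request(texts, base_url=BASE_URL, route="/embed"):
    """Build a JSON POST request. The route and payload shape are assumptions."""
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        base_url + route,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed_over_http(texts, timeout=30):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_embed_request(texts), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Building the request separately from sending it keeps the payload easy to inspect without a live server.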