Retriever for PGVector (#213)

* Retriever for PGVector
  Signed-off-by: V, Ganesan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Added testcase and fixed folder structure
  Signed-off-by: V, Ganesan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Addressed review comments
  Signed-off-by: V, Ganesan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fixes for test case failure
  Signed-off-by: V, Ganesan <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Signed-off-by: V, Ganesan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
opea-project · Jun 24, 2024 · 75eff63
1 parent 5748471
Showing 9 changed files with 366 additions and 0 deletions.
diff --git a/comps/retrievers/README.md b/comps/retrievers/README.md
@@ -13,3 +13,7 @@ For details, please refer to this [readme](langchain/redis/README.md)
 # Retriever Microservice with Milvus
 
 For details, please refer to this [readme](langchain/milvus/README.md)
+
+# Retriever Microservice with PGVector
+
+For details, please refer to this [readme](langchain/pgvector/README.md)
diff --git a/comps/retrievers/langchain/pgvector/README.md b/comps/retrievers/langchain/pgvector/README.md
@@ -0,0 +1,123 @@
+# Retriever Microservice
+
+This retriever microservice is a search service for embedding vectors. It receives an embedding vector as input and runs a similarity search against the vectors stored in a vector database. Users must specify the database's URL and the index name, and the service searches within that index to find the documents most similar to the input vector.
+
+The service uses similarity measures in vector space to retrieve documents with similar content. Vector-based retrieval is well suited to large datasets, where it returns relevant results quickly.
+
+Overall, this microservice provides backend support for applications that need efficient similarity search, such as recommendation systems, information retrieval, or any other context where measuring document similarity is central.
+
+# 🚀1. Start Microservice with Python (Option 1)
+
+To start the retriever microservice, you must first install the required Python packages.
+
+## 1.1 Install Requirements
+
+```bash
+pip install -r requirements.txt
+```
+
+## 1.2 Start TEI Service
+
+```bash
+export LANGCHAIN_TRACING_V2=true
+export LANGCHAIN_API_KEY=${your_langchain_api_key}
+export LANGCHAIN_PROJECT="opea/retriever"
+model=BAAI/bge-base-en-v1.5
+revision=refs/pr/4
+volume=$PWD/data
+docker run -d -p 6060:80 -v $volume:/data -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model --revision $revision
+```
+
+## 1.3 Verify the TEI Service
+
+```bash
+curl 127.0.0.1:6060/embed \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?"}' \
+    -H 'Content-Type: application/json'
+```
+
+## 1.4 Setup VectorDB Service
+
+You need to set up your own VectorDB service (PGVector in this example) and ingest your knowledge documents into the vector database.
+
+For PGVector, you can start a Docker container with the following commands.
+Remember to ingest data into it manually.
+
+```bash
+export POSTGRES_USER=testuser
+export POSTGRES_PASSWORD=testpwd
+export POSTGRES_DB=vectordb
+
+docker run --name vectorstore-postgres -e POSTGRES_USER=${POSTGRES_USER} -e POSTGRES_HOST_AUTH_METHOD=trust -e POSTGRES_DB=${POSTGRES_DB} -e POSTGRES_PASSWORD=${POSTGRES_PASSWORD} -d -v ./init.sql:/docker-entrypoint-initdb.d/init.sql -p 5432:5432 pgvector/pgvector:0.7.0-pg16
+```
+
+## 1.5 Start Retriever Service
+
+```bash
+export TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060"
+python retriever_pgvector.py
+```
+
+# 🚀2. Start Microservice with Docker (Option 2)
+
+## 2.1 Setup Environment Variables
+
+```bash
+export RETRIEVE_MODEL_ID="BAAI/bge-base-en-v1.5"
+export PG_CONNECTION_STRING=postgresql+psycopg2://testuser:testpwd@${your_ip}:5432/vectordb
+export INDEX_NAME=${your_index_name}
+export TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060"
+export LANGCHAIN_TRACING_V2=true
+export LANGCHAIN_API_KEY=${your_langchain_api_key}
+export LANGCHAIN_PROJECT="opea/retrievers"
+```
+
+## 2.2 Build Docker Image
+
+```bash
+# Run from the repository root: the build context must contain the comps/ directory
+docker build -t opea/retriever-pgvector:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/retrievers/langchain/pgvector/docker/Dockerfile .
+```
+
+To start a docker container, you have two options:
+
+- A. Run Docker with CLI
+- B. Run Docker with Docker Compose
+
+You can choose one as needed.
+
+## 2.3 Run Docker with CLI (Option A)
+
+```bash
+docker run -d --name="retriever-pgvector" -p 7000:7000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e PG_CONNECTION_STRING=$PG_CONNECTION_STRING -e INDEX_NAME=$INDEX_NAME -e TEI_EMBEDDING_ENDPOINT=$TEI_EMBEDDING_ENDPOINT opea/retriever-pgvector:latest
+```
+
+## 2.4 Run Docker with Docker Compose (Option B)
+
+```bash
+cd comps/retrievers/langchain/pgvector/docker
+docker compose -f docker_compose_retriever.yaml up -d
+```
+
+# 🚀3. Consume Retriever Service
+
+## 3.1 Check Service Status
+
+```bash
+curl http://localhost:7000/v1/health_check \
+  -X GET \
+  -H 'Content-Type: application/json'
+```
+
+## 3.2 Consume Retriever Service
+
+To consume the Retriever Microservice, you can generate a mock embedding vector of length 768 with Python.
+
+```bash
+your_embedding=$(python -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")
+curl http://${your_ip}:7000/v1/retrieval \
+  -X POST \
+  -d "{\"text\":\"What is the revenue of Nike in 2023?\",\"embedding\":${your_embedding}}" \
+  -H 'Content-Type: application/json'
+```
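Review note on README section 3.2: the same request can be assembled from Python rather than a shell one-liner. The sketch below only builds the request and checks the vector length; it assumes the service listens on `http://localhost:7000` and accepts the `text` and `embedding` fields shown in the curl example. The helper names (`build_retrieval_request`, `mock_embedding`) are hypothetical and not part of this commit.

```python
import json
import random
import urllib.request

EMBEDDING_DIM = 768  # length of the mock vector used in the README


def mock_embedding(seed=0):
    """Return a reproducible random vector, mirroring the README one-liner."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]


def build_retrieval_request(text, embedding, base_url="http://localhost:7000"):
    """Build (but do not send) a POST request for the /v1/retrieval endpoint."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    body = json.dumps({"text": text, "embedding": embedding}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/retrieval",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_retrieval_request(
        "What is the revenue of Nike in 2023?", mock_embedding()
    )
    # Sending requires a running retriever service; uncomment to try it.
    # with urllib.request.urlopen(request, timeout=10) as response:
    #     print(json.load(response))
    print(request.full_url, len(request.data), "bytes")
```

Checking the vector length on the client side gives a clearer error than a validation failure returned by the service, which expects an `EmbedDoc768` input.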
diff --git a/comps/retrievers/langchain/pgvector/config.py b/comps/retrievers/langchain/pgvector/config.py
@@ -0,0 +1,17 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import os
+
+# Embedding model
+
+EMBED_MODEL = os.getenv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")
+
+PG_CONNECTION_STRING = os.getenv("PG_CONNECTION_STRING", "localhost")
+
+# Vector Index Configuration
+INDEX_NAME = os.getenv("INDEX_NAME", "rag-pgvector")
+
+current_file_path = os.path.abspath(__file__)
+parent_dir = os.path.dirname(current_file_path)
+PORT = int(os.getenv("RETRIEVER_PORT", "7000"))
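Review note on config.py: `os.getenv` returns a string whenever the variable is set, so the port can end up as either `str` or `int` depending on the environment. A minimal sketch of stricter parsing follows, together with a helper that assembles a connection string in the `postgresql+psycopg2://` form used in the README. Both helpers (`read_port`, `build_pg_connection_string`) are hypothetical illustrations, not code from this commit.

```python
import os
from urllib.parse import quote_plus


def read_port(env=os.environ, name="RETRIEVER_PORT", default=7000):
    """Return the port as an int whether or not the variable is set."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    port = int(raw)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"{name} out of range: {port}")
    return port


def build_pg_connection_string(user, password, host, port=5432, database="vectordb"):
    """Assemble a SQLAlchemy-style URL, percent-encoding the credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    print(read_port({"RETRIEVER_PORT": "7001"}))
    print(build_pg_connection_string("testuser", "testpwd", "localhost"))
```

Percent-encoding matters when a password contains characters such as `@` or `/`, which would otherwise be read as URL delimiters.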
diff --git a/comps/retrievers/langchain/pgvector/docker/Dockerfile b/comps/retrievers/langchain/pgvector/docker/Dockerfile
@@ -0,0 +1,29 @@
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+FROM langchain/langchain:latest
+
+RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
+    libgl1-mesa-glx \
+    libjemalloc-dev \
+    vim
+
+RUN useradd -m -s /bin/bash user && \
+    mkdir -p /home/user && \
+    chown -R user /home/user/
+
+COPY comps /home/user/comps
+
+RUN chmod +x /home/user/comps/retrievers/langchain/pgvector/run.sh
+
+USER user
+
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r /home/user/comps/retrievers/langchain/pgvector/requirements.txt
+
+ENV PYTHONPATH=$PYTHONPATH:/home/user
+
+WORKDIR /home/user/comps/retrievers/langchain/pgvector
+
+ENTRYPOINT ["/home/user/comps/retrievers/langchain/pgvector/run.sh"]
diff --git a/comps/retrievers/langchain/pgvector/docker/docker_compose_retriever.yaml b/comps/retrievers/langchain/pgvector/docker/docker_compose_retriever.yaml
@@ -0,0 +1,31 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+version: "3.8"
+
+services:
+  tei_xeon_service:
+    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
+    container_name: tei-xeon-server
+    ports:
+      - "6060:80"
+    volumes:
+      - "./data:/data"
+    shm_size: 1g
+    command: --model-id ${RETRIEVE_MODEL_ID}
+  retriever:
+    image: opea/retriever-pgvector:latest
+    container_name: retriever-pgvector
+    ports:
+      - "7000:7000"
+    ipc: host
+    environment:
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      PG_CONNECTION_STRING: ${PG_CONNECTION_STRING}
+      LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
+    restart: unless-stopped
+
+networks:
+  default:
+    driver: bridge
diff --git a/comps/retrievers/langchain/pgvector/requirements.txt b/comps/retrievers/langchain/pgvector/requirements.txt
@@ -0,0 +1,14 @@
+docarray[full]
+easyocr
+fastapi
+langchain_community
+langsmith
+opentelemetry-api
+opentelemetry-exporter-otlp
+opentelemetry-sdk
+pgvector==0.2.5
+prometheus-fastapi-instrumentator==7.0.0
+psycopg2-binary
+pymupdf
+sentence_transformers
+shortuuid
diff --git a/comps/retrievers/langchain/pgvector/retriever_pgvector.py b/comps/retrievers/langchain/pgvector/retriever_pgvector.py
@@ -0,0 +1,60 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import os
+import time
+
+from config import EMBED_MODEL, INDEX_NAME, PG_CONNECTION_STRING, PORT
+from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceHubEmbeddings
+from langchain_community.vectorstores import PGVector
+from langsmith import traceable
+
+from comps import (
+    EmbedDoc768,
+    SearchedDoc,
+    ServiceType,
+    TextDoc,
+    opea_microservices,
+    register_microservice,
+    register_statistics,
+    statistics_dict,
+)
+
+tei_embedding_endpoint = os.getenv("TEI_EMBEDDING_ENDPOINT")
+
+
+@register_microservice(
+    name="opea_service@retriever_pgvector",
+    service_type=ServiceType.RETRIEVER,
+    endpoint="/v1/retrieval",
+    host="0.0.0.0",
+    port=PORT,
+)
+@traceable(run_type="retriever")
+@register_statistics(names=["opea_service@retriever_pgvector"])
+def retrieve(input: EmbedDoc768) -> SearchedDoc:
+    start = time.time()
+    search_res = vector_db.similarity_search_by_vector(embedding=input.embedding)
+    searched_docs = []
+    for r in search_res:
+        searched_docs.append(TextDoc(text=r.page_content))
+    result = SearchedDoc(retrieved_docs=searched_docs, initial_query=input.text)
+    statistics_dict["opea_service@retriever_pgvector"].append_latency(time.time() - start, None)
+    return result
+
+
+if __name__ == "__main__":
+    # Create vectorstore
+    if tei_embedding_endpoint:
+        # create embeddings using TEI endpoint service
+        embeddings = HuggingFaceHubEmbeddings(model=tei_embedding_endpoint)
+    else:
+        # create embeddings using local embedding model
+        embeddings = HuggingFaceBgeEmbeddings(model_name=EMBED_MODEL)
+
+    vector_db = PGVector(
+        embedding_function=embeddings,
+        collection_name=INDEX_NAME,
+        connection_string=PG_CONNECTION_STRING,
+    )
+    opea_microservices["opea_service@retriever_pgvector"].start()
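Review note on `retrieve`: the call to `similarity_search_by_vector` delegates the ranking to PostgreSQL through the pgvector extension. As a rough mental model only, the sketch below ranks in-memory documents by cosine similarity to a query vector. The cosine metric and `k=4` are assumptions made for illustration; the metric actually used depends on how the `PGVector` store is configured, and `top_k_by_vector` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_vector(query, documents, k=4):
    """Rank (text, vector) pairs by similarity to query and return the top k texts."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    print(top_k_by_vector([1.0, 0.0], docs, k=2))
```

A linear scan like this is fine for a handful of vectors; a database-backed store exists precisely so that the search does not have to load and score every document in application code.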
diff --git a/comps/retrievers/langchain/pgvector/run.sh b/comps/retrievers/langchain/pgvector/run.sh
@@ -0,0 +1,9 @@
+#!/bin/sh
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+cd /home/user/comps/retrievers/langchain/pgvector
+python ingest.py
+
+python retriever_pgvector.py
diff --git a/tests/test_retrievers_langchain_pgvector.sh b/tests/test_retrievers_langchain_pgvector.sh