Enable GraphRAG with Neo4J (opea-project#682)
* add graphrag for neo4j

Signed-off-by: XuhuiRen <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add

Signed-off-by: XuhuiRen <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add

Signed-off-by: XuhuiRen <[email protected]>

* add

Signed-off-by: XuhuiRen <[email protected]>

* fix ut

Signed-off-by: XuhuiRen <[email protected]>

* fix

Signed-off-by: XuhuiRen <[email protected]>

* add

Signed-off-by: XuhuiRen <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update retriever_neo4j.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add

Signed-off-by: XuhuiRen <[email protected]>

* Update test_retrievers_neo4j_langchain.sh

* add

Signed-off-by: XuhuiRen <[email protected]>

* Update test_retrievers_neo4j_langchain.sh

* Update test_retrievers_neo4j_langchain.sh

* Update test_retrievers_neo4j_langchain.sh

* add docker

Signed-off-by: XuhuiRen <[email protected]>

* Update retrievers-compose-cd.yaml

* Update test_retrievers_neo4j_langchain.sh

* Update config.py

* Update test_retrievers_neo4j_langchain.sh

* Update test_retrievers_neo4j_langchain.sh

* Update config.py

* Update test_retrievers_neo4j_langchain.sh

* Update requirements.txt

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update requirements.txt

* Update requirements.txt

* Update requirements.txt

---------

Signed-off-by: XuhuiRen <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: lvliang-intel <[email protected]>
3 people authored Sep 15, 2024
1 parent 18092f3 commit 29fe569
Showing 16 changed files with 850 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/docker/compose/dataprep-compose-cd.yaml
@@ -27,3 +27,7 @@ services:
    build:
      dockerfile: comps/dataprep/vdms/langchain/Dockerfile
    image: ${REGISTRY:-opea}/dataprep-vdms:${TAG:-latest}
  dataprep-neo4j:
    build:
      dockerfile: comps/dataprep/neo4j/langchain/Dockerfile
    image: ${REGISTRY:-opea}/dataprep-neo4j:${TAG:-latest}
4 changes: 4 additions & 0 deletions .github/workflows/docker/compose/retrievers-compose-cd.yaml
@@ -27,3 +27,7 @@ services:
    build:
      dockerfile: comps/retrievers/multimodal/redis/langchain/Dockerfile
    image: ${REGISTRY:-opea}/multimodal-retriever-redis:${TAG:-latest}
  retriever-neo4j:
    build:
      dockerfile: comps/retrievers/neo4j/langchain/Dockerfile
    image: ${REGISTRY:-opea}/retriever-neo4j:${TAG:-latest}
38 changes: 38 additions & 0 deletions comps/dataprep/neo4j/langchain/Dockerfile
@@ -0,0 +1,38 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ENV LANG=C.UTF-8

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
default-jre \
libgl1-mesa-glx \
libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/neo4j/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

USER root

RUN mkdir -p /home/user/comps/dataprep/neo4j/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/neo4j/langchain/uploaded_files

USER user

WORKDIR /home/user/comps/dataprep/neo4j/langchain

ENTRYPOINT ["python", "prepare_doc_neo4j.py"]
116 changes: 116 additions & 0 deletions comps/dataprep/neo4j/langchain/README.md
@@ -0,0 +1,116 @@
# Dataprep Microservice with Neo4J

## 🚀Start Microservice with Python

### Install Requirements

```bash
pip install -r requirements.txt
apt-get install libtesseract-dev -y
apt-get install poppler-utils -y
```

### Start Neo4J Server

To launch Neo4j locally, first ensure that Docker is installed. Then start the database with the following command.

```bash
docker run \
-p 7474:7474 -p 7687:7687 \
-v $PWD/data:/data -v $PWD/plugins:/plugins \
--name neo4j-apoc \
-d \
-e NEO4J_AUTH=neo4j/password \
-e NEO4J_PLUGINS=\[\"apoc\"\] \
neo4j:latest
```

### Setup Environment Variables

```bash
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export NEO4J_URI=${your_neo4j_url}
export NEO4J_USERNAME=${your_neo4j_username}
export NEO4J_PASSWORD=${your_neo4j_password}
export PYTHONPATH=${path_to_comps}
```
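
The variables above can be checked before the service starts. The following is a minimal sketch, not part of this commit: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names. Since `config.py` falls back to defaults (`bolt://localhost:7687`, `neo4j`, `test`) when a variable is unset, a check like this only helps catch a silent fallback to those defaults.

```python
import os

# Names taken from the export commands above; treating them as "required"
# is an assumption, since config.py supplies defaults for all three.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment and returns an empty list when all three variables are set.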

### Start Document Preparation Microservice for Neo4J with Python Script

Start the document preparation microservice for Neo4J with the command below.

```bash
python prepare_doc_neo4j.py
```

## 🚀Start Microservice with Docker

### Build Docker Image

```bash
cd ../../../../
docker build -t opea/dataprep-neo4j:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/neo4j/langchain/Dockerfile .
```

### Run Docker with CLI

```bash
docker run -d --name="dataprep-neo4j-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/dataprep-neo4j:latest
```

### Setup Environment Variables

```bash
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export NEO4J_URI=${your_neo4j_url}
export NEO4J_USERNAME=${your_neo4j_username}
export NEO4J_PASSWORD=${your_neo4j_password}
```

### Run Docker with Docker Compose

```bash
cd comps/dataprep/neo4j/langchain
docker compose -f docker-compose-dataprep-neo4j.yaml up -d
```

## Invoke Microservice

Once the document preparation microservice for Neo4J is running, use the command below to invoke it. The service converts the uploaded document to embeddings and saves them to the database.

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
http://localhost:6007/v1/dataprep
```
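
The same request can be built from Python. This is a rough sketch using only the standard library; `build_multipart` and `build_dataprep_request` are hypothetical helpers written for illustration, and the only details taken from the command above are the URL and the `files` form field.

```python
import urllib.request
import uuid


def build_multipart(fields, files):
    """Encode text fields and (filename, bytes) pairs as multipart/form-data.

    Returns a (body_bytes, content_type_header) tuple.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    for filename, content in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(header.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_dataprep_request(url, filename, content, **fields):
    """Build (but do not send) a POST request that uploads one file."""
    body, content_type = build_multipart(fields, [(filename, content)])
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
```

To send the request, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_dataprep_request("http://localhost:6007/v1/dataprep", "file1.txt", open("file1.txt", "rb").read()))`. Extra form fields can be passed as keyword arguments.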

You can specify chunk_size and chunk_overlap with the following command.

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
-F "chunk_size=1500" \
-F "chunk_overlap=100" \
http://localhost:6007/v1/dataprep
```
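
To illustrate what the `chunk_size` and `chunk_overlap` parameters control, here is a simplified character-based splitter. It is an assumption-laden sketch: the splitter the service actually uses is not shown in this part of the commit and may split on separators rather than fixed character offsets. `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size=1500, chunk_overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so context at a
    chunk boundary appears in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

A larger `chunk_overlap` produces more chunks for the same document but makes it less likely that a sentence is cut off from its context.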

Table extraction from PDF documents is supported. You can specify process_table and table_strategy with the command below. "table_strategy" selects how tables are interpreted for table retrieval. As the setting progresses from "fast" to "hq" to "llm", the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is "fast".

Note: If you specify "table_strategy=llm", first start the TGI service (refer to 1.2.1 and 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md), and then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.

To ensure the quality and comprehensiveness of the extracted entities, we recommend using `gpt-4o` as the default model for parsing the document. To enable the OpenAI service, `export OPENAI_API_KEY=xxxx` before using this service.

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./your_file.pdf" \
-F "process_table=true" \
-F "table_strategy=hq" \
http://localhost:6007/v1/dataprep
```
2 changes: 2 additions & 0 deletions comps/dataprep/neo4j/langchain/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
15 changes: 15 additions & 0 deletions comps/dataprep/neo4j/langchain/config.py
@@ -0,0 +1,15 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

# Neo4J configuration
NEO4J_URL = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "test")

# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = os.getenv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = os.getenv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = os.getenv("TEI_ENDPOINT")
OPENAI_KEY = os.getenv("OPENAI_API_KEY")
48 changes: 48 additions & 0 deletions comps/dataprep/neo4j/langchain/docker-compose-dataprep-neo4j.yaml
@@ -0,0 +1,48 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3"
services:
  neo4j-vector-db:
    image: neo4j:latest
    container_name: neo4j-graph-db
    ports:
      - "7474:7474"
      - "7687:7687"
  tgi_gaudi_service:
    image: ghcr.io/huggingface/tgi-gaudi:2.0.1
    container_name: tgi-service
    ports:
      - "8088:80"
    volumes:
      - "./data:/data"
    shm_size: 1g
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HF_TOKEN}
    command: --model-id ${LLM_MODEL_ID} --auto-truncate --max-input-tokens 1024 --max-total-tokens 2048
  dataprep-neo4j:
    image: opea/gen-ai-comps:dataprep-neo4j-xeon-server
    container_name: dataprep-neo4j-server
    depends_on:
      - neo4j-vector-db
      - tgi_gaudi_service
    ports:
      - "6007:6007"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      NEO4J_URI: ${NEO4J_URI}
      NEO4J_USERNAME: ${NEO4J_USERNAME}
      NEO4J_PASSWORD: ${NEO4J_PASSWORD}
      TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
    restart: unless-stopped

networks:
  default:
    driver: bridge