Showing 16 changed files with 850 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
FROM python:3.11-slim | ||
|
||
ENV LANG=C.UTF-8 | ||
|
||
ARG ARCH="cpu" | ||
|
||
RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \ | ||
build-essential \ | ||
default-jre \ | ||
libgl1-mesa-glx \ | ||
libjemalloc-dev | ||
|
||
RUN useradd -m -s /bin/bash user && \ | ||
mkdir -p /home/user && \ | ||
chown -R user /home/user/ | ||
|
||
USER user | ||
|
||
COPY comps /home/user/comps | ||
|
||
RUN pip install --no-cache-dir --upgrade pip setuptools && \ | ||
if [ "${ARCH}" = "cpu" ]; then pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \ | ||
pip install --no-cache-dir -r /home/user/comps/dataprep/neo4j/langchain/requirements.txt | ||
|
||
ENV PYTHONPATH=$PYTHONPATH:/home/user | ||
|
||
USER root | ||
|
||
RUN mkdir -p /home/user/comps/dataprep/neo4j/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/neo4j/langchain/uploaded_files | ||
|
||
USER user | ||
|
||
WORKDIR /home/user/comps/dataprep/neo4j/langchain | ||
|
||
ENTRYPOINT ["python", "prepare_doc_neo4j.py"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
# Dataprep Microservice with Neo4J | ||
|
||
## 🚀Start Microservice with Python | ||
|
||
### Install Requirements | ||
|
||
```bash | ||
pip install -r requirements.txt | ||
apt-get install libtesseract-dev -y | ||
apt-get install poppler-utils -y | ||
``` | ||
|
||
### Start Neo4J Server | ||
|
||
To launch Neo4j locally, first make sure Docker is installed, then start the database with the following command. | ||
|
||
```bash | ||
docker run \ | ||
-p 7474:7474 -p 7687:7687 \ | ||
-v $PWD/data:/data -v $PWD/plugins:/plugins \ | ||
--name neo4j-apoc \ | ||
-d \ | ||
-e NEO4J_AUTH=neo4j/password \ | ||
-e NEO4J_PLUGINS=\[\"apoc\"\] \ | ||
neo4j:latest | ||
``` | ||
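Before starting the microservice, it can help to confirm that the Bolt port published by the `docker run` command above is actually accepting connections. The following is a minimal sketch using only the Python standard library; the helper names `parse_bolt_uri` and `is_reachable` are hypothetical and not part of this component, and the default port 7687 is taken from the `-p 7687:7687` mapping above.

```python
import socket
from urllib.parse import urlparse

DEFAULT_BOLT_PORT = 7687  # Bolt port published by the docker command above


def parse_bolt_uri(uri: str) -> tuple[str, int]:
    """Split a URI such as bolt://localhost:7687 into (host, port)."""
    parsed = urlparse(uri)
    if not parsed.hostname:
        raise ValueError(f"no host found in URI: {uri!r}")
    # Fall back to the default Bolt port when the URI omits one.
    return parsed.hostname, parsed.port or DEFAULT_BOLT_PORT


def is_reachable(uri: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URI's host and port succeeds."""
    host, port = parse_bolt_uri(uri)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_reachable("bolt://localhost:7687")` only checks that the TCP port is open; it does not verify the Neo4j credentials or that the APOC plugin has finished loading.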
|
||
### Setup Environment Variables | ||
|
||
```bash | ||
export no_proxy=${your_no_proxy} | ||
export http_proxy=${your_http_proxy} | ||
export https_proxy=${your_http_proxy} | ||
export NEO4J_URI=${your_neo4j_url} | ||
export NEO4J_USERNAME=${your_neo4j_username} | ||
export NEO4J_PASSWORD=${your_neo4j_password} | ||
export PYTHONPATH=${path_to_comps} | ||
``` | ||
|
||
### Start Document Preparation Microservice for Neo4J with Python Script | ||
|
||
Start the document preparation microservice for Neo4J with the command below. | ||
|
||
```bash | ||
python prepare_doc_neo4j.py | ||
``` | ||
|
||
## 🚀Start Microservice with Docker | ||
|
||
### Build Docker Image | ||
|
||
```bash | ||
cd ../../../../ | ||
docker build -t opea/dataprep-neo4j:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/neo4j/langchain/Dockerfile . | ||
``` | ||
|
||
### Run Docker with CLI | ||
|
||
```bash | ||
docker run -d --name="dataprep-neo4j-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/dataprep-neo4j:latest | ||
``` | ||
|
||
### Setup Environment Variables | ||
|
||
```bash | ||
export no_proxy=${your_no_proxy} | ||
export http_proxy=${your_http_proxy} | ||
export https_proxy=${your_http_proxy} | ||
export NEO4J_URI=${your_neo4j_url} | ||
export NEO4J_USERNAME=${your_neo4j_username} | ||
export NEO4J_PASSWORD=${your_neo4j_password} | ||
``` | ||
|
||
### Run Docker with Docker Compose | ||
|
||
```bash | ||
cd comps/dataprep/neo4j/langchain | ||
docker compose -f docker-compose-dataprep-neo4j.yaml up -d | ||
``` | ||
|
||
## Invoke Microservice | ||
|
||
Once the document preparation microservice for Neo4J is running, use the command below to invoke it. The service converts the document to embeddings and saves them to the database. | ||
|
||
```bash | ||
curl -X POST \ | ||
-H "Content-Type: multipart/form-data" \ | ||
-F "files=@./file1.txt" \ | ||
http://localhost:6007/v1/dataprep | ||
``` | ||
|
||
You can specify `chunk_size` and `chunk_overlap` with the following command. | ||
|
||
```bash | ||
curl -X POST \ | ||
-H "Content-Type: multipart/form-data" \ | ||
-F "files=@./file1.txt" \ | ||
-F "chunk_size=1500" \ | ||
-F "chunk_overlap=100" \ | ||
http://localhost:6007/v1/dataprep | ||
``` | ||
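To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: it counts characters and uses a plain sliding window, which is an assumption and not necessarily how the microservice's own splitter measures or places chunk boundaries. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    # With chunk_size=1500 and chunk_overlap=100, each window advances by 1400 characters.
    print([len(c) for c in split_with_overlap("x" * 3000, 1500, 100)])
```

Under these assumptions, a larger `chunk_overlap` repeats more text between neighbouring chunks, which increases the number of chunks stored for the same document.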
|
||
Table extraction from PDF documents is supported. Use `process_table` to enable it and `table_strategy` to choose how tables are interpreted for table retrieval. Moving from "fast" to "hq" to "llm" gives deeper table understanding at the expense of processing speed. The default strategy is "fast". | ||
|
||
Note: If you specify "table_strategy=llm", first start the TGI service (see sections 1.2.1 and 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md), and then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`. | ||
|
||
To ensure the quality and comprehensiveness of the extracted entities, we recommend using `gpt-4o` as the model for parsing the document. To enable the OpenAI service, `export OPENAI_API_KEY=xxxx` (the variable read in `config.py`) before using this service. | ||
|
||
```bash | ||
curl -X POST \ | ||
-H "Content-Type: multipart/form-data" \ | ||
-F "files=@./your_file.pdf" \ | ||
-F "process_table=true" \ | ||
-F "table_strategy=hq" \ | ||
http://localhost:6007/v1/dataprep | ||
``` |
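The optional form fields used in the curl examples above can also be assembled programmatically. The sketch below builds the field set and the equivalent curl argument list; the helper names `build_dataprep_fields` and `curl_command` are hypothetical, and the check against "fast", "hq" and "llm" is a client-side assumption based on the strategies named in this README, not a description of the server's own validation.

```python
from typing import Optional

DATAPREP_URL = "http://localhost:6007/v1/dataprep"  # endpoint used in the curl examples
TABLE_STRATEGIES = ("fast", "hq", "llm")  # strategies named in this README


def build_dataprep_fields(
    chunk_size: Optional[int] = None,
    chunk_overlap: Optional[int] = None,
    process_table: bool = False,
    table_strategy: str = "fast",
) -> dict[str, str]:
    """Return the optional multipart form fields as strings, omitting unset ones."""
    if table_strategy not in TABLE_STRATEGIES:
        raise ValueError(f"table_strategy must be one of {TABLE_STRATEGIES}")
    fields: dict[str, str] = {}
    if chunk_size is not None:
        fields["chunk_size"] = str(chunk_size)
    if chunk_overlap is not None:
        fields["chunk_overlap"] = str(chunk_overlap)
    if process_table:
        fields["process_table"] = "true"
        fields["table_strategy"] = table_strategy
    return fields


def curl_command(file_path: str, fields: dict[str, str], url: str = DATAPREP_URL) -> list[str]:
    """Build the curl argument list matching the examples in this README."""
    cmd = ["curl", "-X", "POST", "-H", "Content-Type: multipart/form-data",
           "-F", f"files=@{file_path}"]
    for name, value in fields.items():
        cmd += ["-F", f"{name}={value}"]
    cmd.append(url)
    return cmd
```

The returned list can be passed to `subprocess.run` once the service is up; keeping it as a list avoids shell quoting issues with file names that contain spaces.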
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
import os | ||
|
||
# Neo4J configuration | ||
NEO4J_URL = os.getenv("NEO4J_URI", "bolt://localhost:7687") | ||
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME", "neo4j") | ||
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "test") | ||
|
||
# LLM/Embedding endpoints | ||
TGI_LLM_ENDPOINT = os.getenv("TGI_LLM_ENDPOINT", "http://localhost:8080") | ||
TGI_LLM_ENDPOINT_NO_RAG = os.getenv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081") | ||
TEI_EMBEDDING_ENDPOINT = os.getenv("TEI_ENDPOINT") | ||
OPENAI_KEY = os.getenv("OPENAI_API_KEY") |
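Note that several Python names in this configuration differ from the environment variables they read: `NEO4J_URL` comes from `NEO4J_URI`, `TEI_EMBEDDING_ENDPOINT` from `TEI_ENDPOINT`, and `OPENAI_KEY` from `OPENAI_API_KEY`. The sketch below mirrors those lookups against an explicit mapping so the defaults and unset values can be inspected without touching the real environment. The helpers `resolve_settings` and `unset_settings` are hypothetical and not part of the commit.

```python
import os
from typing import Mapping, Optional


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve the same settings as the configuration above from a given mapping."""
    env = os.environ if env is None else env
    return {
        "NEO4J_URL": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "NEO4J_USERNAME": env.get("NEO4J_USERNAME", "neo4j"),
        "NEO4J_PASSWORD": env.get("NEO4J_PASSWORD", "test"),
        "TGI_LLM_ENDPOINT": env.get("TGI_LLM_ENDPOINT", "http://localhost:8080"),
        "TGI_LLM_ENDPOINT_NO_RAG": env.get("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081"),
        "TEI_EMBEDDING_ENDPOINT": env.get("TEI_ENDPOINT"),  # no default
        "OPENAI_KEY": env.get("OPENAI_API_KEY"),  # no default
    }


def unset_settings(settings: Mapping[str, Optional[str]]) -> list[str]:
    """List the settings that resolved to None, i.e. have no default and were not exported."""
    return sorted(name for name, value in settings.items() if value is None)


if __name__ == "__main__":
    print(unset_settings(resolve_settings()))
```

Exporting `NEO4J_URL` or `OPENAI_KEY` has no effect under this configuration, because those are not the names passed to `os.getenv`.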
comps/dataprep/neo4j/langchain/docker-compose-dataprep-neo4j.yaml (48 additions, 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
version: "3" | ||
services: | ||
neo4j-vector-db: | ||
image: neo4j:latest | ||
container_name: neo4j-graph-db | ||
ports: | ||
- "7474:7474" | ||
- "7687:7687" | ||
tgi_gaudi_service: | ||
image: ghcr.io/huggingface/tgi-gaudi:2.0.1 | ||
container_name: tgi-service | ||
ports: | ||
- "8088:80" | ||
volumes: | ||
- "./data:/data" | ||
shm_size: 1g | ||
environment: | ||
no_proxy: ${no_proxy} | ||
http_proxy: ${http_proxy} | ||
https_proxy: ${https_proxy} | ||
HF_TOKEN: ${HF_TOKEN} | ||
command: --model-id ${LLM_MODEL_ID} --auto-truncate --max-input-tokens 1024 --max-total-tokens 2048 | ||
dataprep-neo4j: | ||
image: opea/gen-ai-comps:dataprep-neo4j-xeon-server | ||
container_name: dataprep-neo4j-server | ||
depends_on: | ||
- neo4j-vector-db | ||
- tgi_gaudi_service | ||
ports: | ||
- "6007:6007" | ||
ipc: host | ||
environment: | ||
no_proxy: ${no_proxy} | ||
http_proxy: ${http_proxy} | ||
https_proxy: ${https_proxy} | ||
NEO4J_URI: ${NEO4J_URI} | ||
NEO4J_USERNAME: ${NEO4J_USERNAME} | ||
NEO4J_PASSWORD: ${NEO4J_PASSWORD} | ||
TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT} | ||
OPENAI_API_KEY: ${OPENAI_API_KEY} | ||
restart: unless-stopped | ||
|
||
networks: | ||
default: | ||
driver: bridge |