adding dataprep support for CLIP based models for VideoRAGQnA example for v1.0 #621

Merged: 32 commits, Sep 11, 2024

Commits
b469976  dataprep service (srinarayan-srikanthan, Sep 3, 2024)
e87b159  dataprep updates (srinarayan-srikanthan, Sep 3, 2024)
dc3b5b7  rearranged dirs (srinarayan-srikanthan, Sep 4, 2024)
4045cb8  added readme (srinarayan-srikanthan, Sep 4, 2024)
d4c9441  removed checks (srinarayan-srikanthan, Sep 4, 2024)
40117cb  added features (srinarayan-srikanthan, Sep 4, 2024)
f9d1e2b  added get method (srinarayan-srikanthan, Sep 5, 2024)
cde7557  Merge branch 'opea-project:main' into sri-clip-dataprep (srinarayan-srikanthan, Sep 5, 2024)
ea8e83e  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Sep 5, 2024)
200e318  add dim at init, rm unused (BaoHuiling, Sep 5, 2024)
c6e12f1  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Sep 5, 2024)
b07036e  add wait after connect DB (BaoHuiling, Sep 6, 2024)
0afc7b5  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Sep 6, 2024)
9261a4a  remove unused (BaoHuiling, Sep 6, 2024)
b06006a  Update comps/dataprep/vdms/README.md (BaoHuiling, Sep 10, 2024)
56c578f  add test script for mm case (BaoHuiling, Sep 10, 2024)
dc11dc2  add return value and update readme (BaoHuiling, Sep 10, 2024)
04e1224  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Sep 10, 2024)
ea465e4  check bug (BaoHuiling, Sep 10, 2024)
acc7a05  fix mm-script (BaoHuiling, Sep 10, 2024)
a66da36  add into dataprep workflow (BaoHuiling, Sep 10, 2024)
2699710  rm whitespace (BaoHuiling, Sep 10, 2024)
ebe5a91  updated readme and added test script (srinarayan-srikanthan, Sep 10, 2024)
2b6f6d5  removed unused file (srinarayan-srikanthan, Sep 10, 2024)
808f1f7  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Sep 10, 2024)
003cdef  Merge branch 'main' into sri-clip-dataprep (srinarayan-srikanthan, Sep 10, 2024)
9fe2571  move test script (BaoHuiling, Sep 10, 2024)
ebe7c7d  restructured repo (srinarayan-srikanthan, Sep 11, 2024)
cb2c033  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Sep 11, 2024)
a8d2657  updates path in test script (srinarayan-srikanthan, Sep 11, 2024)
717c9bd  Merge branch 'main' into sri-clip-dataprep (srinarayan-srikanthan, Sep 11, 2024)
1fbc343  add name for build (BaoHuiling, Sep 11, 2024)
3 changes: 3 additions & 0 deletions .github/workflows/docker/compose/dataprep-compose-cd.yaml
Expand Up @@ -23,3 +23,6 @@ services:
build:
dockerfile: comps/dataprep/pinecone/langchain/Dockerfile
image: ${REGISTRY:-opea}/dataprep-pinecone:${TAG:-latest}
dataprep-vdms:
build:
dockerfile: comps/dataprep/vdms/multimodal_langchain/docker/Dockerfile
189 changes: 189 additions & 0 deletions comps/dataprep/vdms/README.md
@@ -0,0 +1,189 @@
# Dataprep Microservice with VDMS

For dataprep microservice, we currently provide one framework: `Langchain`.

<!-- We also provide `Langchain_ray` which uses ray to parallel the data prep for multi-file performance improvement(observed 5x - 15x speedup by processing 1000 files/links.). -->

The folders are organized consistently across frameworks, so you can start the dataprep microservice with the following instructions.

# 🚀1. Start Microservice with Python (Option 1)

## 1.1 Install Requirements

Install the single-process version (for processing 1-10 files):

```bash
apt-get update
apt-get install -y default-jre tesseract-ocr libtesseract-dev poppler-utils
cd langchain
pip install -r requirements.txt
```

<!-- - option 2: Install multi-process version (for >10 files processing)

```bash
cd langchain_ray; pip install -r requirements_ray.txt
``` -->

## 1.2 Start VDMS Server

Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).

## 1.3 Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export COLLECTION_NAME=${your_collection_name}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="opea/gen-ai-comps:dataprep"
export PYTHONPATH=${path_to_comps}
```

## 1.4 Start Document Preparation Microservice for VDMS with Python Script

Start the document preparation microservice for VDMS with the command below.

Single-process version (for processing 1-10 files):

```bash
python prepare_doc_vdms.py
```

<!-- - option 2: Start multi-process version (for >10 files processing)

```bash
python prepare_doc_redis_on_ray.py
``` -->

# 🚀2. Start Microservice with Docker (Option 2)

## 2.1 Start VDMS Server

Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).

## 2.2 Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export TEI_ENDPOINT=${your_tei_endpoint}
export COLLECTION_NAME=${your_collection_name}
export SEARCH_ENGINE="FaissFlat"
export DISTANCE_STRATEGY="L2"
export PYTHONPATH=${path_to_comps}
```

## 2.3 Build Docker Image

- Build the Docker image with Langchain

Single-process version (for processing 1-10 files):

```bash
cd ../../../
docker build -t opea/dataprep-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain/Dockerfile .
```

<!-- - option 2: Start multi-process version (for >10 files processing)

```bash
cd ../../../../
docker build -t opea/dataprep-on-ray-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain_ray/Dockerfile .
``` -->

## 2.4 Run Docker with CLI

Single-process version (for processing 1-10 files):

```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TEI_ENDPOINT=$TEI_ENDPOINT \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
opea/dataprep-vdms:latest
```

<!-- - option 2: Start multi-process version (for >10 files processing)

```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
-e TIMEOUT_SECONDS=600 opea/dataprep-on-ray-vdms:latest
``` -->

# 🚀3. Check Microservice Status

View the container logs to verify the service is running:

```bash
docker container logs -f dataprep-vdms-server
```

# 🚀4. Consume Microservice

Once the document preparation microservice for VDMS is started, you can invoke it with the commands below to convert documents into embeddings and save them to the database.

Make sure the file path after `files=@` is correct.

- Single file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
http://localhost:6007/v1/dataprep
```

You can specify `chunk_size` and `chunk_overlap` with the following command.

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./LLAMA2_page6.pdf" \
-F "chunk_size=1500" \
-F "chunk_overlap=100" \
http://localhost:6007/v1/dataprep
```

- Multiple file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
-F "files=@./file2.txt" \
-F "files=@./file3.txt" \
http://localhost:6007/v1/dataprep
```

- Link upload (not currently supported for llama_index)

```bash
curl -X POST \
-F 'link_list=["https://www.ces.tech/"]' \
http://localhost:6007/v1/dataprep
```

or

```python
import requests
import json

proxies = {"http": ""}
url = "http://localhost:6007/v1/dataprep"
urls = [
"https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
payload = {"link_list": json.dumps(urls)}

try:
resp = requests.post(url=url, data=payload, proxies=proxies)
print(resp.text)
resp.raise_for_status() # Raise an exception for unsuccessful HTTP status codes
print("Request successful!")
except requests.exceptions.RequestException as e:
print("An error occurred:", e)
```
39 changes: 39 additions & 0 deletions comps/dataprep/vdms/langchain/Dockerfile
@@ -0,0 +1,39 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ENV LANG=C.UTF-8

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libcairo2-dev \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/vdms/langchain/requirements.txt

ENV PYTHONPATH=/home/user

USER root

RUN mkdir -p /home/user/comps/dataprep/vdms/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/vdms/langchain

USER user

WORKDIR /home/user/comps/dataprep/vdms/langchain

ENTRYPOINT ["python", "prepare_doc_vdms.py"]
2 changes: 2 additions & 0 deletions comps/dataprep/vdms/langchain/__init__.py
@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
33 changes: 33 additions & 0 deletions comps/dataprep/vdms/langchain/config.py
@@ -0,0 +1,33 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os


def getEnv(key, default_value=None):
env_value = os.getenv(key, default=default_value)
print(f"{key}: {env_value}")
return env_value


# Embedding model
EMBED_MODEL = getEnv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

# VDMS configuration
VDMS_HOST = getEnv("VDMS_HOST", "localhost")
VDMS_PORT = int(getEnv("VDMS_PORT", 55555))
COLLECTION_NAME = getEnv("COLLECTION_NAME", "rag-vdms")
SEARCH_ENGINE = getEnv("SEARCH_ENGINE", "FaissFlat")
DISTANCE_STRATEGY = getEnv("DISTANCE_STRATEGY", "L2")

# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = getEnv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = getEnv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = getEnv("TEI_ENDPOINT")

# chunk parameters
CHUNK_SIZE = getEnv("CHUNK_SIZE", 1500)
CHUNK_OVERLAP = getEnv("CHUNK_OVERLAP", 100)

current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)
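One subtlety of the `getEnv` helper above: `os.getenv` returns the default unchanged when the variable is unset, but always returns a string when it is set. So `CHUNK_SIZE` is an `int` only when the environment variable is absent, which is why `VDMS_PORT` is wrapped in `int(...)`. A minimal sketch of that behavior, using only the standard library (the print of the key/value is omitted here):

```python
import os


def getEnv(key, default_value=None):
    # Mirrors the helper above: env values come back as strings,
    # while an unset key returns the default with its original type.
    return os.getenv(key, default=default_value)


os.environ.pop("CHUNK_SIZE", None)
print(type(getEnv("CHUNK_SIZE", 1500)))  # <class 'int'> (default passed through)

os.environ["CHUNK_SIZE"] = "2000"
print(type(getEnv("CHUNK_SIZE", 1500)))  # <class 'str'> (env values are strings)

# Casting explicitly, as done for VDMS_PORT, keeps the type consistent:
chunk_size = int(getEnv("CHUNK_SIZE", 1500))
```

Callers that do arithmetic on `CHUNK_SIZE` or `CHUNK_OVERLAP` should apply the same explicit `int(...)` cast.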
28 changes: 28 additions & 0 deletions comps/dataprep/vdms/langchain/docker-compose-dataprep-vdms.yaml
@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3"
services:
vdms-vector-db:
image: intellabs/vdms:latest
container_name: vdms-vector-db
ports:
- "55555:55555"
dataprep-vdms:
image: opea/dataprep-vdms:latest
container_name: dataprep-vdms-server
ports:
- "6007:6007"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
VDMS_HOST: ${VDMS_HOST}
VDMS_PORT: ${VDMS_PORT}
COLLECTION_NAME: ${COLLECTION_NAME}
restart: unless-stopped

networks:
default:
driver: bridge