Refine Dataprep Milvus MS #570

Merged
merged 12 commits into from
Aug 29, 2024
139 changes: 126 additions & 13 deletions comps/dataprep/milvus/README.md
@@ -1,8 +1,8 @@
# Dataprep Microservice with Milvus

## 🚀Start Microservice with Python
## 🚀1. Start Microservice with Python (Option 1)

### Install Requirements
### 1.1 Requirements

```bash
pip install -r requirements.txt
@@ -11,11 +11,11 @@ apt-get install libtesseract-dev -y
apt-get install poppler-utils -y
```

### Start Milvus Server
### 1.2 Start Milvus Server

Please refer to this [readme](../../../vectorstores/langchain/milvus/README.md).

### Setup Environment Variables
### 1.3 Setup Environment Variables

```bash
export no_proxy=${your_no_proxy}
@@ -27,30 +27,76 @@ export COLLECTION_NAME=${your_collection_name}
export MOSEC_EMBEDDING_ENDPOINT=${your_embedding_endpoint}
```
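Before starting the service, it can help to confirm that the variables it depends on are actually set. The sketch below is a hypothetical pre-flight check, not part of the microservice. The helper name `missing_env` and the choice of required variables are assumptions: only `COLLECTION_NAME` and `MOSEC_EMBEDDING_ENDPOINT` are taken from the export list above, so extend the tuple to match your own setup.

```python
import os

# Variable names taken from the export list above; this selection is an assumption.
REQUIRED_VARS = ("COLLECTION_NAME", "MOSEC_EMBEDDING_ENDPOINT")


def missing_env(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `missing_env()` with no arguments inspects the current process environment and returns an empty list when everything is set.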

### Start Document Preparation Microservice for Milvus with Python Script
### 1.4 Start Mosec Embedding Service

First, you need to build a mosec embedding serving docker image.

```bash
cd ../../..
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -t opea/embedding-mosec-endpoint:latest -f comps/embeddings/langchain-mosec/mosec-docker/Dockerfile .
```

Then start the mosec embedding server.

```bash
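# Host port for the embedding server. 6010 is avoided because the dataprep service is reached on 6010 in the commands below.
your_port=6009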
docker run -d --name="embedding-mosec-endpoint" -p $your_port:8000 opea/embedding-mosec-endpoint:latest
```

Setup environment variables:

```bash
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
export MILVUS=${your_host_ip}
```

### 1.5 Start Document Preparation Microservice for Milvus with Python Script

Start the document preparation microservice for Milvus with the command below.

```bash
python prepare_doc_milvus.py
```

## 🚀Start Microservice with Docker
## 🚀2. Start Microservice with Docker (Option 2)

### 2.1 Start Milvus Server

Please refer to this [readme](../../../vectorstores/langchain/milvus/README.md).

### Build Docker Image
### 2.2 Build Docker Image

```bash
cd ../../../../
cd ../../..
# build mosec embedding docker image
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -t opea/embedding-mosec-endpoint:latest -f comps/embeddings/langchain-mosec/mosec-docker/Dockerfile .
# build dataprep milvus docker image
docker build -t opea/dataprep-milvus:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy --build-arg no_proxy=$no_proxy -f comps/dataprep/milvus/docker/Dockerfile .
```

### Run Docker with CLI
### 2.3 Setup Environment Variables

```bash
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
export MILVUS=${your_host_ip}
```

### 2.4 Run Docker with CLI (Option A)

```bash
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS=${MILVUS} opea/dataprep-milvus:latest
```

### 2.5 Run with Docker Compose (Option B)

```bash
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${your_embedding_endpoint} -e MILVUS=${your_milvus_host_ip} opea/dataprep-milvus:latest
cd docker
docker compose -f docker-compose-dataprep-milvus.yaml up -d
```

## Invoke Microservice
## 🚀3. Consume Microservice

### 3.1 Consume Upload API

Once the document preparation microservice for Milvus has started, use the command below to invoke it. The service converts the document to embeddings and saves them to the database.

@@ -65,13 +111,13 @@ curl -X POST \
http://localhost:6010/v1/dataprep
```
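For callers who prefer Python over curl, the same upload can be sketched with the standard library alone. This is an illustrative sketch, not the project's client. The helper names `encode_multipart` and `build_upload_request` are hypothetical, the form field name `files` follows the curl examples in this README, and any extra keyword arguments are simply sent as additional form fields.

```python
import urllib.request
import uuid


def encode_multipart(fields, files):
    """Encode text fields and in-memory files as a multipart/form-data body.

    fields: dict of field name -> value
    files:  dict of field name -> (filename, bytes)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    for name, (filename, content) in files.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
            b"Content-Type: application/octet-stream",
            b"",
            content,
        ]
    # Closing boundary, followed by a final CRLF.
    parts += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(url, filename, content, **fields):
    """Build (but do not send) a POST request that uploads one file."""
    body, content_type = encode_multipart(fields, {"files": (filename, content)})
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
```

To send it, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_upload_request("http://localhost:6010/v1/dataprep", "file.pdf", data))`, where `data` holds the file's bytes.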

You can specify chunk_size and chunk_size by the following commands.
You can specify `chunk_size` and `chunk_overlap` with the following command. To avoid overly large chunks, pass a smaller `chunk_size` such as 500, as below (the default is 1500).

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file.pdf" \
-F "chunk_size=1500" \
-F "chunk_size=500" \
-F "chunk_overlap=100" \
http://localhost:6010/v1/dataprep
```
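To see how `chunk_size` and `chunk_overlap` interact, the following sketch splits text with a plain character-based sliding window. It only illustrates the two parameters. The splitter the service actually uses is not shown in this README and may behave differently, for example by preferring sentence or paragraph boundaries. The function name `split_text` is hypothetical.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

In this sketch, `chunk_size=500` with `chunk_overlap=100` turns a 1200-character text into three chunks, each starting 400 characters after the previous one.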
@@ -132,3 +178,70 @@ Note: If you specify "table_strategy=llm", You should first start TGI Service, p
```bash
curl -X POST -H "Content-Type: application/json" -d '{"path":"/home/user/doc/your_document_name","process_table":true,"table_strategy":"hq"}' http://localhost:6010/v1/dataprep
```

### 3.2 Consume get_file API

To get the structure of the uploaded files, use the following command:

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6010/v1/dataprep/get_file
```

The response is JSON like this:

```json
[
  {
    "name": "uploaded_file_1.txt",
    "id": "uploaded_file_1.txt",
    "type": "File",
    "parent": ""
  },
  {
    "name": "uploaded_file_2.txt",
    "id": "uploaded_file_2.txt",
    "type": "File",
    "parent": ""
  }
]
```
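A client will usually want just the `id` values from this list. A minimal sketch, assuming the response has exactly the shape shown above; the helper name `file_ids` is hypothetical.

```python
import json


def file_ids(response_text):
    """Return the `id` of every entry in a get_file response body."""
    entries = json.loads(response_text)
    return [entry["id"] for entry in entries]


# Sample body copied from the response shown above.
sample = """
[
  {"name": "uploaded_file_1.txt", "id": "uploaded_file_1.txt", "type": "File", "parent": ""},
  {"name": "uploaded_file_2.txt", "id": "uploaded_file_2.txt", "type": "File", "parent": ""}
]
"""
print(file_ids(sample))
```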

### 3.3 Consume delete_file API

To delete an uploaded file or link, use one of the following commands.

The `file_path` here should be the `id` returned by the `/v1/dataprep/get_file` API.

```bash
# delete link
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "https://www.ces.tech/.txt"}' \
http://localhost:6010/v1/dataprep/delete_file

# delete file
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "uploaded_file_1.txt"}' \
http://localhost:6010/v1/dataprep/delete_file

# delete all files and links, will drop the entire db collection
curl -X POST \
-H "Content-Type: application/json" \
-d '{"file_path": "all"}' \
http://localhost:6010/v1/dataprep/delete_file
```
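The same delete call can be sketched in Python with the standard library. This is an illustration under stated assumptions rather than a project-provided client: the helper name `build_delete_request` is hypothetical, and `base_url` must point at wherever your dataprep service is listening.

```python
import json
import urllib.request


def build_delete_request(base_url, file_path):
    """Build (but do not send) the POST request for /v1/dataprep/delete_file."""
    payload = json.dumps({"file_path": file_path}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/dataprep/delete_file",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Pass the result to `urllib.request.urlopen` to send it. As with the curl commands, a `file_path` of `"all"` drops the entire collection, so building the request separately from sending it leaves room for a confirmation step.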

## 🚀4. Troubleshooting

1. If the Mosec embedding endpoint returns errors such as `cannot find this task, maybe it has expired` while you upload files, try reducing `chunk_size` in the curl command, as below (the default is `chunk_size=1500`).

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file.pdf" \
-F "chunk_size=500" \
http://localhost:6010/v1/dataprep
```
@@ -0,0 +1,93 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3"
services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.4.6
    command: ["milvus", "run", "standalone"]
    security_opt:
      - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/milvus.yaml:/milvus/configs/milvus.yaml
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"

  mosec-embedding:
    image: opea/embedding-mosec-endpoint:latest
    container_name: embedding-mosec-server
    ports:
      - "6009:8000"
    ipc: host
    environment:
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
    restart: unless-stopped

  dataprep-milvus:
    image: opea/dataprep-milvus:latest
    container_name: dataprep-milvus-server
    ports:
      - "6010:6010"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      MOSEC_EMBEDDING_ENDPOINT: ${MOSEC_EMBEDDING_ENDPOINT}
      MILVUS: ${MILVUS}
    restart: unless-stopped

networks:
  default:
    driver: bridge