Update Dataprep with Parameter Settings (opea-project#351)
* update dataprep with parameter settings

Signed-off-by: letonghan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update port

Signed-off-by: letonghan <[email protected]>

---------

Signed-off-by: letonghan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: sharanshirodkar7 <[email protected]>
2 people authored and sharanshirodkar7 committed Aug 7, 2024
1 parent 5456dd3 commit fd7edd3
Showing 3 changed files with 49 additions and 1 deletion.
30 changes: 30 additions & 0 deletions comps/dataprep/redis/README.md
@@ -44,6 +44,32 @@ export LANGCHAIN_PROJECT="opea/gen-ai-comps:dataprep"
export PYTHONPATH=${path_to_comps}
```

## 1.4 Start Embedding Service

First, you need to start a TEI service.

```bash
your_port=6006
model="BAAI/bge-large-en-v1.5"
revision="refs/pr/5"
docker run -p $your_port:80 -v ./data:/data --name tei_server -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model --revision $revision
```

Then test your TEI service with the following command:

```bash
curl localhost:$your_port/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json'
```
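
The same check can be made from Python. The sketch below uses only the standard library and mirrors the curl request above. The helper names `build_embed_request` and `embed` are hypothetical and are not part of this commit, and the sketch assumes the `/embed` route answers with a JSON list of embedding vectors.

```python
import json
import urllib.request


def build_embed_request(endpoint: str, text: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(endpoint: str, text: str, timeout: float = 30.0):
    """Send the request and return the decoded JSON body (needs a running TEI service)."""
    request = build_embed_request(endpoint, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container from the previous step running, `embed("http://localhost:6006", "What is Deep Learning?")` should return the embeddings. A connection error usually means the service is still downloading the model or the port mapping differs.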

After checking that it works, set up environment variables.

```bash
export TEI_ENDPOINT="http://localhost:$your_port"
```

## 1.4 Start Document Preparation Microservice for Redis with Python Script

Start the document preparation microservice for Redis with the command below.
@@ -69,6 +95,10 @@ Please refer to this [readme](../../vectorstores/langchain/redis/README.md).
## 2.2 Setup Environment Variables

```bash
export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export TEI_ENDPOINT="http://${your_ip}:6006"
export REDIS_HOST=${your_ip}
export REDIS_PORT=6379
export REDIS_URL="redis://${your_ip}:6379"
export INDEX_NAME=${your_index_name}
export LANGCHAIN_TRACING_V2=true
@@ -9,6 +9,19 @@ services:
    ports:
      - "6379:6379"
      - "8001:8001"
  tei-embedding-service:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
    container_name: tei-embedding-server
    ports:
      - "6006:80"
    volumes:
      - "./data:/data"
    shm_size: 1g
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
    command: --model-id ${EMBEDDING_MODEL_ID} --auto-truncate
  dataprep-redis:
    image: opea/dataprep-redis:latest
    container_name: dataprep-redis-server
@@ -21,6 +34,8 @@ services:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      REDIS_HOST: ${REDIS_HOST}
      REDIS_PORT: ${REDIS_PORT}
      REDIS_URL: ${REDIS_URL}
      INDEX_NAME: ${INDEX_NAME}
      TEI_ENDPOINT: ${TEI_ENDPOINT}
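
The environment block above passes `REDIS_HOST`, `REDIS_PORT`, `REDIS_URL`, `INDEX_NAME`, and `TEI_ENDPOINT` into the container. As a rough illustration of how a service might read them, the sketch below loads the values and derives a URL from host and port when `REDIS_URL` is unset. The `DataprepSettings` class, the `load_settings` helper, and every default value are assumptions made for this example. They are not taken from the dataprep code in this commit.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DataprepSettings:
    redis_host: str
    redis_port: int
    redis_url: str
    index_name: str
    tei_endpoint: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> DataprepSettings:
    """Read the variables named in the compose file; all defaults are hypothetical."""
    env = os.environ if env is None else env
    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", "6379"))
    # An explicit REDIS_URL wins; otherwise build one from host and port.
    url = env.get("REDIS_URL") or f"redis://{host}:{port}"
    return DataprepSettings(
        redis_host=host,
        redis_port=port,
        redis_url=url,
        index_name=env.get("INDEX_NAME", "example-index"),
        tei_endpoint=env.get("TEI_ENDPOINT", ""),
    )
```

Passing a plain dictionary instead of `os.environ` makes the fallback behavior easy to check without touching the real environment.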
5 changes: 4 additions & 1 deletion comps/dataprep/redis/langchain/prepare_doc_redis.py
@@ -118,7 +118,10 @@ def ingest_data_to_redis(doc_path: DocPath):
         text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
     else:
         text_splitter = RecursiveCharacterTextSplitter(
-            chunk_size=doc_path.chunk_size, chunk_overlap=100, add_start_index=True, separators=get_separators()
+            chunk_size=doc_path.chunk_size,
+            chunk_overlap=doc_path.chunk_overlap,
+            add_start_index=True,
+            separators=get_separators(),
         )

     content = document_loader(path)
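
This change makes the overlap between chunks configurable instead of fixed at 100. To show what the two parameters control, here is a simplified sketch with a fixed-width character window. It is not `RecursiveCharacterTextSplitter`, which also splits on separators, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context across neighboring chunks at the cost of producing more chunks for the same input.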