
add: Pathway vector store and retriever as LangChain component #342

Merged · 36 commits · Aug 29, 2024
Commits
c126649
nb
berkecanrizai Jul 23, 2024
2c6fd04
init changes
berkecanrizai Jul 23, 2024
1eff95a
docker
berkecanrizai Jul 23, 2024
ac192a6
example data
berkecanrizai Jul 23, 2024
0d893f8
docs(readme): update, add commands
berkecanrizai Jul 23, 2024
b7cd748
fix: formatting, data sources
berkecanrizai Jul 24, 2024
0dbfeca
docs(readme): update instructions, add comments
berkecanrizai Jul 24, 2024
652b007
fix: rm unused parts
berkecanrizai Jul 24, 2024
98657f3
fix: image name, compose env vars
berkecanrizai Jul 24, 2024
7c729c3
fix: rm unused part
berkecanrizai Jul 24, 2024
9f1b879
fix: logging name
berkecanrizai Jul 24, 2024
6a0e849
fix: env var
berkecanrizai Jul 24, 2024
69d4cdc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 24, 2024
e36e7dc
fix: rename pw docker
berkecanrizai Jul 29, 2024
7127eb1
docs(readme): update input sources
berkecanrizai Jul 29, 2024
bbb59d2
nb
berkecanrizai Jul 23, 2024
0071d37
init changes
berkecanrizai Jul 23, 2024
9d8ad10
fix: formatting, data sources
berkecanrizai Jul 24, 2024
3aae9b1
docs(readme): update instructions, add comments
berkecanrizai Jul 24, 2024
28bf03b
fix: rm unused part
berkecanrizai Jul 24, 2024
732d4ee
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 24, 2024
7788a3e
fix: rename pw docker
berkecanrizai Jul 29, 2024
0c29407
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 29, 2024
5eb14d2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 29, 2024
b3d8ea4
Merge branch 'main' into main
Spycsh Aug 20, 2024
4aa4ffc
feat: mv vector store, naming, clarify instructions, improve ingestio…
berkecanrizai Aug 22, 2024
4d2bf40
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 22, 2024
f7555c5
tests: add pw retriever test
berkecanrizai Aug 22, 2024
0f2b88e
Merge branch 'main' of https://github.com/berkecanrizai/GenAIComps
berkecanrizai Aug 22, 2024
f19e3d5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 22, 2024
8ec79ee
implement suggestions from review, entrypoint, reqs, comments, https_…
berkecanrizai Aug 23, 2024
30c3eee
Merge branch 'main' of https://github.com/berkecanrizai/GenAIComps
berkecanrizai Aug 23, 2024
32e1158
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 23, 2024
807848f
fix: update docker tags in test and readme
berkecanrizai Aug 28, 2024
6853a14
Merge branch 'main' of https://github.com/berkecanrizai/GenAIComps
berkecanrizai Aug 28, 2024
05f8b6a
tests: add separate pathway vectorstore test
berkecanrizai Aug 28, 2024
104 changes: 104 additions & 0 deletions comps/retrievers/langchain/pathway/README.md
@@ -0,0 +1,104 @@
# Retriever Microservice with Pathway

## 🚀Start Microservices

### With the Docker CLI

We suggest using `docker compose` to run this app; refer to the [`docker compose`](#with-the-docker-compose) section below.

If you prefer to run them separately, refer to this section.

#### (Optionally) Start the TEI (embedder) service separately

> Note: Docker Compose starts this service as well, so this step is optional.

```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/retriever"
model=BAAI/bge-base-en-v1.5
revision=refs/pr/4
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"

# then run:
docker run -p 6060:80 -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model --revision $revision
```

Health check the embedding service with:

```bash
curl 127.0.0.1:6060/embed -X POST -d '{"inputs":"What is Deep Learning?"}' -H 'Content-Type: application/json'
```

If the model supports re-ranking, you can also use:

```bash
curl 127.0.0.1:6060/rerank -X POST -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' -H 'Content-Type: application/json'
```
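The same requests can be built and sent from Python. A minimal sketch (the payload shapes mirror the curl examples above; the actual `requests` call is commented out because it assumes a service already listening on port 6060):

```python
import json


def build_embed_request(inputs: str) -> str:
    """JSON body for the TEI /embed endpoint, as in the curl example above."""
    return json.dumps({"inputs": inputs})


def build_rerank_request(query: str, texts: list[str]) -> str:
    """JSON body for the TEI /rerank endpoint."""
    return json.dumps({"query": query, "texts": texts})


embed_body = build_embed_request("What is Deep Learning?")
rerank_body = build_rerank_request(
    "What is Deep Learning?",
    ["Deep Learning is not...", "Deep learning is..."],
)

# To actually call the service (assumes it is listening on port 6060):
# import requests
# resp = requests.post("http://127.0.0.1:6060/embed", data=embed_body,
#                      headers={"Content-Type": "application/json"})
```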

#### Start Retriever Service

The retriever service queries the Pathway vector store on each incoming request.
Make sure the Pathway vector store is already running; see [the Pathway vector store README](../../../vectorstores/langchain/pathway/README.md).

The retriever service needs the Pathway host and port variables to connect to the vector DB. Set the Pathway vector store environment variables:

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
```

```bash
# make sure you are in the root folder of the repo
docker build -t opea/retriever-pathway:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/retrievers/langchain/pathway/docker/Dockerfile .

docker run -p 7000:7000 -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy --network="host" opea/retriever-pathway:latest
```

### With the Docker compose

First, set the env variables:

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/retriever"
model=BAAI/bge-base-en-v1.5
revision=refs/pr/4
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"
```

The text embeddings inference service expects the `RETRIEVE_MODEL_ID` variable to be set:

```bash
export RETRIEVE_MODEL_ID=BAAI/bge-base-en-v1.5
```

Note that the following Docker Compose file sets `network_mode: host` on the retriever container to allow it to connect to the local vector store. This will start both the embedding and retriever services:

```bash
cd comps/retrievers/langchain/pathway/docker

docker compose -f docker_compose_retriever.yaml build
docker compose -f docker_compose_retriever.yaml up

# shut down the containers
docker compose -f docker_compose_retriever.yaml down
```

Make sure the retriever service is working as expected:

```bash
curl http://0.0.0.0:7000/v1/health_check -X GET -H 'Content-Type: application/json'
```

Send an example query:

```bash
exm_embeddings=$(python -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")

curl http://0.0.0.0:7000/v1/retrieval -X POST -d "{\"text\":\"What is the revenue of Nike in 2023?\",\"embedding\":${exm_embeddings}}" -H 'Content-Type: application/json'
```
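The same query can be assembled in Python. This is a sketch that, like the shell one-liner above, uses a random placeholder embedding (in practice the embedding would come from the TEI service); the POST itself is commented out since it assumes a retriever listening on port 7000:

```python
import json
import random


def build_retrieval_request(text: str, dim: int = 768) -> dict:
    """Build the JSON body expected by the retriever's /v1/retrieval endpoint.

    A random embedding stands in for a real one, mirroring the shell
    snippet above; swap in the TEI /embed output in practice.
    """
    embedding = [random.uniform(-1, 1) for _ in range(dim)]
    return {"text": text, "embedding": embedding}


payload = build_retrieval_request("What is the revenue of Nike in 2023?")
body = json.dumps(payload)

# To send it (assumes the retriever is listening on port 7000):
# import requests
# resp = requests.post("http://0.0.0.0:7000/v1/retrieval", data=body,
#                      headers={"Content-Type": "application/json"})
# print(resp.json())
```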
30 changes: 30 additions & 0 deletions comps/retrievers/langchain/pathway/docker/Dockerfile
@@ -0,0 +1,30 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM langchain/langchain:latest

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev \
    vim

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

COPY comps /home/user/comps

USER user

RUN pip install --no-cache-dir --upgrade pip && \
    if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
    pip install --no-cache-dir -r /home/user/comps/retrievers/langchain/pathway/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/retrievers/langchain/pathway

ENTRYPOINT ["bash", "entrypoint.sh"]
@@ -0,0 +1,35 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
  tei_xeon_service:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
    container_name: tei-xeon-server
    ports:
      - "6060:80"
    volumes:
      - "./data:/data"
    shm_size: 1g
    command: --model-id ${RETRIEVE_MODEL_ID}
  retriever:
    image: opea/retriever-pathway:latest
    container_name: retriever-pathway-server
    ports:
      - "7000:7000"
    ipc: host
    network_mode: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      PATHWAY_HOST: ${PATHWAY_HOST}
      PATHWAY_PORT: ${PATHWAY_PORT}
      TEI_EMBEDDING_ENDPOINT: ${TEI_EMBEDDING_ENDPOINT}
      LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
    restart: unless-stopped

networks:
  default:
    driver: bridge
6 changes: 6 additions & 0 deletions comps/retrievers/langchain/pathway/entrypoint.sh
@@ -0,0 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python retriever_pathway.py
@@ -0,0 +1 @@
langsmith
12 changes: 12 additions & 0 deletions comps/retrievers/langchain/pathway/requirements.txt
@@ -0,0 +1,12 @@
docarray[full]
fastapi
frontend==0.0.3
huggingface_hub
langchain_community == 0.2.0
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
pathway
prometheus-fastapi-instrumentator
sentence_transformers
shortuuid
52 changes: 52 additions & 0 deletions comps/retrievers/langchain/pathway/retriever_pathway.py
@@ -0,0 +1,52 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os
import time

from langchain_community.vectorstores import PathwayVectorClient
from langsmith import traceable

from comps import (
    EmbedDoc,
    SearchedDoc,
    ServiceType,
    TextDoc,
    opea_microservices,
    register_microservice,
    register_statistics,
    statistics_dict,
)

host = os.getenv("PATHWAY_HOST", "127.0.0.1")
port = int(os.getenv("PATHWAY_PORT", 8666))

EMBED_MODEL = os.getenv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

tei_embedding_endpoint = os.getenv("TEI_EMBEDDING_ENDPOINT")

# Create the vector store client at module level so the `retrieve`
# handler can always reach it, even if the module is imported.
pw_client = PathwayVectorClient(host=host, port=port)


@register_microservice(
    name="opea_service@retriever_pathway",
    service_type=ServiceType.RETRIEVER,
    endpoint="/v1/retrieval",
    host="0.0.0.0",
    port=7000,
)
@traceable(run_type="retriever")
@register_statistics(names=["opea_service@retriever_pathway"])
def retrieve(input: EmbedDoc) -> SearchedDoc:
    start = time.time()
    documents = pw_client.similarity_search(input.text, input.fetch_k)

    docs = [TextDoc(text=r.page_content) for r in documents]

    time_spent = time.time() - start
    statistics_dict["opea_service@retriever_pathway"].append_latency(time_spent, None)  # noqa: E501
    return SearchedDoc(retrieved_docs=docs, initial_query=input.text)


if __name__ == "__main__":
    opea_microservices["opea_service@retriever_pathway"].start()
4 changes: 4 additions & 0 deletions comps/vectorstores/README.md
@@ -17,3 +17,7 @@ For details, please refer to this [readme](langchain/pgvector/README.md)
## Vectorstores Microservice with Pinecone

For details, please refer to this [readme](langchain/pinecone/README.md)

## Vectorstores Microservice with Pathway

For details, please refer to this [readme](langchain/pathway/README.md)
25 changes: 25 additions & 0 deletions comps/vectorstores/langchain/pathway/Dockerfile
@@ -0,0 +1,25 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM pathwaycom/pathway:0.13.2-slim

ENV DOCKER_BUILDKIT=1
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y \
    poppler-utils \
    libreoffice \
    libmagic-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt /app/

RUN pip install --no-cache-dir -r requirements.txt

COPY vectorstore_pathway.py /app/


CMD ["python", "vectorstore_pathway.py"]

84 changes: 84 additions & 0 deletions comps/vectorstores/langchain/pathway/README.md
@@ -0,0 +1,84 @@
# Start the Pathway Vector DB Server

Set the environment variables for Pathway, and the embedding model.

> Note: If you are using `TEI_EMBEDDING_ENDPOINT`, make sure the embedding service is already running.
> See the instructions in the [retriever README](../../../retrievers/langchain/pathway/README.md).

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # uncomment if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"
```

## Configuration

### Setting up the Pathway data sources

Pathway can listen to many sources simultaneously, such as local files, S3 folders, cloud storage, and any data stream. Whenever a new file is added or an existing file is modified, Pathway parses, chunks and indexes the documents in real-time.

See [pathway-io](https://pathway.com/developers/api-docs/pathway-io) for more information.

You can easily connect to the data inside the folder with the Pathway file system connector. The data will automatically be updated by Pathway whenever the content of the folder changes. In this example, we create a single data source that reads the files under the `./data` folder.

You can manage your data sources by configuring the `data_sources` in `vectorstore_pathway.py`.

```python
import pathway as pw

data = pw.io.fs.read(
    "./data",
    format="binary",
    mode="streaming",
    with_metadata=True,
)  # This creates a Pathway connector that tracks
# all the files in the ./data directory

data_sources = [data]
```

### Other configs (parser, splitter and the embedder)

The Pathway vector store handles the ingestion and processing of the documents, and lets you configure the parser, the splitter, and the embedder.
Whenever a file is added or modified in one of the sources, Pathway automatically re-ingests it.

By default, the `ParseUnstructured` parser, the `langchain.text_splitter.CharacterTextSplitter` splitter, and the `BAAI/bge-base-en-v1.5` embedder are used.

For more information, see the relevant Pathway docs:

- [Vector store docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/vectorstore)
- [parsers docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/parsers)
- [splitters docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/splitters)
- [embedders docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/embedders)
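As a rough illustration, a Pathway server with non-default LangChain components might be wired up along these lines. This is a hedged sketch, not the shipped configuration: the exact `VectorStoreServer.from_langchain_components` signature and the component choices (`HuggingFaceEmbeddings`, `CharacterTextSplitter`) should be checked against the Pathway and LangChain docs linked above.

```python
import pathway as pw
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from pathway.xpacks.llm.vector_store import VectorStoreServer

# Same file-system source as in the snippet above.
data = pw.io.fs.read(
    "./data",
    format="binary",
    mode="streaming",
    with_metadata=True,
)

# Swap in your own embedder/splitter here.
embedder = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
splitter = CharacterTextSplitter()

server = VectorStoreServer.from_langchain_components(
    data,
    embedder=embedder,
    splitter=splitter,
)
server.run_server(host="0.0.0.0", port=8666)
```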

## Building and running

Build the Docker image and run the Pathway vector store:

```bash
cd comps/vectorstores/langchain/pathway

docker build --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -t opea/vectorstore-pathway:latest .

# with a locally loaded model; you may set the `EMBED_MODEL` env variable to configure the model
docker run -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy -v ./data:/app/data -p ${PATHWAY_PORT}:${PATHWAY_PORT} opea/vectorstore-pathway:latest

# with the hosted embedder (the network argument is needed for the vector server to reach the embedding service)
docker run -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e TEI_EMBEDDING_ENDPOINT=${TEI_EMBEDDING_ENDPOINT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy -v ./data:/app/data -p ${PATHWAY_PORT}:${PATHWAY_PORT} --network="host" opea/vectorstore-pathway:latest
```

## Health check the vector store

Wait until the server finishes indexing the docs, then send the following request to check it:

```bash
curl -X 'POST' \
"http://$PATHWAY_HOST:$PATHWAY_PORT/v1/statistics" \
-H 'accept: */*' \
-H 'Content-Type: application/json'
```

This should respond with something like:

> `{"file_count": 1, "last_indexed": 1724325093, "last_modified": 1724317365}`
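A small helper can parse that response, e.g. to print a readable summary after the health check. This is a sketch; the field names are taken from the example response above:

```python
import json


def summarize_stats(raw: str) -> str:
    """Render the /v1/statistics JSON response as a one-line summary."""
    stats = json.loads(raw)
    return f"{stats['file_count']} file(s) indexed; last indexed at {stats['last_indexed']}"


example = '{"file_count": 1, "last_indexed": 1724325093, "last_modified": 1724317365}'
print(summarize_stats(example))  # 1 file(s) indexed; last indexed at 1724325093
```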
4 changes: 4 additions & 0 deletions comps/vectorstores/langchain/pathway/requirements.txt
@@ -0,0 +1,4 @@
langchain_openai
pathway[xpack-llm] >= 0.14.1
sentence_transformers
unstructured[all-docs] >= 0.10.28,<0.15