MultimodalQnA Image and Audio Support Phase 1 (#1071)
Signed-off-by: Melanie Buehler <[email protected]>
Signed-off-by: okhleif-IL <[email protected]>
Signed-off-by: dmsuehir <[email protected]>
Co-authored-by: Omar Khleif <[email protected]>
Co-authored-by: dmsuehir <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
5 people authored Nov 8, 2024
1 parent dd9623d commit bbc95bb
Showing 15 changed files with 471 additions and 155 deletions.
12 changes: 7 additions & 5 deletions MultimodalQnA/README.md
@@ -2,7 +2,7 @@

Suppose you possess a set of videos and wish to perform question-answering to extract insights from these videos. To respond to your questions, it typically necessitates comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both these visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose.

`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the video ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts (transcribed or captioned as text), and those embeddings are then stored in a vector database. To answer a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
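Once deployed, the end-to-end flow can be exercised with a single request to the MegaService gateway. A minimal sketch, assuming the default backend endpoint (`http://${host_ip}:8888/v1/multimodalqna` from the deployment guides below) and the simple string form of the `messages` payload, which is an assumption of the simplest request shape:

```bash
# Ask a question against the ingested multimodal knowledge base
curl http://${host_ip}:8888/v1/multimodalqna \
    -H "Content-Type: application/json" \
    -d '{"messages": "What is shown in the sample video?"}'
```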

The MultimodalQnA architecture is shown below:

@@ -100,10 +100,12 @@ In the below, we provide a table that describes for each microservice component

By default, the embedding and LVM models are set to the values listed below:

| Service              | Model                                       |
| -------------------- | ------------------------------------------- |
| embedding-multimodal | BridgeTower/bridgetower-large-itm-mlm-gaudi |
| LVM                  | llava-hf/llava-v1.6-vicuna-13b-hf           |

| Service              | HW    | Model                                     |
| -------------------- | ----- | ----------------------------------------- |
| embedding-multimodal | Xeon  | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM                  | Xeon  | llava-hf/llava-1.5-7b-hf                  |
| embedding-multimodal | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM                  | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf         |

You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed.
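Because the compose files read the model IDs from environment variables, switching models is a one-line override before deployment. A minimal sketch, assuming the `LVM_MODEL_ID` variable used by the Xeon `set_env.sh` shown later in this commit:

```bash
# Override the default Xeon LVM with the larger 13B LLaVA variant before docker compose up
export LVM_MODEL_ID="llava-hf/llava-1.5-13b-hf"
```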

50 changes: 37 additions & 13 deletions MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
@@ -84,16 +84,18 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```

Note: Please replace `host_ip` with your external IP address; do not use localhost.
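One way to set it on most Linux hosts (an assumption about your network setup; pick an interface that other machines can reach):

```bash
# Use the first address reported by hostname -I as the externally reachable IP
export host_ip=$(hostname -I | awk '{print $1}')
```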
@@ -274,54 +276,76 @@ curl http://${host_ip}:9399/v1/lvm \

6. dataprep-multimodal-redis

Download a sample video
Download a sample video, image, and audio file, and create a caption

```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}

export image_fn="apple.png"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}

export caption_fn="apple.txt"
echo "This is an apple." > ${caption_fn}

export audio_fn="AudioSample.wav"
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
```

Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
Test the dataprep microservice by generating a transcript. This command updates the knowledge base by uploading a local video (.mp4) and an audio (.wav) file.

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST \
-F "files=@./${video_fn}" \
-F "files=@./${audio_fn}"
```

Also, test dataprep microservice with generating caption using lvm microservice
Also, test the dataprep microservice by generating an image caption using the lvm microservice

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${image_fn}"
```

Now, test the microservice by posting a custom caption along with an image

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_INGEST_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```

Also, you are able to get the list of all videos that you uploaded:
You can also get the list of all files that you uploaded:

```bash
curl -X POST \
-H "Content-Type: application/json" \
${DATAPREP_GET_FILE_ENDPOINT}
```

Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
Then you will get a response like the Python-style list below. Notice that each uploaded file name, e.g. `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID for each uploaded file. The same file uploaded twice will receive different `uuid` values.

```bash
[
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
"WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
"apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
"AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```
To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
To delete all uploaded files, along with the data indexed with `$INDEX_NAME` in Redis, run:
```bash
curl -X POST \
-H "Content-Type: application/json" \
${DATAPREP_DELETE_FILE_ENDPOINT}
```
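To verify the deletion, list the files again using the endpoint above; the expected response is an empty list:

```bash
# Should return [] once all uploaded files and their indexed data are removed
curl -X POST \
    -H "Content-Type: application/json" \
    ${DATAPREP_GET_FILE_ENDPOINT}
```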
7. MegaService
3 changes: 3 additions & 0 deletions MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml
@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
entrypoint: ["python", "bridgetower_server.py", "--device", "cpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -76,6 +77,7 @@ services:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
entrypoint: ["python", "llava_server.py", "--device", "cpu", "--model_name_or_path", $LVM_MODEL_ID]
restart: unless-stopped
lvm-llava-svc:
image: ${REGISTRY:-opea}/lvm-llava-svc:${TAG:-latest}
@@ -125,6 +127,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host
6 changes: 4 additions & 2 deletions MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh
@@ -15,13 +15,15 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
51 changes: 36 additions & 15 deletions MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md
@@ -40,10 +40,11 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```

Note: Please replace `host_ip` with your external IP address; do not use localhost.
@@ -224,56 +225,76 @@ curl http://${host_ip}:9399/v1/lvm \

6. Multimodal Dataprep Microservice

Download a sample video
Download a sample video, image, and audio file, and create a caption

```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
```

Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.

```bash
export image_fn="apple.png"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}

export caption_fn="apple.txt"
echo "This is an apple." > ${caption_fn}

export audio_fn="AudioSample.wav"
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
```

Test dataprep microservice with generating transcript using whisper model
Test the dataprep microservice by generating a transcript. This command updates the knowledge base by uploading a local video (.mp4) and an audio (.wav) file.

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST \
-F "files=@./${video_fn}" \
-F "files=@./${audio_fn}"
```

Also, test dataprep microservice with generating caption using lvm-tgi
Also, test the dataprep microservice by generating an image caption using lvm-tgi

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${image_fn}"
```

Now, test the microservice by posting a custom caption along with an image

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_INGEST_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```

Also, you are able to get the list of all videos that you uploaded:
You can also get the list of all files that you uploaded:

```bash
curl -X POST \
-H "Content-Type: application/json" \
${DATAPREP_GET_FILE_ENDPOINT}
```

Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
Then you will get a response like the Python-style list below. Notice that each uploaded file name, e.g. `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID for each uploaded file. The same file uploaded twice will receive different `uuid` values.

```bash
[
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
"WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
"apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
"AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```
To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
To delete all uploaded files, along with the data indexed with `$INDEX_NAME` in Redis, run:
```bash
curl -X POST \
-H "Content-Type: application/json" \
${DATAPREP_DELETE_FILE_ENDPOINT}
```
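As in the Xeon guide, listing the files again afterward should return an empty list:

```bash
# Expected response after delete_files: []
curl -X POST \
    -H "Content-Type: application/json" \
    ${DATAPREP_GET_FILE_ENDPOINT}
```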
7. MegaService
2 changes: 2 additions & 0 deletions MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml
@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
entrypoint: ["python", "bridgetower_server.py", "--device", "hpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -139,6 +140,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host
5 changes: 3 additions & 2 deletions MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh
@@ -22,7 +22,8 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"