From bbc95bb70881309efe9feae83205cc04facc21bd Mon Sep 17 00:00:00 2001 From: Melanie Hart Buehler Date: Thu, 7 Nov 2024 23:54:49 -0800 Subject: [PATCH] MultimodalQnA Image and Audio Support Phase 1 (#1071) Signed-off-by: Melanie Buehler Signed-off-by: okhleif-IL Signed-off-by: dmsuehir Co-authored-by: Omar Khleif Co-authored-by: dmsuehir Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com> --- MultimodalQnA/README.md | 12 +- .../docker_compose/intel/cpu/xeon/README.md | 50 ++- .../intel/cpu/xeon/compose.yaml | 3 + .../docker_compose/intel/cpu/xeon/set_env.sh | 6 +- .../docker_compose/intel/hpu/gaudi/README.md | 51 ++- .../intel/hpu/gaudi/compose.yaml | 2 + .../docker_compose/intel/hpu/gaudi/set_env.sh | 5 +- MultimodalQnA/tests/test_compose_on_gaudi.sh | 69 +++- MultimodalQnA/tests/test_compose_on_xeon.sh | 74 +++-- MultimodalQnA/ui/gradio/conversation.py | 5 + .../ui/gradio/multimodalqna_ui_gradio.py | 297 ++++++++++++++---- MultimodalQnA/ui/gradio/utils.py | 13 + README.md | 25 +- docker_images_list.md | 4 +- supported_examples.md | 10 +- 15 files changed, 471 insertions(+), 155 deletions(-) diff --git a/MultimodalQnA/README.md b/MultimodalQnA/README.md index 95626aa78..08de5686a 100644 --- a/MultimodalQnA/README.md +++ b/MultimodalQnA/README.md @@ -2,7 +2,7 @@ Suppose you possess a set of videos and wish to perform question-answering to extract insights from these videos. To respond to your questions, it typically necessitates comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both these visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose. -`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. 
For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the video ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user. +`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user. 
The MultimodalQnA architecture shows below: @@ -100,10 +100,12 @@ In the below, we provide a table that describes for each microservice component By default, the embedding and LVM models are set to a default value as listed below: -| Service | Model | -| -------------------- | ------------------------------------------- | -| embedding-multimodal | BridgeTower/bridgetower-large-itm-mlm-gaudi | -| LVM | llava-hf/llava-v1.6-vicuna-13b-hf | +| Service | HW | Model | +| -------------------- | ----- | ----------------------------------------- | +| embedding-multimodal | Xeon | BridgeTower/bridgetower-large-itm-mlm-itc | +| LVM | Xeon | llava-hf/llava-1.5-7b-hf | +| embedding-multimodal | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc | +| LVM | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf | You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf ` and `llava-hf/llava-1.5-13b-hf`, as needed. diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md b/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md index 9b3a3edaa..d0a1c7d27 100644 --- a/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md +++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md @@ -84,16 +84,18 @@ export INDEX_NAME="mm-rag-redis" export LLAVA_SERVER_PORT=8399 export LVM_ENDPOINT="http://${host_ip}:8399" export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc" +export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf" export WHISPER_MODEL="base" export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip} export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip} export LVM_SERVICE_HOST_IP=${host_ip} export MEGA_SERVICE_HOST_IP=${host_ip} export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna" +export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text" export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts" export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions" -export 
DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos" -export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos" +export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files" +export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files" ``` Note: Please replace with `host_ip` with you external IP address, do not use localhost. @@ -274,54 +276,76 @@ curl http://${host_ip}:9399/v1/lvm \ 6. dataprep-multimodal-redis -Download a sample video +Download a sample video, image, and audio file and create a caption ```bash export video_fn="WeAreGoingOnBullrun.mp4" wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn} + +export image_fn="apple.png" +wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn} + +export caption_fn="apple.txt" +echo "This is an apple." > ${caption_fn} + +export audio_fn="AudioSample.wav" +wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn} ``` -Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4. +Test dataprep microservice with generating transcript. This command updates a knowledge base by uploading a local video .mp4 and an audio .wav file. 
```bash curl --silent --write-out "HTTPSTATUS:%{http_code}" \ ${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \ -H 'Content-Type: multipart/form-data' \ - -X POST -F "files=@./${video_fn}" + -X POST \ + -F "files=@./${video_fn}" \ + -F "files=@./${audio_fn}" ``` -Also, test dataprep microservice with generating caption using lvm microservice +Also, test dataprep microservice with generating an image caption using lvm microservice ```bash curl --silent --write-out "HTTPSTATUS:%{http_code}" \ ${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \ -H 'Content-Type: multipart/form-data' \ - -X POST -F "files=@./${video_fn}" + -X POST -F "files=@./${image_fn}" +``` + +Now, test the microservice with posting a custom caption along with an image + +```bash +curl --silent --write-out "HTTPSTATUS:%{http_code}" \ + ${DATAPREP_INGEST_SERVICE_ENDPOINT} \ + -H 'Content-Type: multipart/form-data' \ + -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}" ``` -Also, you are able to get the list of all videos that you uploaded: +Also, you are able to get the list of all files that you uploaded: ```bash curl -X POST \ -H "Content-Type: application/json" \ - ${DATAPREP_GET_VIDEO_ENDPOINT} + ${DATAPREP_GET_FILE_ENDPOINT} ``` -Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`. +Then you will get the response python-style LIST like this. Notice the name of each uploaded file e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded file. The same files that are uploaded twice will have different `uuid`. 
```bash [ "WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4", - "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4" + "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4", + "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png", + "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav" ] ``` -To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS. +To delete all uploaded files along with data indexed with `$INDEX_NAME` in REDIS. ```bash curl -X POST \ -H "Content-Type: application/json" \ - ${DATAPREP_DELETE_VIDEO_ENDPOINT} + ${DATAPREP_DELETE_FILE_ENDPOINT} ``` 7. MegaService diff --git a/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml b/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml index d9bf3bce9..eece99da8 100644 --- a/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml +++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/compose.yaml @@ -36,6 +36,7 @@ services: http_proxy: ${http_proxy} https_proxy: ${https_proxy} PORT: ${EMBEDDER_PORT} + entrypoint: ["python", "bridgetower_server.py", "--device", "cpu", "--model_name_or_path", $EMBEDDING_MODEL_ID] restart: unless-stopped embedding-multimodal: image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest} @@ -76,6 +77,7 @@ services: no_proxy: ${no_proxy} http_proxy: ${http_proxy} https_proxy: ${https_proxy} + entrypoint: ["python", "llava_server.py", "--device", "cpu", "--model_name_or_path", $LVM_MODEL_ID] restart: unless-stopped lvm-llava-svc: image: ${REGISTRY:-opea}/lvm-llava-svc:${TAG:-latest} @@ -125,6 +127,7 @@ services: - https_proxy=${https_proxy} - http_proxy=${http_proxy} - BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT} + - DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT} - DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} - DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} ipc: host diff --git 
a/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh b/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh index ca5e650ff..d8824fb0b 100755 --- a/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh +++ b/MultimodalQnA/docker_compose/intel/cpu/xeon/set_env.sh @@ -15,13 +15,15 @@ export INDEX_NAME="mm-rag-redis" export LLAVA_SERVER_PORT=8399 export LVM_ENDPOINT="http://${host_ip}:8399" export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc" +export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf" export WHISPER_MODEL="base" export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip} export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip} export LVM_SERVICE_HOST_IP=${host_ip} export MEGA_SERVICE_HOST_IP=${host_ip} export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna" +export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text" export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts" export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions" -export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos" -export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos" +export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files" +export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files" diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md b/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md index 6517b100c..6d6ca88ff 100644 --- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md +++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md @@ -40,10 +40,11 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip} export LVM_SERVICE_HOST_IP=${host_ip} export MEGA_SERVICE_HOST_IP=${host_ip} export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna" +export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text" export 
DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts" export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions" -export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos" -export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos" +export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files" +export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files" ``` Note: Please replace with `host_ip` with you external IP address, do not use localhost. @@ -224,56 +225,76 @@ curl http://${host_ip}:9399/v1/lvm \ 6. Multimodal Dataprep Microservice -Download a sample video +Download a sample video, image, and audio file and create a caption ```bash export video_fn="WeAreGoingOnBullrun.mp4" wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn} -``` -Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4. +export image_fn="apple.png" +wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn} + +export caption_fn="apple.txt" +echo "This is an apple." > ${caption_fn} + +export audio_fn="AudioSample.wav" +wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn} +``` -Test dataprep microservice with generating transcript using whisper model +Test dataprep microservice with generating transcript. This command updates a knowledge base by uploading a local video .mp4 and an audio .wav file. 
```bash curl --silent --write-out "HTTPSTATUS:%{http_code}" \ ${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \ -H 'Content-Type: multipart/form-data' \ - -X POST -F "files=@./${video_fn}" + -X POST \ + -F "files=@./${video_fn}" \ + -F "files=@./${audio_fn}" ``` -Also, test dataprep microservice with generating caption using lvm-tgi +Also, test dataprep microservice with generating an image caption using lvm-tgi ```bash curl --silent --write-out "HTTPSTATUS:%{http_code}" \ ${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \ -H 'Content-Type: multipart/form-data' \ - -X POST -F "files=@./${video_fn}" + -X POST -F "files=@./${image_fn}" +``` + +Now, test the microservice with posting a custom caption along with an image + +```bash +curl --silent --write-out "HTTPSTATUS:%{http_code}" \ + ${DATAPREP_INGEST_SERVICE_ENDPOINT} \ + -H 'Content-Type: multipart/form-data' \ + -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}" ``` -Also, you are able to get the list of all videos that you uploaded: +Also, you are able to get the list of all files that you uploaded: ```bash curl -X POST \ -H "Content-Type: application/json" \ - ${DATAPREP_GET_VIDEO_ENDPOINT} + ${DATAPREP_GET_FILE_ENDPOINT} ``` -Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`. +Then you will get the response python-style LIST like this. Notice the name of each uploaded file e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded file. The same files that are uploaded twice will have different `uuid`. 
```bash [ "WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4", - "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4" + "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4", + "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png", + "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav" ] ``` -To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS. +To delete all uploaded files along with data indexed with `$INDEX_NAME` in REDIS. ```bash curl -X POST \ -H "Content-Type: application/json" \ - ${DATAPREP_DELETE_VIDEO_ENDPOINT} + ${DATAPREP_DELETE_FILE_ENDPOINT} ``` 7. MegaService diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml b/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml index d7ac74084..e66aea1f0 100644 --- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml +++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/compose.yaml @@ -36,6 +36,7 @@ services: http_proxy: ${http_proxy} https_proxy: ${https_proxy} PORT: ${EMBEDDER_PORT} + entrypoint: ["python", "bridgetower_server.py", "--device", "hpu", "--model_name_or_path", $EMBEDDING_MODEL_ID] restart: unless-stopped embedding-multimodal: image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest} @@ -139,6 +140,7 @@ services: - https_proxy=${https_proxy} - http_proxy=${http_proxy} - BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT} + - DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT} - DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} - DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} ipc: host diff --git a/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh b/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh index 211a1a696..b5be052e1 100755 --- a/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh +++ b/MultimodalQnA/docker_compose/intel/hpu/gaudi/set_env.sh @@ -22,7 +22,8 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip} 
export LVM_SERVICE_HOST_IP=${host_ip} export MEGA_SERVICE_HOST_IP=${host_ip} export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna" +export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text" export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts" export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions" -export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos" -export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos" +export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files" +export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files" diff --git a/MultimodalQnA/tests/test_compose_on_gaudi.sh b/MultimodalQnA/tests/test_compose_on_gaudi.sh index dd7af39fb..3b629f52b 100644 --- a/MultimodalQnA/tests/test_compose_on_gaudi.sh +++ b/MultimodalQnA/tests/test_compose_on_gaudi.sh @@ -14,12 +14,13 @@ WORKPATH=$(dirname "$PWD") LOG_PATH="$WORKPATH/tests" ip_address=$(hostname -I | awk '{print $1}') +export image_fn="apple.png" export video_fn="WeAreGoingOnBullrun.mp4" +export caption_fn="apple.txt" function build_docker_images() { cd $WORKPATH/docker_image_build git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../ - echo "Build all the images with --no-cache, check docker_image_build.log for details..." 
service_list="multimodalqna multimodalqna-ui embedding-multimodal-bridgetower embedding-multimodal retriever-multimodal-redis lvm-tgi dataprep-multimodal-redis" docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log @@ -40,17 +41,18 @@ function setup_env() { export LLAVA_SERVER_PORT=8399 export LVM_ENDPOINT="http://${host_ip}:8399" export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc" - export LVM_MODEL_ID="llava-hf/llava-v1.6-vicuna-13b-hf" + export LVM_MODEL_ID="llava-hf/llava-v1.6-vicuna-7b-hf" export WHISPER_MODEL="base" export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip} export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip} export LVM_SERVICE_HOST_IP=${host_ip} export MEGA_SERVICE_HOST_IP=${host_ip} export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna" + export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text" export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts" export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions" - export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos" - export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos" + export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files" + export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files" } function start_services() { @@ -63,12 +65,15 @@ function start_services() { function prepare_data() { cd $LOG_PATH - echo "Downloading video" + echo "Downloading image and video" + wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn} wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn} + echo "Writing caption file" + echo "This is an apple." 
> ${caption_fn} sleep 30s - } + function validate_service() { local URL="$1" local EXPECTED_RESULT="$2" @@ -76,9 +81,15 @@ function validate_service() { local DOCKER_NAME="$4" local INPUT_DATA="$5" - if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then + if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then cd $LOG_PATH HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL") + elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then + cd $LOG_PATH + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL") + elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then + cd $LOG_PATH + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./apple.txt" -H 'Content-Type: multipart/form-data' "$URL") elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL") elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then @@ -147,27 +158,34 @@ function validate_microservices() { sleep 1m # retrieval can't curl as expected, try to wait for more time # test data prep - echo "Data Prep with Generating Transcript" + echo "Data Prep with Generating Transcript for Video" validate_service \ "${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \ "Data preparation succeeded" \ - "dataprep-multimodal-redis" \ + "dataprep-multimodal-redis-transcript" \ "dataprep-multimodal-redis" - echo "Data Prep with Generating Transcript" + echo "Data Prep with Image & Caption Ingestion" validate_service \ - "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \ + "${DATAPREP_INGEST_SERVICE_ENDPOINT}" \ "Data preparation succeeded" \ - "dataprep-multimodal-redis" \ + "dataprep-multimodal-redis-ingest" \ 
"dataprep-multimodal-redis" - echo "Validating get file" + echo "Validating get file returns mp4" validate_service \ - "${DATAPREP_GET_VIDEO_ENDPOINT}" \ + "${DATAPREP_GET_FILE_ENDPOINT}" \ '.mp4' \ "dataprep_get" \ "dataprep-multimodal-redis" + echo "Validating get file returns png" + validate_service \ + "${DATAPREP_GET_FILE_ENDPOINT}" \ + '.png' \ + "dataprep_get" \ + "dataprep-multimodal-redis" + sleep 1m # multimodal retrieval microservice @@ -180,7 +198,7 @@ function validate_microservices() { "retriever-multimodal-redis" \ "{\"text\":\"test\",\"embedding\":${your_embedding}}" - sleep 10s + sleep 3m # llava server echo "Evaluating LLAVA tgi-gaudi" @@ -200,6 +218,14 @@ function validate_microservices() { "lvm-tgi" \ '{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. 
{question}"}' + # data prep requiring lvm + echo "Data Prep with Generating Caption for Image" + validate_service \ + "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \ + "Data preparation succeeded" \ + "dataprep-multimodal-redis-caption" \ + "dataprep-multimodal-redis" + sleep 1m } @@ -224,14 +250,22 @@ function validate_megaservice() { } function validate_delete { - echo "Validate data prep delete videos" + echo "Validate data prep delete files" validate_service \ - "${DATAPREP_DELETE_VIDEO_ENDPOINT}" \ + "${DATAPREP_DELETE_FILE_ENDPOINT}" \ '{"status":true}' \ "dataprep_del" \ "dataprep-multimodal-redis" } +function delete_data() { + cd $LOG_PATH + echo "Deleting image, video, and caption" + rm -rf ${image_fn} + rm -rf ${video_fn} + rm -rf ${caption_fn} +} + function stop_docker() { cd $WORKPATH/docker_compose/intel/hpu/gaudi docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f @@ -256,6 +290,7 @@ function main() { validate_delete echo "==== delete validated ====" + delete_data stop_docker echo y | docker system prune diff --git a/MultimodalQnA/tests/test_compose_on_xeon.sh b/MultimodalQnA/tests/test_compose_on_xeon.sh index 46042c600..7d3ab0fae 100644 --- a/MultimodalQnA/tests/test_compose_on_xeon.sh +++ b/MultimodalQnA/tests/test_compose_on_xeon.sh @@ -14,7 +14,9 @@ WORKPATH=$(dirname "$PWD") LOG_PATH="$WORKPATH/tests" ip_address=$(hostname -I | awk '{print $1}') +export image_fn="apple.png" export video_fn="WeAreGoingOnBullrun.mp4" +export caption_fn="apple.txt" function build_docker_images() { cd $WORKPATH/docker_image_build @@ -37,6 +39,7 @@ function setup_env() { export INDEX_NAME="mm-rag-redis" export LLAVA_SERVER_PORT=8399 export LVM_ENDPOINT="http://${host_ip}:8399" + export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf" export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc" export WHISPER_MODEL="base" export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip} @@ -44,10 +47,11 @@ function setup_env() { export LVM_SERVICE_HOST_IP=${host_ip} 
export MEGA_SERVICE_HOST_IP=${host_ip} export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna" + export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text" export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts" export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions" - export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos" - export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos" + export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files" + export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files" } function start_services() { @@ -61,12 +65,14 @@ function start_services() { function prepare_data() { cd $LOG_PATH - echo "Downloading video" + echo "Downloading image and video" + wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn} wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn} - + echo "Writing caption file" + echo "This is an apple." 
> ${caption_fn} sleep 1m - } + function validate_service() { local URL="$1" local EXPECTED_RESULT="$2" @@ -74,9 +80,15 @@ function validate_service() { local DOCKER_NAME="$4" local INPUT_DATA="$5" - if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then + if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then cd $LOG_PATH HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL") + elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then + cd $LOG_PATH + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL") + elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then + cd $LOG_PATH + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./apple.txt" -H 'Content-Type: multipart/form-data' "$URL") elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL") elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then @@ -145,27 +157,34 @@ function validate_microservices() { sleep 1m # retrieval can't curl as expected, try to wait for more time # test data prep - echo "Data Prep with Generating Transcript" + echo "Data Prep with Generating Transcript for Video" validate_service \ "${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \ "Data preparation succeeded" \ - "dataprep-multimodal-redis" \ + "dataprep-multimodal-redis-transcript" \ "dataprep-multimodal-redis" - # echo "Data Prep with Generating Caption" - # validate_service \ - # "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \ - # "Data preparation succeeded" \ - # "dataprep-multimodal-redis" \ - # "dataprep-multimodal-redis" + echo "Data Prep with Image & Caption Ingestion" + validate_service \ + 
"${DATAPREP_INGEST_SERVICE_ENDPOINT}" \ + "Data preparation succeeded" \ + "dataprep-multimodal-redis-ingest" \ + "dataprep-multimodal-redis" - echo "Validating get file" + echo "Validating get file returns mp4" validate_service \ - "${DATAPREP_GET_VIDEO_ENDPOINT}" \ + "${DATAPREP_GET_FILE_ENDPOINT}" \ '.mp4' \ "dataprep_get" \ "dataprep-multimodal-redis" + echo "Validating get file returns png" + validate_service \ + "${DATAPREP_GET_FILE_ENDPOINT}" \ + '.png' \ + "dataprep_get" \ + "dataprep-multimodal-redis" + sleep 1m # multimodal retrieval microservice @@ -178,7 +197,7 @@ function validate_microservices() { "retriever-multimodal-redis" \ "{\"text\":\"test\",\"embedding\":${your_embedding}}" - sleep 10s + sleep 3m # llava server echo "Evaluating lvm-llava" @@ -198,6 +217,14 @@ function validate_microservices() { "lvm-llava-svc" \ '{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. 
{question}"}' + # data prep requiring lvm + echo "Data Prep with Generating Caption for Image" + validate_service \ + "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \ + "Data preparation succeeded" \ + "dataprep-multimodal-redis-caption" \ + "dataprep-multimodal-redis" + sleep 3m } @@ -222,14 +249,22 @@ function validate_megaservice() { } function validate_delete { - echo "Validate data prep delete videos" + echo "Validate data prep delete files" validate_service \ - "${DATAPREP_DELETE_VIDEO_ENDPOINT}" \ + "${DATAPREP_DELETE_FILE_ENDPOINT}" \ '{"status":true}' \ "dataprep_del" \ "dataprep-multimodal-redis" } +function delete_data() { + cd $LOG_PATH + echo "Deleting image, video, and caption" + rm -rf ${image_fn} + rm -rf ${video_fn} + rm -rf ${caption_fn} +} + function stop_docker() { cd $WORKPATH/docker_compose/intel/cpu/xeon docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f @@ -254,6 +289,7 @@ function main() { validate_delete echo "==== delete validated ====" + delete_data stop_docker echo y | docker system prune diff --git a/MultimodalQnA/ui/gradio/conversation.py b/MultimodalQnA/ui/gradio/conversation.py index 9f1a2827b..3057e9879 100644 --- a/MultimodalQnA/ui/gradio/conversation.py +++ b/MultimodalQnA/ui/gradio/conversation.py @@ -30,6 +30,7 @@ class Conversation: base64_frame: str = None skip_next: bool = False split_video: str = None + image: str = None def _template_caption(self): out = "" @@ -59,6 +60,8 @@ def get_prompt(self): else: base64_frame = get_b64_frame_from_timestamp(self.video_file, self.time_of_frame_ms) self.base64_frame = base64_frame + if base64_frame is None: + base64_frame = "" content.append({"type": "image_url", "image_url": {"url": base64_frame}}) else: content = message @@ -137,6 +140,7 @@ def dict(self): "caption": self.caption, "base64_frame": self.base64_frame, "split_video": self.split_video, + "image": self.image, } @@ -152,4 +156,5 @@ def dict(self): time_of_frame_ms=None, base64_frame=None, 
split_video=None, + image=None, ) diff --git a/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py b/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py index 3eba01a71..ec6a033ca 100644 --- a/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py +++ b/MultimodalQnA/ui/gradio/multimodalqna_ui_gradio.py @@ -13,7 +13,7 @@ from conversation import multimodalqna_conv from fastapi import FastAPI from fastapi.staticfiles import StaticFiles -from utils import build_logger, moderation_msg, server_error_msg, split_video +from utils import build_logger, make_temp_image, moderation_msg, server_error_msg, split_video logger = build_logger("gradio_web_server", "gradio_web_server.log") @@ -47,22 +47,24 @@ def clear_history(state, request: gr.Request): logger.info(f"clear_history. ip: {request.client.host}") if state.split_video and os.path.exists(state.split_video): os.remove(state.split_video) + if state.image and os.path.exists(state.image): + os.remove(state.image) state = multimodalqna_conv.copy() - return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 1 + return (state, state.to_gradio_chatbot(), None, None, None) + (disable_btn,) * 1 def add_text(state, text, request: gr.Request): logger.info(f"add_text. ip: {request.client.host}. 
len: {len(text)}") if len(text) <= 0: state.skip_next = True - return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 1 + return (state, state.to_gradio_chatbot(), None) + (no_change_btn,) * 1 text = text[:2000] # Hard cut-off state.append_message(state.roles[0], text) state.append_message(state.roles[1], None) state.skip_next = False - return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 1 + return (state, state.to_gradio_chatbot(), None) + (disable_btn,) * 1 def http_bot(state, request: gr.Request): @@ -73,7 +75,7 @@ def http_bot(state, request: gr.Request): if state.skip_next: # This generate call is skipped due to invalid inputs path_to_sub_videos = state.get_path_to_subvideos() - yield (state, state.to_gradio_chatbot(), path_to_sub_videos) + (no_change_btn,) * 1 + yield (state, state.to_gradio_chatbot(), path_to_sub_videos, None) + (no_change_btn,) * 1 return if len(state.messages) == state.offset + 2: @@ -97,7 +99,7 @@ def http_bot(state, request: gr.Request): logger.info(f"==== url request ====\n{gateway_addr}") state.messages[-1][-1] = "▌" - yield (state, state.to_gradio_chatbot(), state.split_video) + (disable_btn,) * 1 + yield (state, state.to_gradio_chatbot(), state.split_video, state.image) + (disable_btn,) * 1 try: response = requests.post( @@ -108,6 +110,7 @@ def http_bot(state, request: gr.Request): ) print(response.status_code) print(response.json()) + if response.status_code == 200: response = response.json() choice = response["choices"][-1] @@ -123,44 +126,61 @@ def http_bot(state, request: gr.Request): video_file = metadata["source_video"] state.video_file = os.path.join(static_dir, metadata["source_video"]) state.time_of_frame_ms = metadata["time_of_frame_ms"] - try: - splited_video_path = split_video( - state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}" - ) - except: - print(f"video {state.video_file} does not exist in UI host!") - splited_video_path = None - 
state.split_video = splited_video_path + file_ext = os.path.splitext(state.video_file)[-1] + if file_ext == ".mp4": + try: + splited_video_path = split_video( + state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}" + ) + except: + print(f"video {state.video_file} does not exist in UI host!") + splited_video_path = None + state.split_video = splited_video_path + elif file_ext in [".jpg", ".jpeg", ".png", ".gif"]: + try: + output_image_path = make_temp_image(state.video_file, file_ext) + except: + print(f"image {state.video_file} does not exist in UI host!") + output_image_path = None + state.image = output_image_path + else: raise requests.exceptions.RequestException except requests.exceptions.RequestException as e: state.messages[-1][-1] = server_error_msg - yield (state, state.to_gradio_chatbot(), None) + (enable_btn,) + yield (state, state.to_gradio_chatbot(), None, None) + (enable_btn,) return state.messages[-1][-1] = message - yield (state, state.to_gradio_chatbot(), state.split_video) + (enable_btn,) * 1 + yield ( + state, + state.to_gradio_chatbot(), + gr.Video(state.split_video, visible=state.split_video is not None), + gr.Image(state.image, visible=state.image is not None), + ) + (enable_btn,) * 1 logger.info(f"{state.messages[-1][-1]}") return -def ingest_video_gen_transcript(filepath, request: gr.Request): - yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database...")) +def ingest_gen_transcript(filepath, filetype, request: gr.Request): + yield ( + gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...") + ) verified_filepath = os.path.normpath(filepath) if not verified_filepath.startswith(tmp_upload_folder): - print("Found malicious video file name!") + print(f"Found malicious {filetype} file name!") yield ( gr.Textbox( visible=True, - value="Your uploaded video's file name has special characters that are not allowed. 
Please consider update the video file name!", + value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depends on the OS, some examples are \, /, :, and *). Please consider changing the file name.", ) ) return basename = os.path.basename(verified_filepath) dest = os.path.join(static_dir, basename) shutil.copy(verified_filepath, dest) - print("Done copy uploaded file to static folder!") + print("Done copying uploaded file to static folder.") headers = { # 'Content-Type': 'multipart/form-data' } @@ -172,17 +192,17 @@ def ingest_video_gen_transcript(filepath, request: gr.Request): if response.status_code == 200: response = response.json() print(response) - yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video...")) + yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}...")) time.sleep(2) fn_no_ext = Path(dest).stem - if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]: - new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext]) - print(response["video_id_maps"][fn_no_ext]) + if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]: + new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext]) + print(response["file_id_maps"][fn_no_ext]) os.rename(dest, new_dst) yield ( gr.Textbox( visible=True, - value="Congratulation! 
Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.", + value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.", ) ) return @@ -190,51 +210,53 @@ def ingest_video_gen_transcript(filepath, request: gr.Request): yield ( gr.Textbox( visible=True, - value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!", + value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.", ) ) time.sleep(2) return -def ingest_video_gen_caption(filepath, request: gr.Request): - yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database...")) +def ingest_gen_caption(filepath, filetype, request: gr.Request): + yield ( + gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...") + ) verified_filepath = os.path.normpath(filepath) if not verified_filepath.startswith(tmp_upload_folder): - print("Found malicious video file name!") + print(f"Found malicious {filetype} file name!") yield ( gr.Textbox( visible=True, - value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!", + value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depends on the OS, some examples are \, /, :, and *). 
Please consider changing the file name.", ) ) return basename = os.path.basename(verified_filepath) dest = os.path.join(static_dir, basename) shutil.copy(verified_filepath, dest) - print("Done copy uploaded file to static folder!") + print("Done copying uploaded file to static folder.") headers = { # 'Content-Type': 'multipart/form-data' } files = { "files": open(dest, "rb"), } - response = requests.post(dataprep_gen_captiono_addr, headers=headers, files=files) + response = requests.post(dataprep_gen_caption_addr, headers=headers, files=files) print(response.status_code) if response.status_code == 200: response = response.json() print(response) - yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video...")) + yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}...")) time.sleep(2) fn_no_ext = Path(dest).stem - if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]: - new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext]) - print(response["video_id_maps"][fn_no_ext]) + if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]: + new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext]) + print(response["file_id_maps"][fn_no_ext]) os.rename(dest, new_dst) yield ( gr.Textbox( visible=True, - value="Congratulation! 
Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.", + value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.", ) ) return @@ -242,48 +264,181 @@ def ingest_video_gen_caption(filepath, request: gr.Request): yield ( gr.Textbox( visible=True, - value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!", + value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.", ) ) time.sleep(2) return -def clear_uploaded_video(request: gr.Request): +def ingest_with_text(filepath, text, request: gr.Request): + yield (gr.Textbox(visible=True, value="Please wait for your uploaded image to be ingested into the database...")) + verified_filepath = os.path.normpath(filepath) + if not verified_filepath.startswith(tmp_upload_folder): + print("Found malicious image file name!") + yield ( + gr.Textbox( + visible=True, + value="Your uploaded image's file name has special characters that are not allowed (depends on the OS, some examples are \, /, :, and *). 
Please consider changing the file name.", + ) + ) + return + basename = os.path.basename(verified_filepath) + dest = os.path.join(static_dir, basename) + shutil.copy(verified_filepath, dest) + text_basename = "{}.txt".format(os.path.splitext(basename)[0]) + text_dest = os.path.join(static_dir, text_basename) + with open(text_dest, "w") as file: + file.write(text) + print("Done copying uploaded files to static folder!") + headers = { + # 'Content-Type': 'multipart/form-data' + } + files = [("files", (basename, open(dest, "rb"))), ("files", (text_basename, open(text_dest, "rb")))] + try: + response = requests.post(dataprep_ingest_addr, headers=headers, files=files) + finally: + os.remove(text_dest) + print(response.status_code) + if response.status_code == 200: + response = response.json() + print(response) + yield (gr.Textbox(visible=True, value="Image ingestion is done. Saving your uploaded image...")) + time.sleep(2) + fn_no_ext = Path(dest).stem + if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]: + new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext]) + print(response["file_id_maps"][fn_no_ext]) + os.rename(dest, new_dst) + yield ( + gr.Textbox( + visible=True, + value="Congratulation! 
Your upload is done!\nClick the X button on the top right of the image upload box to upload another image.", + ) + ) + return + else: + yield ( + gr.Textbox( + visible=True, + value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the image upload box to reupload your image!", + ) + ) + time.sleep(2) + return + + +def hide_text(request: gr.Request): return gr.Textbox(visible=False) -with gr.Blocks() as upload_gen_trans: - gr.Markdown("# Ingest Your Own Video - Utilizing Generated Transcripts") - gr.Markdown( - "Please use this interface to ingest your own video if the video has meaningful audio (e.g., announcements, discussions, etc...)" - ) +def clear_text(request: gr.Request): + return None + + +with gr.Blocks() as upload_video: + gr.Markdown("# Ingest Your Own Video Using Generated Transcripts or Captions") + gr.Markdown("Use this interface to ingest your own video and generate transcripts or captions for it") + + def select_upload_type(choice, request: gr.Request): + if choice == "transcript": + return gr.Video(sources="upload", visible=True), gr.Video(sources="upload", visible=False) + else: + return gr.Video(sources="upload", visible=False), gr.Video(sources="upload", visible=True) + with gr.Row(): with gr.Column(scale=6): - video_upload = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload") + video_upload_trans = gr.Video(sources="upload", elem_id="video_upload_trans", visible=True) + video_upload_cap = gr.Video(sources="upload", elem_id="video_upload_cap", visible=False) with gr.Column(scale=3): + text_options_radio = gr.Radio( + [ + ("Generate transcript (video contains voice)", "transcript"), + ("Generate captions (video does not contain voice)", "caption"), + ], + label="Text Options", + info="How should text be ingested?", + value="transcript", + ) text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status") - 
video_upload.upload(ingest_video_gen_transcript, [video_upload], [text_upload_result]) - video_upload.clear(clear_uploaded_video, [], [text_upload_result]) + video_upload_trans.upload( + ingest_gen_transcript, [video_upload_trans, gr.Textbox(value="video", visible=False)], [text_upload_result] + ) + video_upload_trans.clear(hide_text, [], [text_upload_result]) + video_upload_cap.upload( + ingest_gen_caption, [video_upload_cap, gr.Textbox(value="video", visible=False)], [text_upload_result] + ) + video_upload_cap.clear(hide_text, [], [text_upload_result]) + text_options_radio.change(select_upload_type, [text_options_radio], [video_upload_trans, video_upload_cap]) -with gr.Blocks() as upload_gen_captions: - gr.Markdown("# Ingest Your Own Video - Utilizing Generated Captions") - gr.Markdown( - "Please use this interface to ingest your own video if the video has meaningless audio (e.g., background musics, etc...)" - ) +with gr.Blocks() as upload_image: + gr.Markdown("# Ingest Your Own Image Using Generated or Custom Captions/Labels") + gr.Markdown("Use this interface to ingest your own image and generate a caption for it") + + def select_upload_type(choice, request: gr.Request): + if choice == "gen_caption": + return gr.Image(sources="upload", visible=True), gr.Image(sources="upload", visible=False) + else: + return gr.Image(sources="upload", visible=False), gr.Image(sources="upload", visible=True) + + with gr.Row(): + with gr.Column(scale=6): + image_upload_cap = gr.Image(type="filepath", sources="upload", elem_id="image_upload_cap", visible=True) + image_upload_text = gr.Image(type="filepath", sources="upload", elem_id="image_upload_cap", visible=False) + with gr.Column(scale=3): + text_options_radio = gr.Radio( + [("Generate caption", "gen_caption"), ("Custom caption or label", "custom_caption")], + label="Text Options", + info="How should text be ingested?", + value="gen_caption", + ) + custom_caption = gr.Textbox(visible=True, interactive=True, label="Custom 
Caption or Label") + text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status") + image_upload_cap.upload( + ingest_gen_caption, [image_upload_cap, gr.Textbox(value="image", visible=False)], [text_upload_result] + ) + image_upload_cap.clear(hide_text, [], [text_upload_result]) + image_upload_text.upload(ingest_with_text, [image_upload_text, custom_caption], [text_upload_result]).then( + clear_text, [], [custom_caption] + ) + image_upload_text.clear(hide_text, [], [text_upload_result]) + text_options_radio.change(select_upload_type, [text_options_radio], [image_upload_cap, image_upload_text]) + +with gr.Blocks() as upload_audio: + gr.Markdown("# Ingest Your Own Audio Using Generated Transcripts") + gr.Markdown("Use this interface to ingest your own audio file and generate a transcript for it") with gr.Row(): with gr.Column(scale=6): - video_upload_cap = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload_cap") + audio_upload = gr.Audio(type="filepath") + with gr.Column(scale=3): + text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status") + audio_upload.upload( + ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result] + ) + audio_upload.stop_recording( + ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result] + ) + audio_upload.clear(hide_text, [], [text_upload_result]) + +with gr.Blocks() as upload_pdf: + gr.Markdown("# Ingest Your Own PDF") + gr.Markdown("Use this interface to ingest your own PDF file with text, tables, images, and graphs") + with gr.Row(): + with gr.Column(scale=6): + image_upload_cap = gr.File() with gr.Column(scale=3): text_upload_result_cap = gr.Textbox(visible=False, interactive=False, label="Upload Status") - video_upload_cap.upload(ingest_video_gen_transcript, [video_upload_cap], [text_upload_result_cap]) - video_upload_cap.clear(clear_uploaded_video, [], 
[text_upload_result_cap]) + image_upload_cap.upload( + ingest_gen_caption, [image_upload_cap, gr.Textbox(value="PDF", visible=False)], [text_upload_result_cap] + ) + image_upload_cap.clear(hide_text, [], [text_upload_result_cap]) with gr.Blocks() as qna: state = gr.State(multimodalqna_conv.copy()) with gr.Row(): with gr.Column(scale=4): - video = gr.Video(height=512, width=512, elem_id="video") + video = gr.Video(height=512, width=512, elem_id="video", visible=True, label="Media") + image = gr.Image(height=512, width=512, elem_id="image", visible=False, label="Media") with gr.Column(scale=7): chatbot = gr.Chatbot(elem_id="chatbot", label="MultimodalQnA Chatbot", height=390) with gr.Row(): @@ -293,7 +448,8 @@ def clear_uploaded_video(request: gr.Request): # show_label=False, # container=False, label="Query", - info="Enter your query here!", + info="Enter a text query below", + # submit_btn=False, ) with gr.Column(scale=1, min_width=100): with gr.Row(): @@ -306,7 +462,7 @@ def clear_uploaded_video(request: gr.Request): [ state, ], - [state, chatbot, textbox, video, clear_btn], + [state, chatbot, textbox, video, image, clear_btn], ) submit_btn.click( @@ -318,17 +474,19 @@ def clear_uploaded_video(request: gr.Request): [ state, ], - [state, chatbot, video, clear_btn], + [state, chatbot, video, image, clear_btn], ) with gr.Blocks(css=css) as demo: gr.Markdown("# MultimodalQnA") with gr.Tabs(): - with gr.TabItem("MultimodalQnA With Your Videos"): + with gr.TabItem("MultimodalQnA"): qna.render() - with gr.TabItem("Upload Your Own Videos"): - upload_gen_trans.render() - with gr.TabItem("Upload Your Own Videos"): - upload_gen_captions.render() + with gr.TabItem("Upload Video"): + upload_video.render() + with gr.TabItem("Upload Image"): + upload_image.render() + with gr.TabItem("Upload Audio"): + upload_audio.render() demo.queue() app = gr.mount_gradio_app(app, demo, path="/") @@ -343,6 +501,9 @@ def clear_uploaded_video(request: gr.Request): parser.add_argument("--share", 
action="store_true") backend_service_endpoint = os.getenv("BACKEND_SERVICE_ENDPOINT", "http://localhost:8888/v1/multimodalqna") + dataprep_ingest_endpoint = os.getenv( + "DATAPREP_INGEST_SERVICE_ENDPOINT", "http://localhost:6007/v1/ingest_with_text" + ) dataprep_gen_transcript_endpoint = os.getenv( "DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT", "http://localhost:6007/v1/generate_transcripts" ) @@ -353,9 +514,11 @@ def clear_uploaded_video(request: gr.Request): logger.info(f"args: {args}") global gateway_addr gateway_addr = backend_service_endpoint + global dataprep_ingest_addr + dataprep_ingest_addr = dataprep_ingest_endpoint global dataprep_gen_transcript_addr dataprep_gen_transcript_addr = dataprep_gen_transcript_endpoint - global dataprep_gen_captiono_addr - dataprep_gen_captiono_addr = dataprep_gen_caption_endpoint + global dataprep_gen_caption_addr + dataprep_gen_caption_addr = dataprep_gen_caption_endpoint uvicorn.run(app, host=args.host, port=args.port) diff --git a/MultimodalQnA/ui/gradio/utils.py b/MultimodalQnA/ui/gradio/utils.py index f6e1027eb..7a730a7ed 100644 --- a/MultimodalQnA/ui/gradio/utils.py +++ b/MultimodalQnA/ui/gradio/utils.py @@ -5,6 +5,7 @@ import logging import logging.handlers import os +import shutil import sys from pathlib import Path @@ -118,6 +119,18 @@ def maintain_aspect_ratio_resize(image, width=None, height=None, inter=cv2.INTER return cv2.resize(image, dim, interpolation=inter) +def make_temp_image( + image_name, + file_ext, + output_image_path: str = "./public/images", + output_image_name: str = "image_tmp", +): + Path(output_image_path).mkdir(parents=True, exist_ok=True) + output_image = os.path.join(output_image_path, "{}.{}".format(output_image_name, file_ext)) + shutil.copy(image_name, output_image) + return output_image + + # function to split video at a timestamp def split_video( video_path, diff --git a/README.md b/README.md index 87581d3dd..a34166387 100644 --- a/README.md +++ b/README.md @@ -37,18 +37,19 @@ Deployment are 
based on released docker images by default, check [docker image l #### Deploy Examples -| Use Case | Docker Compose
<br/>Deployment on Xeon | Docker Compose<br/>
Deployment on Gaudi | Kubernetes with Manifests | Kubernetes with Helm Charts | Kubernetes with GMC | -| ----------------- | ------------------------------------------------------------------------------ | -------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ | -| ChatQnA | [Xeon Instructions](ChatQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](ChatQnA/docker_compose/intel/hpu/gaudi/README.md) | [ChatQnA with Manifests](ChatQnA/kubernetes/intel/README.md) | [ChatQnA with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) | [ChatQnA with GMC](ChatQnA/kubernetes/intel/README_gmc.md) | -| CodeGen | [Xeon Instructions](CodeGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeGen/docker_compose/intel/hpu/gaudi/README.md) | [CodeGen with Manifests](CodeGen/kubernetes/intel/README.md) | [CodeGen with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codegen/README.md) | [CodeGen with GMC](CodeGen/kubernetes/intel/README_gmc.md) | -| CodeTrans | [Xeon Instructions](CodeTrans/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeTrans/docker_compose/intel/hpu/gaudi/README.md) | [CodeTrans with Manifests](CodeTrans/kubernetes/intel/README.md) | [CodeTrans with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codetrans/README.md) | [CodeTrans with GMC](CodeTrans/kubernetes/intel/README_gmc.md) | -| DocSum | [Xeon Instructions](DocSum/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](DocSum/docker_compose/intel/hpu/gaudi/README.md) | [DocSum with Manifests](DocSum/kubernetes/intel/README.md) | [DocSum with Helm 
Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/docsum/README.md) | [DocSum with GMC](DocSum/kubernetes/intel/README_gmc.md) | -| SearchQnA | [Xeon Instructions](SearchQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](SearchQnA/docker_compose/intel/hpu/gaudi/README.md) | Not Supported | Not Supported | [SearchQnA with GMC](SearchQnA/kubernetes/intel/README_gmc.md) | -| FaqGen | [Xeon Instructions](FaqGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](FaqGen/docker_compose/intel/hpu/gaudi/README.md) | [FaqGen with Manifests](FaqGen/kubernetes/intel/README.md) | Not Supported | [FaqGen with GMC](FaqGen/kubernetes/intel/README_gmc.md) | -| Translation | [Xeon Instructions](Translation/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](Translation/docker_compose/intel/hpu/gaudi/README.md) | [Translation with Manifests](Translation/kubernetes/intel/README.md) | Not Supported | [Translation with GMC](Translation/kubernetes/intel/README_gmc.md) | -| AudioQnA | [Xeon Instructions](AudioQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](AudioQnA/docker_compose/intel/hpu/gaudi/README.md) | [AudioQnA with Manifests](AudioQnA/kubernetes/intel/README.md) | Not Supported | [AudioQnA with GMC](AudioQnA/kubernetes/intel/README_gmc.md) | -| VisualQnA | [Xeon Instructions](VisualQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](VisualQnA/docker_compose/intel/hpu/gaudi/README.md) | [VisualQnA with Manifests](VisualQnA/kubernetes/intel/README.md) | Not Supported | [VisualQnA with GMC](VisualQnA/kubernetes/intel/README_gmc.md) | -| ProductivitySuite | [Xeon Instructions](ProductivitySuite/docker_compose/intel/cpu/xeon/README.md) | Not Supported | [ProductivitySuite with Manifests](ProductivitySuite/kubernetes/intel/README.md) | Not Supported | Not Supported | +| Use Case | Docker Compose
<br/>Deployment on Xeon | Docker Compose<br/>
Deployment on Gaudi | Kubernetes with Manifests | Kubernetes with Helm Charts | Kubernetes with GMC | +| ----------------- | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ | +| ChatQnA | [Xeon Instructions](ChatQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](ChatQnA/docker_compose/intel/hpu/gaudi/README.md) | [ChatQnA with Manifests](ChatQnA/kubernetes/intel/README.md) | [ChatQnA with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) | [ChatQnA with GMC](ChatQnA/kubernetes/intel/README_gmc.md) | +| CodeGen | [Xeon Instructions](CodeGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeGen/docker_compose/intel/hpu/gaudi/README.md) | [CodeGen with Manifests](CodeGen/kubernetes/intel/README.md) | [CodeGen with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codegen/README.md) | [CodeGen with GMC](CodeGen/kubernetes/intel/README_gmc.md) | +| CodeTrans | [Xeon Instructions](CodeTrans/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](CodeTrans/docker_compose/intel/hpu/gaudi/README.md) | [CodeTrans with Manifests](CodeTrans/kubernetes/intel/README.md) | [CodeTrans with Helm Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/codetrans/README.md) | [CodeTrans with GMC](CodeTrans/kubernetes/intel/README_gmc.md) | +| DocSum | [Xeon Instructions](DocSum/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](DocSum/docker_compose/intel/hpu/gaudi/README.md) | [DocSum with Manifests](DocSum/kubernetes/intel/README.md) | [DocSum with Helm 
Charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/docsum/README.md) | [DocSum with GMC](DocSum/kubernetes/intel/README_gmc.md) | +| SearchQnA | [Xeon Instructions](SearchQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](SearchQnA/docker_compose/intel/hpu/gaudi/README.md) | Not Supported | Not Supported | [SearchQnA with GMC](SearchQnA/kubernetes/intel/README_gmc.md) | +| FaqGen | [Xeon Instructions](FaqGen/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](FaqGen/docker_compose/intel/hpu/gaudi/README.md) | [FaqGen with Manifests](FaqGen/kubernetes/intel/README.md) | Not Supported | [FaqGen with GMC](FaqGen/kubernetes/intel/README_gmc.md) | +| Translation | [Xeon Instructions](Translation/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](Translation/docker_compose/intel/hpu/gaudi/README.md) | [Translation with Manifests](Translation/kubernetes/intel/README.md) | Not Supported | [Translation with GMC](Translation/kubernetes/intel/README_gmc.md) | +| AudioQnA | [Xeon Instructions](AudioQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](AudioQnA/docker_compose/intel/hpu/gaudi/README.md) | [AudioQnA with Manifests](AudioQnA/kubernetes/intel/README.md) | Not Supported | [AudioQnA with GMC](AudioQnA/kubernetes/intel/README_gmc.md) | +| VisualQnA | [Xeon Instructions](VisualQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](VisualQnA/docker_compose/intel/hpu/gaudi/README.md) | [VisualQnA with Manifests](VisualQnA/kubernetes/intel/README.md) | Not Supported | [VisualQnA with GMC](VisualQnA/kubernetes/intel/README_gmc.md) | +| MultimodalQnA | [Xeon Instructions](MultimodalQnA/docker_compose/intel/cpu/xeon/README.md) | [Gaudi Instructions](MultimodalQnA/docker_compose/intel/hpu/gaudi/README.md) | Not supported | Not supported | Not supported | +| ProductivitySuite | [Xeon Instructions](ProductivitySuite/docker_compose/intel/cpu/xeon/README.md) | Not Supported | 
[ProductivitySuite with Manifests](ProductivitySuite/kubernetes/intel/README.md) | Not Supported | Not Supported | ## Supported Examples diff --git a/docker_images_list.md b/docker_images_list.md index 8380efde7..ea25a906e 100644 --- a/docker_images_list.md +++ b/docker_images_list.md @@ -26,8 +26,8 @@ Take ChatQnA for example. ChatQnA is a chatbot application service based on the | [opea/faqgen](https://hub.docker.com/r/opea/faqgen) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/Dockerfile) | The docker image served as a faqgen gateway and automatically generating comprehensive, natural sounding Frequently Asked Questions (FAQs) from documents, legal texts, customer inquiries and other sources. | | [opea/faqgen-ui](https://hub.docker.com/r/opea/faqgen-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/ui/docker/Dockerfile) | The docker image serves as the docsum UI entry point for easy interaction with users, generating FAQs by pasting in question text. | | [opea/faqgen-react-ui](https://hub.docker.com/r/opea/faqgen-react-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/FaqGen/ui/docker/Dockerfile.react) | The purpose of the docker image is to provide a user interface for Generate FAQs using React. It allows generating FAQs by uploading files or pasting text. | -| [opea/multimodalqna](https://hub.docker.com/r/opea/multimodalqna) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/Dockerfile) | The docker image served as a multimodalqna gateway and dynamically fetches the most relevant multimodal information (frames, transcripts, and/or subtitles) from the user's video collection to solve the problem. | -| [opea/multimodalqna-ui](https://hub.docker.com/r/opea/multimodalqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/ui/docker/Dockerfile) | The docker image serves as the docsum UI entry point for easy interaction with users. 
Answers to questions are generated from videos uploaded by users.. | +| [opea/multimodalqna](https://hub.docker.com/r/opea/multimodalqna) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/Dockerfile) | The docker image served as a multimodalqna gateway and dynamically fetches the most relevant multimodal information (frames, transcripts, and/or subtitles) from the user's video, image, or audio collection to solve the problem. | +| [opea/multimodalqna-ui](https://hub.docker.com/r/opea/multimodalqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/ui/docker/Dockerfile) | The docker image serves as the multimodalqna UI entry point for easy interaction with users. Answers to questions are generated from videos, images, or audio files uploaded by users. | | [opea/productivity-suite-react-ui-server](https://hub.docker.com/r/opea/productivity-suite-react-ui-server) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/ProductivitySuite/ui/docker/Dockerfile.react) | The purpose of the docker image is to provide a user interface for Productivity Suite Application using React. It allows interaction by uploading documents and inputs.
| | [opea/searchqna](https://hub.docker.com/r/opea/searchqna/tags) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/SearchQnA/Dockerfile) | The docker image served as the searchqna gateway to provide service of retrieving accurate and relevant answers to user queries from a knowledge base or dataset | | [opea/searchqna-ui](https://hub.docker.com/r/opea/searchqna-ui) | [Link](https://github.com/opea-project/GenAIExamples/blob/main/SearchQnA/ui/docker/Dockerfile) | The docker image acted as the searchqna UI entry for facilitating interaction with users for question answering | diff --git a/supported_examples.md b/supported_examples.md index 33b02f71d..0754be3ee 100644 --- a/supported_examples.md +++ b/supported_examples.md @@ -186,7 +186,15 @@ FAQ Generation Application leverages the power of large language models (LLMs) t ### MultimodalQnA -[MultimodalQnA](./MultimodalQnA/README.md) addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. +[MultimodalQnA](./MultimodalQnA/README.md) addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, or audio files. MultimodalQnA utilizes BridgeTower model, a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user. 
+ +| Service | Model | HW | Description | +| --------- | ----------------------------------------------------------------------------------------------------------------- | ---------- | ----------------------------- | +| Embedding | [BridgeTower/bridgetower-large-itm-mlm-itc](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-itc) | Xeon/Gaudi | Multimodal embeddings service | +| Embedding | [BridgeTower/bridgetower-large-itm-mlm-gaudi](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi) | Gaudi | Multimodal embeddings service | +| LVM | [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | Xeon | LVM service | +| LVM | [llava-hf/llava-1.5-13b-hf](https://huggingface.co/llava-hf/llava-1.5-13b-hf) | Xeon | LVM service | +| LVM | [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) | Gaudi | LVM service | ### ProductivitySuite