MultimodalQnA Image and Audio Support Phase 1 (#852)
* Adds an endpoint for image ingestion

Signed-off-by: Melanie Buehler <[email protected]>

* Combined image and video endpoint

Signed-off-by: Melanie Buehler <[email protected]>

* Add test and update README

Signed-off-by: Melanie Buehler <[email protected]>

* fixed variable name for embedding model (#1)

Signed-off-by: okhleif-IL <[email protected]>

* Fixed test script

Signed-off-by: Melanie Buehler <[email protected]>

* Remove redundant function

Signed-off-by: Melanie Buehler <[email protected]>

* get_videos, delete_videos --> get_files, delete_files (#3)

Signed-off-by: okhleif-IL <[email protected]>

* Updates test per review feedback

Signed-off-by: Melanie Buehler <[email protected]>

* Fixed test

Signed-off-by: Melanie Buehler <[email protected]>

* Add support for audio files multimodal data ingestion (#4)

* Add support for audio files multimodal data ingestion

Signed-off-by: dmsuehir <[email protected]>

* Update function name

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: dmsuehir <[email protected]>

* Change videos_with_transcripts to ingest_with_text

Signed-off-by: Melanie Buehler <[email protected]>

* Add image support to video ingestion with transcript functionality

Signed-off-by: Melanie Buehler <[email protected]>

* Update test and README

Signed-off-by: Melanie Buehler <[email protected]>

* Updated for review suggestions

Signed-off-by: Melanie Buehler <[email protected]>

* Add two tests for ingest_with_text

Signed-off-by: Melanie Buehler <[email protected]>

* LVM TGI Gaudi update for prompts without images (#7)

* LVM Gaudi TGI update for prompts without images

Signed-off-by: dmsuehir <[email protected]>

* Wording

Signed-off-by: dmsuehir <[email protected]>

* Add a test

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: dmsuehir <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change dummy image to be b64 encoded instead of the url (#9)

Signed-off-by: dmsuehir <[email protected]>

* Updates based on review feedback (#10)

Signed-off-by: dmsuehir <[email protected]>

* Test fix (#11)

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: Melanie Buehler <[email protected]>
Signed-off-by: okhleif-IL <[email protected]>
Signed-off-by: dmsuehir <[email protected]>
Co-authored-by: dmsuehir <[email protected]>
Co-authored-by: Omar Khleif <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
5 people authored Nov 8, 2024
1 parent 786cabe commit 29ef642
Showing 14 changed files with 618 additions and 209 deletions.
70 changes: 49 additions & 21 deletions comps/dataprep/multimodal/redis/langchain/README.md
@@ -1,6 +1,10 @@
# Dataprep Microservice for Multimodal Data with Redis

This `dataprep` microservice accepts videos (mp4 files) and their transcripts (optional) from the user and ingests them into Redis vectorstore.
This `dataprep` microservice accepts the following from the user and ingests them into a Redis vector store:

- Videos (mp4 files) and their transcripts (optional)
- Images (gif, jpg, jpeg, and png files) and their captions (optional)
- Audio (wav files)

## 🚀1. Start Microservice with Python(Option 1)

@@ -107,18 +111,18 @@ docker container logs -f dataprep-multimodal-redis

## 🚀4. Consume Microservice

Once this dataprep microservice is started, user can use the below commands to invoke the microservice to convert videos and their transcripts (optional) to embeddings and save to the Redis vector store.
Once this dataprep microservice is started, users can use the commands below to invoke the microservice to convert images and videos and their transcripts (optional) to embeddings and save them to the Redis vector store.

This mircroservice has provided 3 different ways for users to ingest videos into Redis vector store corresponding to the 3 use cases.
This microservice provides three different ways for users to ingest files into the Redis vector store, corresponding to three use cases.

### 4.1 Consume _videos_with_transcripts_ API
### 4.1 Consume _ingest_with_text_ API

**Use case:** This API is used when a transcript file (under `.vtt` format) is available for each video.
**Use case:** This API is used when videos are accompanied by transcript files (`.vtt` format) or images are accompanied by text caption files (`.txt` format).

**Important notes:**

- Make sure the file paths after `files=@` are correct.
- Every transcript file's name must be identical with its corresponding video file's name (except their extension .vtt and .mp4). For example, `video1.mp4` and `video1.vtt`. Otherwise, if `video1.vtt` is not included correctly in this API call, this microservice will return error `No captions file video1.vtt found for video1.mp4`.
- Every transcript or caption file's name must be identical to its corresponding video or image file's name (except for the extension: `.vtt` goes with `.mp4`, and `.txt` goes with `.jpg`, `.jpeg`, `.png`, or `.gif`). For example, `video1.mp4` pairs with `video1.vtt`. If `video1.vtt` is not included correctly in the API call, the microservice will return the error `No captions file video1.vtt found for video1.mp4`.
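The naming rule above can be expressed as a small validator. The helper below is a hypothetical sketch (it is not part of the microservice) that reports which companion text files the rule expects but were not uploaded:

```python
import os

# Media extensions and the text extension each one pairs with, per the
# rule above: .vtt for videos, .txt for images.
PAIRS = {".mp4": ".vtt", ".png": ".txt", ".jpg": ".txt", ".jpeg": ".txt", ".gif": ".txt"}


def missing_text_files(filenames):
    """Return the companion text files expected by the rule but absent."""
    names = set(filenames)
    missing = []
    for name in filenames:
        base, ext = os.path.splitext(name)
        expected = PAIRS.get(ext.lower())
        if expected and base + expected not in names:
            missing.append(base + expected)
    return missing


print(missing_text_files(["video1.mp4", "video1.vtt", "image1.png"]))
# ['image1.txt']
```

Running such a check locally before calling the API avoids the `No captions file ... found` error described above.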

#### Single video-transcript pair upload

@@ -127,10 +131,20 @@ curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video1.vtt" \
http://localhost:6007/v1/videos_with_transcripts
http://localhost:6007/v1/ingest_with_text
```

#### Single image-caption pair upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./image.jpg" \
-F "files=@./image.txt" \
http://localhost:6007/v1/ingest_with_text
```

#### Multiple video-transcript pair upload
#### Multiple file pair upload

```bash
curl -X POST \
@@ -139,16 +153,20 @@ curl -X POST \
-F "files=@./video1.vtt" \
-F "files=@./video2.mp4" \
-F "files=@./video2.vtt" \
http://localhost:6007/v1/videos_with_transcripts
-F "files=@./image1.png" \
-F "files=@./image1.txt" \
-F "files=@./image2.jpg" \
-F "files=@./image2.txt" \
http://localhost:6007/v1/ingest_with_text
```

### 4.2 Consume _generate_transcripts_ API

**Use case:** This API should be used when a video has meaningful audio or recognizable speech but its transcript file is not available.
**Use case:** This API should be used for audio files with speech, or for videos that have meaningful audio or recognizable speech but no available transcript file.

In this use case, this microservice will use [`whisper`](https://openai.com/index/whisper/) model to generate the `.vtt` transcript for the video.
In this use case, this microservice will use the [`whisper`](https://openai.com/index/whisper/) model to generate `.vtt` transcripts for the video or audio files.

#### Single video upload
#### Single file upload

```bash
curl -X POST \
@@ -157,21 +175,22 @@ curl -X POST \
http://localhost:6007/v1/generate_transcripts
```

#### Multiple video upload
#### Multiple file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video2.mp4" \
-F "files=@./audio1.wav" \
http://localhost:6007/v1/generate_transcripts
```

### 4.3 Consume _generate_captions_ API

**Use case:** This API should be used when a video does not have meaningful audio or does not have audio.
**Use case:** This API should be used when uploading an image, or when uploading a video that has no audio or no meaningful audio.

In this use case, transcript either does not provide any meaningful information or does not exist. Thus, it is preferred to leverage a LVM microservice to summarize the video frames.
In this use case, there is no meaningful language transcription, so it is preferable to leverage an LVM microservice to summarize the frames.

- Single video upload

@@ -192,22 +211,31 @@ curl -X POST \
http://localhost:6007/v1/generate_captions
```

### 4.4 Consume get_videos API
- Single image upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./image.jpg" \
http://localhost:6007/v1/generate_captions
```

### 4.4 Consume get_files API

To get names of uploaded videos, use the following command.
To get the names of uploaded files, use the following command.

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6007/v1/dataprep/get_videos
http://localhost:6007/v1/dataprep/get_files
```

### 4.5 Consume delete_videos API
### 4.5 Consume delete_files API

To delete uploaded videos and clear the database, use the following command.
To delete uploaded files and clear the database, use the following command.

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6007/v1/dataprep/delete_videos
http://localhost:6007/v1/dataprep/delete_files
```
2 changes: 1 addition & 1 deletion comps/dataprep/multimodal/redis/langchain/config.py
@@ -4,7 +4,7 @@
import os

# Models
EMBED_MODEL = os.getenv("EMBED_MODEL", "BridgeTower/bridgetower-large-itm-mlm-itc")
EMBED_MODEL = os.getenv("EMBEDDING_MODEL_ID", "BridgeTower/bridgetower-large-itm-mlm-itc")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "small")

# Redis Connection Information
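The renamed environment variable can be exercised with a quick standalone check. This is only a sketch of the `os.getenv` fallback pattern shown in the `config.py` diff above (the variable name `EMBEDDING_MODEL_ID` and the BridgeTower default both come from that diff), not code from the repository:

```python
import os

# Simulate a deployment that sets the override explicitly.
os.environ["EMBEDDING_MODEL_ID"] = "BridgeTower/bridgetower-large-itm-mlm-itc"

# Same pattern as config.py: read the env var, fall back to the default.
EMBED_MODEL = os.getenv("EMBEDDING_MODEL_ID", "BridgeTower/bridgetower-large-itm-mlm-itc")
print(EMBED_MODEL)  # BridgeTower/bridgetower-large-itm-mlm-itc
```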
71 changes: 61 additions & 10 deletions comps/dataprep/multimodal/redis/langchain/multimodal_utils.py
@@ -39,8 +39,8 @@ def clear_upload_folder(upload_path):
os.rmdir(dir_path)


def generate_id():  # renamed from generate_video_id
    """Generates a unique identifier for a file."""
    return str(uuid.uuid4())


@@ -128,8 +128,49 @@ def convert_img_to_base64(image):
return encoded_string.decode()


def generate_annotations_from_transcript(file_id: str, file_path: str, vtt_path: str, output_dir: str):
    """Generates an annotations.json from the transcript file."""

    # Set up location to store frames and annotations
    os.makedirs(output_dir, exist_ok=True)

    # read captions file
    captions = webvtt.read(vtt_path)

    annotations = []
    for idx, caption in enumerate(captions):
        start_time = str2time(caption.start)
        end_time = str2time(caption.end)
        mid_time = (end_time + start_time) / 2
        mid_time_ms = mid_time * 1000
        text = caption.text.replace("\n", " ")

        # Create annotations for frame from transcripts with an empty image
        annotations.append(
            {
                "video_id": file_id,
                "video_name": os.path.basename(file_path),
                "b64_img_str": "",
                "caption": text,
                "time": mid_time_ms,
                "frame_no": 0,
                "sub_video_id": idx,
            }
        )

    # Save transcript annotations as json file for further processing
    with open(os.path.join(output_dir, "annotations.json"), "w") as f:
        json.dump(annotations, f)

    return annotations
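The timing arithmetic above (annotating each caption at the midpoint of its interval, in milliseconds) can be checked in isolation. The helper below is an illustrative standalone sketch, not part of the module:

```python
def caption_midpoint_ms(start_s: float, end_s: float) -> float:
    """Midpoint of a caption's [start, end] interval in milliseconds,
    mirroring mid_time_ms = ((end + start) / 2) * 1000 above."""
    return ((start_s + end_s) / 2) * 1000


print(caption_midpoint_ms(2.0, 4.0))  # 3000.0
```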


def extract_frames_and_annotations_from_transcripts(video_id: str, video_path: str, vtt_path: str, output_dir: str):
"""Extract frames (.png) and annotations (.json) from video file (.mp4) and captions file (.vtt)"""
"""Extract frames (.png) and annotations (.json) from media-text file pairs.
File pairs can be a video
file (.mp4) and transcript file (.vtt) or an image file (.png, .jpg, .jpeg, .gif) and caption file (.txt)
"""
# Set up location to store frames and annotations
os.makedirs(output_dir, exist_ok=True)
os.makedirs(os.path.join(output_dir, "frames"), exist_ok=True)
@@ -139,18 +180,28 @@ def extract_frames_and_annotations_from_transcripts(video_id: str, video_path: str, vtt_path: str, output_dir: str):
fps = vidcap.get(cv2.CAP_PROP_FPS)

    # read captions file
    if os.path.splitext(vtt_path)[-1] == ".vtt":
        captions = webvtt.read(vtt_path)
    else:
        # Wrap the caption text in a list so the loop below sees one caption,
        # not one iteration per character of the string
        with open(vtt_path, "r") as f:
            captions = [f.read()]

    annotations = []
    for idx, caption in enumerate(captions):
        if os.path.splitext(vtt_path)[-1] == ".vtt":
            start_time = str2time(caption.start)
            end_time = str2time(caption.end)

            mid_time = (end_time + start_time) / 2
            text = caption.text.replace("\n", " ")

            frame_no = time_to_frame(mid_time, fps)
            mid_time_ms = mid_time * 1000
        else:
            frame_no = 0
            mid_time_ms = 0
            text = caption.replace("\n", " ")

        vidcap.set(cv2.CAP_PROP_POS_MSEC, mid_time_ms)
        success, frame = vidcap.read()

