Added Gemini as an alternative to Whisper for transcription
Change-Id: I0d03569bab23f9d31480284962d72e83edb19a7f
mohabfekry committed Nov 4, 2024
1 parent 88228f9 commit 43175f1
Showing 10 changed files with 225 additions and 28 deletions.
17 changes: 13 additions & 4 deletions README.md
@@ -33,10 +33,11 @@ limitations under the License.
Update to the latest version by running `npm run update-app` after pulling the latest changes from the repository via `git pull --rebase --autostash`; you would need to redeploy the *UI* for features marked as `frontend`, and *GCP components* for features marked as `backend`.

* [October 2024]
* `frontend` + `backend`: Added Gemini as an alternative to Whisper for transcription. Read more [here](#2-video-processing-and-extraction).
  * `frontend`: Added functionality to regenerate Demand Gen text assets. Read more [here](#6-output-videos).
* `frontend` + `backend`: Added functionality to "fade out" audio at the end of generated videos. Read more [here](#42-user-controls-for-video-rendering).
* [September 2024]
* `backend`: You can now process any video of any length or size - even beyond the Google Cloud Video AI API [limits](https://cloud.google.com/video-intelligence/quotas) of 50 GB size and up to 3h video length.
* `backend`: You can now process any video of any length or size - even beyond the Google Cloud Video Intelligence API [limits](https://cloud.google.com/video-intelligence/quotas) of 50 GB size and up to 3h video length.
* [August 2024]
* Updated the [pricing](#pricing-and-quotas) section and Cloud calculator example to use the new (cheaper) pricing for `Gemini 1.5 Flash`.
* `frontend`: You can now manually move the Smart Framing crop area to better capture the point of interest. Read more [here](#3-object-tracking-and-smart-framing).
@@ -172,9 +173,17 @@ Users upload or select videos they have previously analysed via the UI's `Video
New uploads into GCS trigger the Extractor service Cloud Function, which extracts all video information and stores the results on GCS (`input.vtt`, `analysis.json` and `data.json`).

* First, background music and voice-over (if available) are separated via the [spleeter](https://github.com/deezer/spleeter) library, and the voice-over is transcribed.
* Transcription is done via the [faster-whisper](https://github.com/SYSTRAN/faster-whisper) library, which uses OpenAI's Whisper model under the hood. By default, Vigenair uses the [small](https://github.com/openai/whisper#available-models-and-languages) multilingual model which provides the optimal quality-performance balance. If you find that it is not working well for your target language you may change the model used by the Cloud Function by setting the `CONFIG_WHISPER_MODEL` variable in the [update_config.sh](service/update_config.sh) script, which can be used to update the function's runtime variables. The transcription output is stored in an `input.vtt` file, along with a `language.txt` file containing the video's primary language, in the same folder as the input video.
* Video analysis is done via the Cloud [Video AI API](https://cloud.google.com/video-intelligence), where visual shots, detected objects - with tracking, labels, people and faces, and recognised logos and any on-screen text within the input video are extracted. The output is stored in an `analysis.json` file in the same folder as the input video.
* Finally, *coherent* audio/video segments are created using the transcription and video intelligence outputs and then cut into individual video files and stored on GCS in an `av_segments_cuts` subfolder under the root video folder. These cuts are then annotated via multimodal models on Vertex AI, which provide a description and a set of associated keywords / topics per segment. The fully annotated segments (including all information from the Video AI API) are then compiled into a `data.json` file that is stored in the same folder as the input video.
* Transcription can be done either via Gemini or via the [faster-whisper](https://github.com/SYSTRAN/faster-whisper) library, which uses OpenAI's Whisper model under the hood. This is controlled by the following configuration properties:

| Component | Configuration Property | Supported Values | Default Value |
| --- | --- | --- | --- |
| `frontend` | `CONFIG.defaultTranscriptionService` property in [config.ts](ui/src/config.ts) | `whisper` or `gemini` | `whisper` |
| `backend` | `CONFIG_TRANSCRIPTION_SERVICE` environment variable | `whisper` or `gemini` | `whisper` |
| `backend` | `CONFIG_TRANSCRIPTION_MODEL` environment variable | Supported models for [whisper](https://github.com/openai/whisper#available-models-and-languages) and [gemini](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models) | `small` (whisper) |

* Vigenair defaults to Whisper for its transcription quality. The [small](https://github.com/openai/whisper#available-models-and-languages) multilingual model is used by default, as it provides the best quality-performance balance. If it does not work well for your target language, you can change the model used by the Cloud Function by setting the `CONFIG_TRANSCRIPTION_MODEL` variable in the [update_config.sh](service/update_config.sh) script, which updates the function's runtime variables. You can also switch the transcription service to Gemini as shown in the table above; when doing so, make sure `CONFIG_TRANSCRIPTION_MODEL` is set to `gemini-1.5-flash` or another supported [Gemini model](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models) on Vertex AI (see the sketch after this list). In either case, the transcription output is stored in an `input.vtt` file, along with a `language.txt` file containing the video's primary language, in the same folder as the input video.
* Video analysis is done via the Cloud [Video Intelligence API](https://cloud.google.com/video-intelligence), where visual shots, detected objects - with tracking, labels, people and faces, and recognised logos and any on-screen text within the input video are extracted. The output is stored in an `analysis.json` file in the same folder as the input video.
* Finally, *coherent* audio/video segments are created using the transcription and video intelligence outputs and then cut into individual video files and stored on GCS in an `av_segments_cuts` subfolder under the root video folder. These cuts are then annotated via multimodal models on Vertex AI, which provide a description and a set of associated keywords / topics per segment. The fully annotated segments (including all information from the Video Intelligence API) are then compiled into a `data.json` file that is stored in the same folder as the input video.
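
For illustration, the following minimal sketch (not part of Vigenair's codebase) shows how a backend could resolve the two settings above before transcribing. The helper name `resolve_transcription_backend` and its validation logic are assumptions for this example only; the actual dispatch between Whisper and Gemini lives in `service/audio/audio.py`.

```python
import os

# Hypothetical helper, for illustration only: reads the backend environment
# variables from the table above and sanity-checks that the configured model
# matches the configured service.
def resolve_transcription_backend() -> tuple[str, str]:
  service = os.environ.get('CONFIG_TRANSCRIPTION_SERVICE', 'whisper')
  model = os.environ.get('CONFIG_TRANSCRIPTION_MODEL', 'small')
  if service == 'gemini' and not model.startswith('gemini-'):
    raise ValueError(
        'Set CONFIG_TRANSCRIPTION_MODEL to a Gemini model '
        '(e.g. gemini-1.5-flash) when CONFIG_TRANSCRIPTION_SERVICE=gemini.'
    )
  return service, model
```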

#### 3. Object Tracking and Smart Framing

3 changes: 2 additions & 1 deletion service/.env.yaml
@@ -16,7 +16,8 @@ GCP_PROJECT_ID: '<gcp-project-id>'
GCP_LOCATION: '<gcp-region>'
CONFIG_TEXT_MODEL: gemini-1.5-flash
CONFIG_VISION_MODEL: gemini-1.5-flash
CONFIG_WHISPER_MODEL: small
CONFIG_TRANSCRIPTION_SERVICE: whisper
CONFIG_TRANSCRIPTION_MODEL: small
CONFIG_ANNOTATIONS_CONFIDENCE_THRESHOLD: '0.7'
CONFIG_MULTIMODAL_ASSET_GENERATION: 'true'
CONFIG_MAX_VIDEO_CHUNK_SIZE: '1000000000' # 1 GB
114 changes: 111 additions & 3 deletions service/audio/audio.py
@@ -18,18 +18,22 @@
"""

import datetime
import io
import logging
import os
import pathlib
import re
import shutil
from typing import Optional, Sequence, Tuple

import config as ConfigService
from faster_whisper import WhisperModel
from iso639 import languages
import pandas as pd
import utils as Utils
import vertexai
from vertexai.generative_models import GenerativeModel, Part
import whisper
from faster_whisper import WhisperModel
from iso639 import languages


def combine_audio_files(output_path: str, audio_files: Sequence[str]):
@@ -229,18 +233,122 @@ def split_audio(
def transcribe_audio(
output_dir: str,
audio_file_path: str,
transcription_service: Utils.TranscriptionService,
gcs_folder: str,
gcs_bucket_name: str,
) -> Tuple[pd.DataFrame, str, float]:
"""Transcribes an audio file and returns the transcription.
Args:
output_dir: Directory where the transcription will be saved.
audio_file_path: Path to the audio file that will be transcribed.
transcription_service: The service to use for transcription.
gcs_folder: The GCS folder to use.
gcs_bucket_name: The GCS bucket to use.
Returns:
A tuple of the transcription dataframe, the detected video language and the language probability.
"""
match transcription_service:
case Utils.TranscriptionService.GEMINI:
return _transcribe_gemini(
output_dir, audio_file_path, gcs_folder, gcs_bucket_name
)
case Utils.TranscriptionService.WHISPER | _:
return _transcribe_whisper(output_dir, audio_file_path)
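
# Note: Utils.TranscriptionService is not shown in this diff. From its usage
# here and in extractor.py, it is assumed to be an enum along these lines; the
# actual definition lives in service/utils.py and may differ:
#
#   class TranscriptionService(enum.Enum):
#     WHISPER = 'whisper'
#     GEMINI = 'gemini'
#     NONE = 'none'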


def _transcribe_gemini(
output_dir: str,
audio_file_path: str,
gcs_folder: str,
gcs_bucket_name: str,
) -> Tuple[pd.DataFrame, str, float]:
"""Transcribes audio using Gemini."""
transcription_dataframe = pd.DataFrame()
video_language = ConfigService.DEFAULT_VIDEO_LANGUAGE
language_probability = 0.0

vertexai.init(
project=ConfigService.GCP_PROJECT_ID,
location=ConfigService.GCP_LOCATION,
)
transcription_model = (
GenerativeModel(ConfigService.CONFIG_TRANSCRIPTION_MODEL)
)
audio_file_gcs_uri = f'gs://{gcs_bucket_name}/{gcs_folder}' + (
f'/{ConfigService.OUTPUT_ANALYSIS_CHUNKS_DIR}'
if ConfigService.OUTPUT_ANALYSIS_CHUNKS_DIR in audio_file_path else ''
) + audio_file_path.replace(output_dir, '')
try:
response = transcription_model.generate_content(
[
Part.from_uri(audio_file_gcs_uri, mime_type='audio/wav'),
ConfigService.TRANSCRIBE_AUDIO_PROMPT,
],
generation_config=ConfigService.TRANSCRIBE_AUDIO_CONFIG,
safety_settings=ConfigService.CONFIG_DEFAULT_SAFETY_CONFIG,
)
if (
response.candidates and response.candidates[0].content.parts
and response.candidates[0].content.parts[0].text
):
text = response.candidates[0].content.parts[0].text
result = (
re.search(ConfigService.TRANSCRIBE_AUDIO_PATTERN, text, re.DOTALL)
)
logging.info('TRANSCRIPTION - %s', text)
video_language = result.group(1)
language_probability = result.group(2)
transcription_dataframe = (
pd.read_csv(io.StringIO(result.group(3)), usecols=[
0, 1, 2
]).dropna(axis=1, how='all').rename(
columns={
'Start': 'start_s',
'End': 'end_s',
'Transcription': 'transcript',
}
).assign(
audio_segment_id=lambda df: range(1,
len(df) + 1),
start_s=lambda df: df['start_s'].
apply(Utils.timestring_to_seconds),
end_s=lambda df: df['end_s'].apply(Utils.timestring_to_seconds),
duration_s=lambda df: df['end_s'] - df['start_s'],
)
)
subtitles_output_path = audio_file_path.replace(
'wav', ConfigService.OUTPUT_SUBTITLES_TYPE
)
with open(subtitles_output_path, 'w', encoding='utf8') as f:
f.write(result.group(4))
logging.info(
'TRANSCRIPTION - transcript for %s written successfully!',
audio_file_path,
)
else:
logging.warning(
'Could not transcribe audio! Returning empty transcription...'
)
# Execution should continue regardless of the underlying exception
# pylint: disable=broad-exception-caught
except Exception:
logging.exception(
'Encountered error during transcription! '
'Returning empty transcription...'
)

return transcription_dataframe, video_language, float(language_probability)
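
# Note: Utils.timestring_to_seconds (used above) is not part of this diff. It
# is assumed to convert the 'mm:ss.SSS' timestamps requested by the prompt
# into float seconds, roughly as sketched below; the real helper in
# service/utils.py may differ:
#
#   def timestring_to_seconds(timestring: str) -> float:
#     minutes, seconds = timestring.split(':')
#     return int(minutes) * 60 + float(seconds)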


def _transcribe_whisper(
output_dir: str,
audio_file_path: str,
) -> Tuple[pd.DataFrame, str, float]:
"""Transcribes audio using Whisper."""
model = WhisperModel(
ConfigService.CONFIG_WHISPER_MODEL,
ConfigService.CONFIG_TRANSCRIPTION_MODEL,
device=ConfigService.DEVICE,
compute_type='int8',
)
38 changes: 34 additions & 4 deletions service/config/config.py
@@ -27,12 +27,17 @@
GCP_LOCATION = os.environ.get('GCP_LOCATION', 'us-central1')
CONFIG_TEXT_MODEL = os.environ.get('CONFIG_TEXT_MODEL', 'gemini-1.5-flash')
CONFIG_VISION_MODEL = os.environ.get('CONFIG_VISION_MODEL', 'gemini-1.5-flash')
CONFIG_WHISPER_MODEL = os.environ.get('CONFIG_WHISPER_MODEL', 'small')
CONFIG_TRANSCRIPTION_SERVICE = os.environ.get(
'CONFIG_TRANSCRIPTION_SERVICE', 'whisper'
)
CONFIG_TRANSCRIPTION_MODEL = os.environ.get(
'CONFIG_TRANSCRIPTION_MODEL', 'small'
)
CONFIG_ANNOTATIONS_CONFIDENCE_THRESHOLD = float(
os.environ.get('CONFIG_ANNOTATIONS_CONFIDENCE_THRESHOLD', '0.7')
)
CONFIG_MULTIMODAL_ASSET_GENERATION = os.environ.get(
'CONFIG_MULTIMODAL_ASSET_GENERATION', 'false'
'CONFIG_MULTIMODAL_ASSET_GENERATION', 'true'
) == 'true'
CONFIG_MAX_VIDEO_CHUNK_SIZE = float(
os.environ.get(
@@ -105,7 +110,6 @@
'max_output_tokens': 2048,
'temperature': 0.2,
'top_p': 1,
'top_k': 16,
}

# pylint: disable=line-too-long
@@ -146,7 +150,33 @@
'max_output_tokens': 2048,
'temperature': 0.2,
'top_p': 1,
'top_k': 32,
}

DEFAULT_VIDEO_LANGUAGE = 'English'

TRANSCRIBE_AUDIO_PROMPT = """Transcribe the provided audio file, paying close attention to speaker changes and pauses in speech.
Output the following, in this order:
1. **Language:** Specify the language of the audio (e.g., "Language: English")
2. **Confidence:** Specify the confidence score of the transcription (e.g., "Confidence: 0.95")
3. **Transcription CSV:** Output the transcription in CSV (Comma-Separated Values) format (e.g. ```csv<output>```) with these columns:
* **Start:** (Start timestamp for each utterance in the format "mm:ss.SSS")
* **End:** (End timestamp for each utterance in the format "mm:ss.SSS")
* **Transcription:** (The transcribed text of the utterance)
Ensure each row in the CSV corresponds to a complete sentence or a meaningful phrase. Sentences by different speakers, even if related, should not be grouped together.
**Critical Timestamping Requirements:**
* **Pause Detection:** It is absolutely essential to accurately identify and incorporate pauses in speech. If there is a period of silence between utterances, even a brief one, this MUST be reflected in the timestamps. Do not assume continuous speech.
* **No Overlapping:** Timestamps for consecutive sentences should NOT overlap. The end timestamp of one sentence should be the start timestamp of the next sentence ONLY if there is no pause between them.
4. **WebVTT Format:** Output the transcription information in WebVTT format, surrounded by backticks (e.g. ```vtt<output>```)
**Constraints:**
* **No Extra Text:** Only output the language, confidence, table, and WebVTT data, without any additional text or explanations. This includes avoiding any labels or headings before or after the transcription table and WebVTT data.
* **Valid Timestamps:** All timestamps MUST be within the actual duration of the audio. No timestamps should exceed the total length of the audio. This is absolutely critical.
* **Sequential Timestamps:** Timestamps should progress sequentially and logically from the beginning to the end of the audio.
"""
TRANSCRIBE_AUDIO_CONFIG = {
'max_output_tokens': 8192,
'temperature': 0.2,
'top_p': 1,
}
TRANSCRIBE_AUDIO_PATTERN = '.*Language: ?(.*)\n*.*Confidence: ?(.*)\n*```csv\n(.*)```\n*```vtt\n(.*)```'
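
# Illustrative only, not part of config.py: a hypothetical, hand-written model
# response in the format requested by TRANSCRIBE_AUDIO_PROMPT, showing how
# TRANSCRIBE_AUDIO_PATTERN (applied with re.DOTALL, as in audio.py) splits it
# into its four captured parts: language, confidence, CSV transcript and WebVTT.
import re

_sample_response = (
    'Language: English\n'
    'Confidence: 0.95\n'
    '```csv\n'
    'Start,End,Transcription\n'
    '00:00.000,00:02.500,Hello and welcome.\n'
    '```\n'
    '```vtt\n'
    'WEBVTT\n\n00:00.000 --> 00:02.500\nHello and welcome.\n'
    '```'
)
_match = re.search(TRANSCRIBE_AUDIO_PATTERN, _sample_response, re.DOTALL)
_language, _confidence, _csv_block, _vtt_block = _match.groups()
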
31 changes: 27 additions & 4 deletions service/extractor/extractor.py
@@ -168,16 +168,20 @@ def _process_audio(
input_audio_file_path: str,
gcs_folder: str,
gcs_bucket_name: str,
analyse_audio: bool,
transcription_service: Utils.TranscriptionService,
) -> pd.DataFrame:
transcription_dataframe = pd.DataFrame()

if input_audio_file_path and analyse_audio:
if (
input_audio_file_path
and transcription_service != Utils.TranscriptionService.NONE
):
transcription_dataframe = _process_video_with_audio(
output_dir,
input_audio_file_path,
gcs_folder,
gcs_bucket_name,
transcription_service,
)
else:
_process_video_without_audio(output_dir, gcs_folder, gcs_bucket_name)
@@ -190,6 +194,7 @@ def _process_video_with_audio(
input_audio_file_path: str,
gcs_folder: str,
gcs_bucket_name: str,
transcription_service: Utils.TranscriptionService,
) -> pd.DataFrame:
audio_chunks = _get_audio_chunks(
output_dir=output_dir,
@@ -215,6 +220,9 @@ def _process_video_with_audio(
output_dir=(audio_output_dir if size > 1 else output_dir),
index=index + 1,
audio_file_path=audio_file_path,
transcription_service=transcription_service,
gcs_folder=gcs_folder,
gcs_bucket_name=gcs_bucket_name,
): index
for index, audio_file_path in enumerate(audio_chunks)
}
@@ -355,18 +363,26 @@ def _analyse_audio(
output_dir: str,
index: int,
audio_file_path: str,
transcription_service: Utils.TranscriptionService,
gcs_folder: str,
gcs_bucket_name: str,
) -> Tuple[str, str, str, str, float]:
"""Runs audio analysis in parallel."""
vocals_file_path = None
music_file_path = None
transcription_dataframe = None

with concurrent.futures.ProcessPoolExecutor(max_workers=2) as process_executor:
with (
concurrent.futures.ProcessPoolExecutor(max_workers=2) as process_executor
):
futures_dict = {
process_executor.submit(
AudioService.transcribe_audio,
output_dir=output_dir,
audio_file_path=audio_file_path,
transcription_service=transcription_service,
gcs_folder=gcs_folder,
gcs_bucket_name=gcs_bucket_name,
): 'transcribe_audio',
process_executor.submit(
AudioService.split_audio,
@@ -513,6 +529,12 @@ def extract(self):
bucket_name=self.gcs_bucket_name,
)
input_audio_file_path = AudioService.extract_audio(input_video_file_path)
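# The extracted audio is uploaded to GCS right away here, presumably so that
# the Gemini transcription path can reference it via a gs:// URI (see
# Part.from_uri in _transcribe_gemini in audio.py).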
if input_audio_file_path:
StorageService.upload_gcs_dir(
source_directory=tmp_dir,
bucket_name=self.gcs_bucket_name,
target_dir=self.video_file.gcs_folder,
)
annotation_results = None
transcription_dataframe = pd.DataFrame()

@@ -524,7 +546,8 @@
input_audio_file_path=input_audio_file_path,
gcs_folder=self.video_file.gcs_folder,
gcs_bucket_name=self.gcs_bucket_name,
analyse_audio=self.video_file.video_metadata.analyse_audio,
transcription_service=self.video_file.video_metadata.
transcription_service,
): 'process_audio',
process_executor.submit(
_process_video,
4 changes: 2 additions & 2 deletions service/requirements.txt
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

faster-whisper==1.0.2
faster-whisper==1.0.3
ffmpeg==1.4
ffprobe==0.5
functions-framework==3.7.0
@@ -24,7 +24,7 @@ google-cloud-storage==2.16.0
google-cloud-videointelligence==2.13.3
iso-639==0.4.5
numpy==1.26.4
openai-whisper==20231117
openai-whisper==20240930
pandas==1.5.3
protobuf==3.19.6
spleeter==2.4.0
2 changes: 1 addition & 1 deletion service/update_config.sh
@@ -15,4 +15,4 @@

# Update runtime environment variables without having to fully redeploy the cloud function.
gcloud functions deploy vigenair \
--update-env-vars CONFIG_WHISPER_MODEL=large,CONFIG_TEXT_MODEL=gemini-1.5-pro,CONFIG_MULTIMODAL_ASSET_GENERATION='true'
--update-env-vars CONFIG_TRANSCRIPTION_SERVICE=gemini,CONFIG_TRANSCRIPTION_MODEL=gemini-1.5-flash