Added Gemini as an alternative to Whisper for transcription
Change-Id: I0d03569bab23f9d31480284962d72e83edb19a7f
mohabfekry committed Nov 4, 2024
1 parent 88228f9 commit 43175f1
Showing 10 changed files with 225 additions and 28 deletions.
17 changes: 13 additions & 4 deletions README.md
@@ -33,10 +33,11 @@ limitations under the License.
Update to the latest version by running `npm run update-app` after pulling the latest changes from the repository via `git pull --rebase --autostash`; you would need to redeploy the *UI* for features marked as `frontend`, and *GCP components* for features marked as `backend`.

* [October 2024]
* `frontend` + `backend`: Added Gemini as an alternative to Whisper for transcription. Read more [here](#2-video-processing-and-extraction).
  * `frontend`: Added functionality to regenerate Demand Gen text assets. Read more [here](#6-output-videos).
* `frontend` + `backend`: Added functionality to "fade out" audio at the end of generated videos. Read more [here](#42-user-controls-for-video-rendering).
* [September 2024]
* `backend`: You can now process any video of any length or size - even beyond the Google Cloud Video AI API [limits](https://cloud.google.com/video-intelligence/quotas) of 50 GB size and up to 3h video length.
* `backend`: You can now process any video of any length or size - even beyond the Google Cloud Video Intelligence API [limits](https://cloud.google.com/video-intelligence/quotas) of 50 GB size and up to 3h video length.
* [August 2024]
* Updated the [pricing](#pricing-and-quotas) section and Cloud calculator example to use the new (cheaper) pricing for `Gemini 1.5 Flash`.
* `frontend`: You can now manually move the Smart Framing crop area to better capture the point of interest. Read more [here](#3-object-tracking-and-smart-framing).
@@ -172,9 +173,17 @@ Users upload or select videos they have previously analysed via the UI's `Video
New uploads into GCS trigger the Extractor service Cloud Function, which extracts all video information and stores the results on GCS (`input.vtt`, `analysis.json` and `data.json`).

* First, background music and voice-over (if available) are separated via the [spleeter](https://github.com/deezer/spleeter) library, and the voice-over is transcribed.
* Transcription is done via the [faster-whisper](https://github.com/SYSTRAN/faster-whisper) library, which uses OpenAI's Whisper model under the hood. By default, Vigenair uses the [small](https://github.com/openai/whisper#available-models-and-languages) multilingual model which provides the optimal quality-performance balance. If you find that it is not working well for your target language you may change the model used by the Cloud Function by setting the `CONFIG_WHISPER_MODEL` variable in the [update_config.sh](service/update_config.sh) script, which can be used to update the function's runtime variables. The transcription output is stored in an `input.vtt` file, along with a `language.txt` file containing the video's primary language, in the same folder as the input video.
* Video analysis is done via the Cloud [Video AI API](https://cloud.google.com/video-intelligence), where visual shots, detected objects - with tracking, labels, people and faces, and recognised logos and any on-screen text within the input video are extracted. The output is stored in an `analysis.json` file in the same folder as the input video.
* Finally, *coherent* audio/video segments are created using the transcription and video intelligence outputs and then cut into individual video files and stored on GCS in an `av_segments_cuts` subfolder under the root video folder. These cuts are then annotated via multimodal models on Vertex AI, which provide a description and a set of associated keywords / topics per segment. The fully annotated segments (including all information from the Video AI API) are then compiled into a `data.json` file that is stored in the same folder as the input video.
* Transcription can be done either via Gemini or via the [faster-whisper](https://github.com/SYSTRAN/faster-whisper) library, which uses OpenAI's Whisper model under the hood. This is controlled by the following configuration properties:

| Component | Configuration Property | Supported Values | Default Value |
| --- | --- | --- | --- |
| `frontend` | `CONFIG.defaultTranscriptionService` property in [config.ts](ui/src/config.ts) | `whisper` or `gemini` | `whisper` |
| `backend` | `CONFIG_TRANSCRIPTION_SERVICE` environment variable | `whisper` or `gemini` | `whisper` |
| `backend` | `CONFIG_TRANSCRIPTION_MODEL` environment variable | Supported models for [whisper](https://github.com/openai/whisper#available-models-and-languages) and [gemini](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models) | `small` (whisper) |

* Vigenair defaults to Whisper for its transcription quality. The [small](https://github.com/openai/whisper#available-models-and-languages) multilingual model is used by default, as it provides the best quality-performance balance. If it does not work well for your target language, you can change the model used by the Cloud Function by setting the `CONFIG_TRANSCRIPTION_MODEL` variable in the [update_config.sh](service/update_config.sh) script, which updates the function's runtime variables. You can also switch the transcription service to Gemini as shown in the table above; when doing so, make sure `CONFIG_TRANSCRIPTION_MODEL` is set to `gemini-1.5-flash` or another supported [Gemini model](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models) on Vertex AI (see the sketch after this list). In either case, the transcription output is stored in an `input.vtt` file, along with a `language.txt` file containing the video's primary language, in the same folder as the input video.
* Video analysis is done via the Cloud [Video Intelligence API](https://cloud.google.com/video-intelligence), where visual shots, detected objects - with tracking, labels, people and faces, and recognised logos and any on-screen text within the input video are extracted. The output is stored in an `analysis.json` file in the same folder as the input video.
* Finally, *coherent* audio/video segments are created using the transcription and video intelligence outputs and then cut into individual video files and stored on GCS in an `av_segments_cuts` subfolder under the root video folder. These cuts are then annotated via multimodal models on Vertex AI, which provide a description and a set of associated keywords / topics per segment. The fully annotated segments (including all information from the Video Intelligence API) are then compiled into a `data.json` file that is stored in the same folder as the input video.
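
For illustration, the following minimal sketch (not part of Vigenair's codebase) shows how a backend could resolve the two settings above before transcribing. The helper name `resolve_transcription_backend` and its validation logic are assumptions for this example only; the actual dispatch between Whisper and Gemini lives in `service/audio/audio.py`.

```python
import os

# Hypothetical helper, for illustration only: reads the backend environment
# variables from the table above and sanity-checks that the configured model
# matches the configured service.
def resolve_transcription_backend() -> tuple[str, str]:
  service = os.environ.get('CONFIG_TRANSCRIPTION_SERVICE', 'whisper')
  model = os.environ.get('CONFIG_TRANSCRIPTION_MODEL', 'small')
  if service == 'gemini' and not model.startswith('gemini-'):
    raise ValueError(
        'Set CONFIG_TRANSCRIPTION_MODEL to a Gemini model '
        '(e.g. gemini-1.5-flash) when CONFIG_TRANSCRIPTION_SERVICE=gemini.'
    )
  return service, model
```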

#### 3. Object Tracking and Smart Framing

3 changes: 2 additions & 1 deletion service/.env.yaml
@@ -16,7 +16,8 @@ GCP_PROJECT_ID: '<gcp-project-id>'
GCP_LOCATION: '<gcp-region>'
CONFIG_TEXT_MODEL: gemini-1.5-flash
CONFIG_VISION_MODEL: gemini-1.5-flash
CONFIG_WHISPER_MODEL: small
CONFIG_TRANSCRIPTION_SERVICE: whisper
CONFIG_TRANSCRIPTION_MODEL: small
CONFIG_ANNOTATIONS_CONFIDENCE_THRESHOLD: '0.7'
CONFIG_MULTIMODAL_ASSET_GENERATION: 'true'
CONFIG_MAX_VIDEO_CHUNK_SIZE: '1000000000' # 1 GB
114 changes: 111 additions & 3 deletions service/audio/audio.py
@@ -18,18 +18,22 @@
"""

import datetime
import io
import logging
import os
import pathlib
import re
import shutil
from typing import Optional, Sequence, Tuple

import config as ConfigService
from faster_whisper import WhisperModel
from iso639 import languages
import pandas as pd
import utils as Utils
import vertexai
from vertexai.generative_models import GenerativeModel, Part
import whisper
from faster_whisper import WhisperModel
from iso639 import languages


def combine_audio_files(output_path: str, audio_files: Sequence[str]):
@@ -229,18 +233,122 @@ def split_audio(
def transcribe_audio(
output_dir: str,
audio_file_path: str,
transcription_service: Utils.TranscriptionService,
gcs_folder: str,
gcs_bucket_name: str,
) -> Tuple[pd.DataFrame, str, float]:
"""Transcribes an audio file and returns the transcription.
Args:
output_dir: Directory where the transcription will be saved.
audio_file_path: Path to the audio file that will be transcribed.
transcription_service: The service to use for transcription.
gcs_folder: The GCS folder to use.
gcs_bucket_name: The GCS bucket to use.
Returns:
A tuple of the transcription dataframe, the detected video language and the language probability.
"""
match transcription_service:
case Utils.TranscriptionService.GEMINI:
return _transcribe_gemini(
output_dir, audio_file_path, gcs_folder, gcs_bucket_name
)
case Utils.TranscriptionService.WHISPER | _:
return _transcribe_whisper(output_dir, audio_file_path)
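
# Note: Utils.TranscriptionService is not shown in this diff. From its usage
# here and in extractor.py, it is assumed to be an enum along these lines; the
# actual definition lives in service/utils.py and may differ:
#
#   class TranscriptionService(enum.Enum):
#     WHISPER = 'whisper'
#     GEMINI = 'gemini'
#     NONE = 'none'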


def _transcribe_gemini(
output_dir: str,
audio_file_path: str,
gcs_folder: str,
gcs_bucket_name: str,
) -> Tuple[pd.DataFrame, str, float]:
"""Transcribes audio using Gemini."""
transcription_dataframe = pd.DataFrame()
video_language = ConfigService.DEFAULT_VIDEO_LANGUAGE
language_probability = 0.0

vertexai.init(
project=ConfigService.GCP_PROJECT_ID,
location=ConfigService.GCP_LOCATION,
)
transcription_model = (
GenerativeModel(ConfigService.CONFIG_TRANSCRIPTION_MODEL)
)
audio_file_gcs_uri = f'gs://{gcs_bucket_name}/{gcs_folder}' + (
f'/{ConfigService.OUTPUT_ANALYSIS_CHUNKS_DIR}'
if ConfigService.OUTPUT_ANALYSIS_CHUNKS_DIR in audio_file_path else ''
) + audio_file_path.replace(output_dir, '')
try:
response = transcription_model.generate_content(
[
Part.from_uri(audio_file_gcs_uri, mime_type='audio/wav'),
ConfigService.TRANSCRIBE_AUDIO_PROMPT,
],
generation_config=ConfigService.TRANSCRIBE_AUDIO_CONFIG,
safety_settings=ConfigService.CONFIG_DEFAULT_SAFETY_CONFIG,
)
if (
response.candidates and response.candidates[0].content.parts
and response.candidates[0].content.parts[0].text
):
text = response.candidates[0].content.parts[0].text
result = (
re.search(ConfigService.TRANSCRIBE_AUDIO_PATTERN, text, re.DOTALL)
)
logging.info('TRANSCRIPTION - %s', text)
video_language = result.group(1)
language_probability = result.group(2)
transcription_dataframe = (
pd.read_csv(io.StringIO(result.group(3)), usecols=[
0, 1, 2
]).dropna(axis=1, how='all').rename(
columns={
'Start': 'start_s',
'End': 'end_s',
'Transcription': 'transcript',
}
).assign(
audio_segment_id=lambda df: range(1,
len(df) + 1),
start_s=lambda df: df['start_s'].
apply(Utils.timestring_to_seconds),
end_s=lambda df: df['end_s'].apply(Utils.timestring_to_seconds),
duration_s=lambda df: df['end_s'] - df['start_s'],
)
)
subtitles_output_path = audio_file_path.replace(
'wav', ConfigService.OUTPUT_SUBTITLES_TYPE
)
with open(subtitles_output_path, 'w', encoding='utf8') as f:
f.write(result.group(4))
logging.info(
'TRANSCRIPTION - transcript for %s written successfully!',
audio_file_path,
)
else:
logging.warning(
'Could not transcribe audio! Returning empty transcription...'
)
# Execution should continue regardless of the underlying exception
# pylint: disable=broad-exception-caught
except Exception:
logging.exception(
'Encountered error during transcription! '
'Returning empty transcription...'
)

return transcription_dataframe, video_language, float(language_probability)
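
# Note: Utils.timestring_to_seconds (used above) is not part of this diff. It
# is assumed to convert the 'mm:ss.SSS' timestamps requested by the prompt
# into float seconds, roughly as sketched below; the real helper in
# service/utils.py may differ:
#
#   def timestring_to_seconds(timestring: str) -> float:
#     minutes, seconds = timestring.split(':')
#     return int(minutes) * 60 + float(seconds)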


def _transcribe_whisper(
output_dir: str,
audio_file_path: str,
) -> Tuple[pd.DataFrame, str, float]:
"""Transcribes audio using Whisper."""
model = WhisperModel(
ConfigService.CONFIG_WHISPER_MODEL,
ConfigService.CONFIG_TRANSCRIPTION_MODEL,
device=ConfigService.DEVICE,
compute_type='int8',
)
38 changes: 34 additions & 4 deletions service/config/config.py
@@ -27,12 +27,17 @@
GCP_LOCATION = os.environ.get('GCP_LOCATION', 'us-central1')
CONFIG_TEXT_MODEL = os.environ.get('CONFIG_TEXT_MODEL', 'gemini-1.5-flash')
CONFIG_VISION_MODEL = os.environ.get('CONFIG_VISION_MODEL', 'gemini-1.5-flash')
CONFIG_WHISPER_MODEL = os.environ.get('CONFIG_WHISPER_MODEL', 'small')
CONFIG_TRANSCRIPTION_SERVICE = os.environ.get(
'CONFIG_TRANSCRIPTION_SERVICE', 'whisper'
)
CONFIG_TRANSCRIPTION_MODEL = os.environ.get(
'CONFIG_TRANSCRIPTION_MODEL', 'small'
)
CONFIG_ANNOTATIONS_CONFIDENCE_THRESHOLD = float(
os.environ.get('CONFIG_ANNOTATIONS_CONFIDENCE_THRESHOLD', '0.7')
)
CONFIG_MULTIMODAL_ASSET_GENERATION = os.environ.get(
'CONFIG_MULTIMODAL_ASSET_GENERATION', 'false'
'CONFIG_MULTIMODAL_ASSET_GENERATION', 'true'
) == 'true'
CONFIG_MAX_VIDEO_CHUNK_SIZE = float(
os.environ.get(
@@ -105,7 +110,6 @@
'max_output_tokens': 2048,
'temperature': 0.2,
'top_p': 1,
'top_k': 16,
}

# pylint: disable=line-too-long
@@ -146,7 +150,33 @@
'max_output_tokens': 2048,
'temperature': 0.2,
'top_p': 1,
'top_k': 32,
}

DEFAULT_VIDEO_LANGUAGE = 'English'

TRANSCRIBE_AUDIO_PROMPT = """Transcribe the provided audio file, paying close attention to speaker changes and pauses in speech.
Output the following, in this order:
1. **Language:** Specify the language of the audio (e.g., "Language: English")
2. **Confidence:** Specify the confidence score of the transcription (e.g., "Confidence: 0.95")
3. **Transcription CSV:** Output the transcription in CSV (Comma-Separated Values) format (e.g. ```csv<output>```) with these columns:
* **Start:** (Start timestamp for each utterance in the format "mm:ss.SSS")
* **End:** (End timestamp for each utterance in the format "mm:ss.SSS")
* **Transcription:** (The transcribed text of the utterance)
Ensure each row in the CSV corresponds to a complete sentence or a meaningful phrase. Sentences by different speakers, even if related, should not be grouped together.
**Critical Timestamping Requirements:**
* **Pause Detection:** It is absolutely essential to accurately identify and incorporate pauses in speech. If there is a period of silence between utterances, even a brief one, this MUST be reflected in the timestamps. Do not assume continuous speech.
* **No Overlapping:** Timestamps for consecutive sentences should NOT overlap. The end timestamp of one sentence should be the start timestamp of the next sentence ONLY if there is no pause between them.
4. **WebVTT Format:** Output the transcription information in WebVTT format, surrounded by backticks (e.g. ```vtt<output>```)
**Constraints:**
* **No Extra Text:** Only output the language, confidence, table, and WebVTT data, without any additional text or explanations. This includes avoiding any labels or headings before or after the transcription table and WebVTT data.
* **Valid Timestamps:** All timestamps MUST be within the actual duration of the audio. No timestamps should exceed the total length of the audio. This is absolutely critical.
* **Sequential Timestamps:** Timestamps should progress sequentially and logically from the beginning to the end of the audio.
"""
TRANSCRIBE_AUDIO_CONFIG = {
'max_output_tokens': 8192,
'temperature': 0.2,
'top_p': 1,
}
TRANSCRIBE_AUDIO_PATTERN = '.*Language: ?(.*)\n*.*Confidence: ?(.*)\n*```csv\n(.*)```\n*```vtt\n(.*)```'
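
# Illustrative only, not part of config.py: a hypothetical, hand-written model
# response in the format requested by TRANSCRIBE_AUDIO_PROMPT, showing how
# TRANSCRIBE_AUDIO_PATTERN (applied with re.DOTALL, as in audio.py) splits it
# into its four captured parts: language, confidence, CSV transcript and WebVTT.
import re

_sample_response = (
    'Language: English\n'
    'Confidence: 0.95\n'
    '```csv\n'
    'Start,End,Transcription\n'
    '00:00.000,00:02.500,Hello and welcome.\n'
    '```\n'
    '```vtt\n'
    'WEBVTT\n\n00:00.000 --> 00:02.500\nHello and welcome.\n'
    '```'
)
_match = re.search(TRANSCRIBE_AUDIO_PATTERN, _sample_response, re.DOTALL)
_language, _confidence, _csv_block, _vtt_block = _match.groups()
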
31 changes: 27 additions & 4 deletions service/extractor/extractor.py
@@ -168,16 +168,20 @@ def _process_audio(
input_audio_file_path: str,
gcs_folder: str,
gcs_bucket_name: str,
analyse_audio: bool,
transcription_service: Utils.TranscriptionService,
) -> pd.DataFrame:
transcription_dataframe = pd.DataFrame()

if input_audio_file_path and analyse_audio:
if (
input_audio_file_path
and transcription_service != Utils.TranscriptionService.NONE
):
transcription_dataframe = _process_video_with_audio(
output_dir,
input_audio_file_path,
gcs_folder,
gcs_bucket_name,
transcription_service,
)
else:
_process_video_without_audio(output_dir, gcs_folder, gcs_bucket_name)
@@ -190,6 +194,7 @@ def _process_video_with_audio(
input_audio_file_path: str,
gcs_folder: str,
gcs_bucket_name: str,
transcription_service: Utils.TranscriptionService,
) -> pd.DataFrame:
audio_chunks = _get_audio_chunks(
output_dir=output_dir,
@@ -215,6 +220,9 @@ def _process_video_with_audio(
output_dir=(audio_output_dir if size > 1 else output_dir),
index=index + 1,
audio_file_path=audio_file_path,
transcription_service=transcription_service,
gcs_folder=gcs_folder,
gcs_bucket_name=gcs_bucket_name,
): index
for index, audio_file_path in enumerate(audio_chunks)
}
@@ -355,18 +363,26 @@ def _analyse_audio(
output_dir: str,
index: int,
audio_file_path: str,
transcription_service: Utils.TranscriptionService,
gcs_folder: str,
gcs_bucket_name: str,
) -> Tuple[str, str, str, str, float]:
"""Runs audio analysis in parallel."""
vocals_file_path = None
music_file_path = None
transcription_dataframe = None

with concurrent.futures.ProcessPoolExecutor(max_workers=2) as process_executor:
with (
concurrent.futures.ProcessPoolExecutor(max_workers=2) as process_executor
):
futures_dict = {
process_executor.submit(
AudioService.transcribe_audio,
output_dir=output_dir,
audio_file_path=audio_file_path,
transcription_service=transcription_service,
gcs_folder=gcs_folder,
gcs_bucket_name=gcs_bucket_name,
): 'transcribe_audio',
process_executor.submit(
AudioService.split_audio,
@@ -513,6 +529,12 @@ def extract(self):
bucket_name=self.gcs_bucket_name,
)
input_audio_file_path = AudioService.extract_audio(input_video_file_path)
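# The extracted audio is uploaded to GCS right away here, presumably so that
# the Gemini transcription path can reference it via a gs:// URI (see
# Part.from_uri in _transcribe_gemini in audio.py).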
if input_audio_file_path:
StorageService.upload_gcs_dir(
source_directory=tmp_dir,
bucket_name=self.gcs_bucket_name,
target_dir=self.video_file.gcs_folder,
)
annotation_results = None
transcription_dataframe = pd.DataFrame()

@@ -524,7 +546,8 @@
input_audio_file_path=input_audio_file_path,
gcs_folder=self.video_file.gcs_folder,
gcs_bucket_name=self.gcs_bucket_name,
analyse_audio=self.video_file.video_metadata.analyse_audio,
transcription_service=self.video_file.video_metadata.
transcription_service,
): 'process_audio',
process_executor.submit(
_process_video,
4 changes: 2 additions & 2 deletions service/requirements.txt
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

faster-whisper==1.0.2
faster-whisper==1.0.3
ffmpeg==1.4
ffprobe==0.5
functions-framework==3.7.0
@@ -24,7 +24,7 @@ google-cloud-storage==2.16.0
google-cloud-videointelligence==2.13.3
iso-639==0.4.5
numpy==1.26.4
openai-whisper==20231117
openai-whisper==20240930
pandas==1.5.3
protobuf==3.19.6
spleeter==2.4.0
2 changes: 1 addition & 1 deletion service/update_config.sh
@@ -15,4 +15,4 @@

# Update runtime environment variables without having to fully redeploy the cloud function.
gcloud functions deploy vigenair \
--update-env-vars CONFIG_WHISPER_MODEL=large,CONFIG_TEXT_MODEL=gemini-1.5-pro,CONFIG_MULTIMODAL_ASSET_GENERATION='true'
--update-env-vars CONFIG_TRANSCRIPTION_SERVICE=gemini,CONFIG_TRANSCRIPTION_MODEL=gemini-1.5-flash