experimental[minor]: Create Closed Captioning Chain for .mp4 videos (#14059)

Description: Video imagery to text (Closed Captioning)
This pull request introduces the VideoCaptioningChain, a tool for
automated video captioning. It processes audio and video to generate
subtitles and closed captions, merging them into a single SRT output.

Issue: langchain-ai/langchain#11770
Dependencies: opencv-python, ffmpeg-python, assemblyai, transformers,
pillow, torch, openai
Tag maintainer:
@baskaryan
@hwchase17


Hello!

We are a group of students from the University of Toronto
(@LunarECL, @TomSadan, @nicoledroi1, @A2113S) who want to make a
contribution to the LangChain community! We ran make format, make
lint and make test locally before submitting the PR. To our knowledge,
our changes do not introduce any new errors.

Thank you for taking the time to review our PR!

---------

Co-authored-by: Bagatur <[email protected]>
2 people authored and Je-Cp committed Apr 2, 2024
1 parent 8cc2d54 commit 735db8c
Showing 13 changed files with 1,343 additions and 9 deletions.
174 changes: 174 additions & 0 deletions cookbook/video_captioning/video_captioning.ipynb
@@ -0,0 +1,174 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Video Captioning\n",
"This notebook shows how to use VideoCaptioningChain, which uses LangChain's ImageCaptionLoader and AssemblyAI to produce .srt files.\n",
"\n",
"This system autogenerates both subtitles and closed captions from a video URL."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing Dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# !pip install ffmpeg-python\n",
"# !pip install assemblyai\n",
"# !pip install opencv-python\n",
"# !pip install torch\n",
"# !pip install pillow\n",
"# !pip install transformers\n",
"# !pip install langchain\n",
"# !pip install langchain-experimental"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-30T03:39:14.078232Z",
"start_time": "2023-11-30T03:39:12.534410Z"
}
},
"outputs": [],
"source": [
"import getpass\n",
"\n",
"from langchain_experimental.video_captioning import VideoCaptioningChain\n",
"from langchain.chat_models.openai import ChatOpenAI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up API Keys"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2023-11-30T03:39:17.423806Z",
"start_time": "2023-11-30T03:39:17.417945Z"
}
},
"outputs": [],
"source": [
"OPENAI_API_KEY = getpass.getpass(\"OpenAI API Key:\")\n",
"\n",
"ASSEMBLYAI_API_KEY = getpass.getpass(\"AssemblyAI API Key:\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Required parameters:**\n",
"\n",
"* llm: The language model this chain will use to get suggestions on how to refine the closed-captions\n",
"* assemblyai_key: The API key for AssemblyAI, used to generate the subtitles\n",
"\n",
"**Optional Parameters:**\n",
"\n",
"* verbose (Default: True): Sets verbose mode for downstream chain calls\n",
"* use_logging (Default: True): Log the chain's processes in run manager\n",
"* frame_skip (Default: -1): Number of video frames to skip between sampled frames. Increasing it results in faster execution but less accurate results. If left at the default, the frame skip is calculated automatically from the framerate. Set this to 0 to sample all frames\n",
"* image_delta_threshold (Default: 3000000): Set the sensitivity for what the image processor considers a change in scenery in the video, used to delimit closed captions. Higher = less sensitive\n",
"* closed_caption_char_limit (Default: 20): Sets the character limit on closed captions\n",
"* closed_caption_similarity_threshold (Default: 80): Sets how similar, as a percentage, two closed caption models must be in order to be clustered into one longer closed caption\n",
"* use_unclustered_video_models (Default: False): If true, closed captions that could not be clustered will be included. This may produce erratic closed captions, such as very short-lived or fast-changing captions. Enabling this is experimental and not recommended"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# https://ia804703.us.archive.org/27/items/uh-oh-here-we-go-again/Uh-Oh%2C%20Here%20we%20go%20again.mp4\n",
"# https://ia601200.us.archive.org/9/items/f58703d4-61e6-4f8f-8c08-b42c7e16f7cb/f58703d4-61e6-4f8f-8c08-b42c7e16f7cb.mp4\n",
"\n",
"chain = VideoCaptioningChain(\n",
" llm=ChatOpenAI(model=\"gpt-4\", max_tokens=4000, openai_api_key=OPENAI_API_KEY),\n",
" assemblyai_key=ASSEMBLYAI_API_KEY,\n",
")\n",
"\n",
"srt_content = chain.run(\n",
" video_file_path=\"https://ia601200.us.archive.org/9/items/f58703d4-61e6-4f8f-8c08-b42c7e16f7cb/f58703d4-61e6-4f8f-8c08-b42c7e16f7cb.mp4\"\n",
")\n",
"\n",
"print(srt_content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing output to .srt file"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"with open(\"output.srt\", \"w\", encoding=\"utf-8\") as file:\n",
"    file.write(srt_content)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "myenv",
"language": "python",
"name": "myenv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
},
"vscode": {
"interpreter": {
"hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
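The `frame_skip` and `image_delta_threshold` parameters documented in the notebook suggest a frame-differencing approach to finding scene changes. The diff for the image service is not shown on this page, so the following is only an illustrative sketch and not the chain's actual implementation. The names `frame_delta` and `scene_starts`, the representation of frames as nested lists of grayscale values, and the choice to compare each sampled frame with the previously sampled one are all assumptions made for the example.

```python
from typing import List, Sequence

# Hypothetical frame type: rows of grayscale pixel values (0-255).
Frame = Sequence[Sequence[int]]


def frame_delta(a: Frame, b: Frame) -> int:
    """Sum of absolute per-pixel differences between two equally sized frames."""
    return sum(
        abs(pa - pb)
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def scene_starts(frames: List[Frame], threshold: int, frame_skip: int = 0) -> List[int]:
    """Return the indices of sampled frames that open a new scene.

    Assumption: frame_skip=0 samples every frame, frame_skip=n skips n frames
    between samples. A new scene starts when the delta to the previously
    sampled frame exceeds the threshold, so a higher threshold is less sensitive.
    """
    if not frames:
        return []
    step = frame_skip + 1
    starts = [0]  # the first frame always opens a scene
    previous = frames[0]
    for i in range(step, len(frames), step):
        if frame_delta(previous, frames[i]) > threshold:
            starts.append(i)
        previous = frames[i]
    return starts


# Toy 2x2 frames: two dark frames, two bright frames, one dark frame.
dark = [[0, 0], [0, 0]]
bright = [[255, 255], [255, 255]]
print(scene_starts([dark, dark, bright, bright, dark], threshold=100))  # -> [0, 2, 4]
```

With real video frames the per-frame delta is far larger than in this toy case, which is consistent with a default threshold in the millions; raising the threshold merges more frames into one scene, and raising `frame_skip` reduces the number of comparisons at the cost of coarser scene boundaries.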
@@ -0,0 +1,3 @@
from langchain_experimental.video_captioning.base import VideoCaptioningChain

__all__ = ["VideoCaptioningChain"]
148 changes: 148 additions & 0 deletions libs/experimental/langchain_experimental/video_captioning/base.py
@@ -0,0 +1,148 @@
from typing import Any, Dict, List, Optional

from langchain.chains.base import Chain
from langchain_core.callbacks import CallbackManagerForChainRun
from langchain_core.language_models import BaseLanguageModel
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import Extra

from langchain_experimental.video_captioning.services.audio_service import (
    AudioProcessor,
)
from langchain_experimental.video_captioning.services.caption_service import (
    CaptionProcessor,
)
from langchain_experimental.video_captioning.services.combine_service import (
    CombineProcessor,
)
from langchain_experimental.video_captioning.services.image_service import (
    ImageProcessor,
)
from langchain_experimental.video_captioning.services.srt_service import SRTProcessor


class VideoCaptioningChain(Chain):
    """
    Video Captioning Chain.
    """

    llm: BaseLanguageModel
    assemblyai_key: str
    prompt: Optional[PromptTemplate] = None
    verbose: bool = True
    use_logging: Optional[bool] = True
    frame_skip: int = -1
    image_delta_threshold: int = 3000000
    closed_caption_char_limit: int = 20
    closed_caption_similarity_threshold: int = 80
    use_unclustered_video_models: bool = False

    class Config:
        extra = Extra.allow
        arbitrary_types_allowed = True

    @property
    def input_keys(self) -> List[str]:
        return ["video_file_path"]

    @property
    def output_keys(self) -> List[str]:
        return ["srt"]

    def _call(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, str]:
        if "video_file_path" not in inputs:
            raise ValueError(
                "Missing 'video_file_path' in inputs for video captioning."
            )
        video_file_path = inputs["video_file_path"]
        nl = "\n"

        run_manager.on_text(
            "Loading processors..." + nl
        ) if self.use_logging and run_manager else None

        audio_processor = AudioProcessor(api_key=self.assemblyai_key)
        image_processor = ImageProcessor(
            frame_skip=self.frame_skip, threshold=self.image_delta_threshold
        )
        caption_processor = CaptionProcessor(
            llm=self.llm,
            verbose=self.verbose,
            similarity_threshold=self.closed_caption_similarity_threshold,
            use_unclustered_models=self.use_unclustered_video_models,
        )
        combine_processor = CombineProcessor(
            llm=self.llm,
            verbose=self.verbose,
            char_limit=self.closed_caption_char_limit,
        )
        srt_processor = SRTProcessor()

        run_manager.on_text(
            "Finished loading processors."
            + nl
            + "Generating subtitles from audio..."
            + nl
        ) if self.use_logging and run_manager else None

        # Get models for speech to text subtitles
        audio_models = audio_processor.process(video_file_path, run_manager)
        run_manager.on_text(
            "Finished generating subtitles:"
            + nl
            + f"{nl.join(str(obj) for obj in audio_models)}"
            + nl
            + "Generating closed captions from video..."
            + nl
        ) if self.use_logging and run_manager else None

        # Get models for image frame description
        image_models = image_processor.process(video_file_path, run_manager)
        run_manager.on_text(
            "Finished generating closed captions:"
            + nl
            + f"{nl.join(str(obj) for obj in image_models)}"
            + nl
            + "Refining closed captions..."
            + nl
        ) if self.use_logging and run_manager else None

        # Get models for video event closed-captions
        video_models = caption_processor.process(image_models, run_manager)
        run_manager.on_text(
            "Finished refining closed captions:"
            + nl
            + f"{nl.join(str(obj) for obj in video_models)}"
            + nl
            + "Combining subtitles with closed captions..."
            + nl
        ) if self.use_logging and run_manager else None

        # Combine the subtitle models with the closed-caption models
        caption_models = combine_processor.process(
            video_models, audio_models, run_manager
        )
        run_manager.on_text(
            "Finished combining subtitles with closed captions:"
            + nl
            + f"{nl.join(str(obj) for obj in caption_models)}"
            + nl
            + "Generating SRT file..."
            + nl
        ) if self.use_logging and run_manager else None

        # Convert the combined model to SRT format
        srt_content = srt_processor.process(caption_models)
        run_manager.on_text(
            "Finished generating srt file." + nl
        ) if self.use_logging and run_manager else None

        return {"srt": srt_content}

    @property
    def _chain_type(self) -> str:
        return "video_captioning_chain"
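
The last step of `_call` converts the combined caption models into SRT text. The `SRTProcessor` source is not part of the diff shown here, so the sketch below only illustrates the SubRip layout itself (a 1-based index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the caption text, then a blank line between entries). The `Caption` tuple of `(start_ms, end_ms, text)` and the helper names `format_timestamp` and `to_srt` are hypothetical and are not taken from the PR.

```python
from typing import Iterable, Tuple

# Hypothetical caption representation: (start_ms, end_ms, text).
Caption = Tuple[int, int, str]


def format_timestamp(ms: int) -> str:
    """Render a millisecond offset as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(captions: Iterable[Caption]) -> str:
    """Join captions into SRT text, numbering entries from 1."""
    blocks = []
    for index, (start, end, text) in enumerate(captions, start=1):
        time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # A blank line separates consecutive entries.
    return "\n".join(blocks)


print(to_srt([(0, 1500, "Hello"), (1500, 3000, "[Door closes]")]))
```

A string in this layout is what the notebook's final cell would write to `output.srt`.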