This project extracts audio from video files (`.mp4`, `.avi`, `.mkv`), applies Voice Activity Detection (VAD) to filter out non-speech segments, and generates subtitles using the Whisper model.
You can either clone the source code and run the tool from there, or use the pre-built Docker image published here: https://hub.docker.com/r/hungdoan/video-transciption
- Extract audio from video files
- Apply VAD to filter out non-speech segments
- Transcribe audio to text using Whisper
- Generate subtitles in SRT format
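The following is a minimal sketch of how those stages could fit together, using `ffmpeg`, the `webrtcvad` package, and `openai-whisper`; it is not the project's actual code (the project uses PyAV for audio extraction and its own segment handling), and the file paths are placeholders.

```python
# Illustrative sketch only: extract audio, flag non-speech frames with WebRTC VAD,
# then transcribe with Whisper. Paths and parameters are placeholders.
import subprocess
import wave

import webrtcvad
import whisper

VIDEO = "input/example.mp4"   # placeholder input video
AUDIO = "example.wav"         # intermediate 16 kHz mono PCM audio

# 1. Extract 16 kHz mono 16-bit PCM audio (the only format WebRTC VAD accepts).
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", AUDIO],
    check=True,
)

# 2. Flag speech frames with WebRTC VAD (30 ms frames, aggressiveness 0-3).
vad = webrtcvad.Vad(2)
with wave.open(AUDIO, "rb") as wf:
    sample_rate = wf.getframerate()
    frame_bytes = int(sample_rate * 0.030) * 2  # 30 ms of 16-bit samples
    pcm = wf.readframes(wf.getnframes())

speech_flags = [
    vad.is_speech(pcm[i : i + frame_bytes], sample_rate)
    for i in range(0, len(pcm) - frame_bytes, frame_bytes)
]
print(f"{sum(speech_flags)} of {len(speech_flags)} frames contain speech")

# 3. Transcribe with Whisper (here on the whole file; the project transcribes
#    only the speech segments kept after the VAD step).
model = whisper.load_model("base")
result = model.transcribe(AUDIO)
for seg in result["segments"]:
    print(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}")
```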
- Python 3.7+
- Docker
- Docker Compose
- A CUDA-capable machine (for GPU inference)
- Clone the repository:

  ```
  git clone <repository-url>
  cd <repository-directory>
  ```

- Ensure Docker and Docker Compose are installed and running on your machine.
There are five model sizes, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model; actual speed may vary depending on many factors including the available hardware.
| Size   | Multilingual model | Required VRAM | Relative speed |
|--------|--------------------|---------------|----------------|
| tiny   | tiny               | ~1 GB         | ~32x           |
| base   | base               | ~1 GB         | ~16x           |
| small  | small              | ~2 GB         | ~6x            |
| medium | medium             | ~5 GB         | ~2x            |
| large  | large-v3           | ~10 GB        | 1x             |
(Source: Whisper)
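The names in the table are the standard Whisper model identifiers. As a quick sanity check (assuming the `openai-whisper` Python package is installed), you can list the identifiers your installed version knows about:

```python
# Print the model identifiers bundled with the installed openai-whisper package;
# the list should include the names from the table above (plus English-only variants).
import whisper

print(whisper.available_models())
```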
- Place your video files in the `input` directory.
- Run the script using Docker Compose:

  ```
  docker compose -p "video-transcript" --env-file .env --env-file .env.base up --build --remove-orphans
  ```
  All configurable variables are defined in the `.env` files, where:

  - `MODEL_NAME`: the model to use (see the Available models section for the list). The default is `base`.
  - `DEVICE`: either `cuda` or `cpu`; `cuda` is recommended if you have one or more GPUs with CUDA cores. The default is `cuda`.

  A sample `.env` is sketched below.
- The output SRT files will be saved in the `output` directory.
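For reference, a minimal `.env` could look like the following. This is only a sketch covering the two variables documented above; the repository's own `.env` and `.env.base` files remain the authoritative source.

```
MODEL_NAME=base
DEVICE=cuda
```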
Example:
To transcribe a video file named `example.mp4`, in the directory that contains the source code:

- Clone the source code:

  ```
  git clone <repository-url>
  ```

- Change the working directory to the source code:

  ```
  cd <repository-directory>
  ```

- Place `example.mp4` in the `input` directory:

  ```
  cp example.mp4 <repository-directory>/input/
  ```

- Run the script using Docker Compose:

  ```
  docker compose -p "video-transcript" --env-file .env --env-file .env.base up --build --remove-orphans
  ```

- The generated subtitle file `example.srt` will be saved in the `output` directory.
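For orientation, the generated `.srt` file follows the standard SubRip format: numbered cues with `start --> end` timestamps followed by the transcribed text. The lines below are made-up placeholders, not actual output:

```
1
00:00:01,000 --> 00:00:03,500
First transcribed sentence of the video.

2
00:00:04,200 --> 00:00:06,900
Second transcribed sentence.
```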
Execute this script in PowerShell:

```powershell
$input_dir = "D:/input"
$output_dir = "D:/output"
$model_cache_dir = "D:/model_caches"
$model_name = "base"
$device = "cuda"

docker pull hungdoan/video-transciption:latest
docker run --gpus=all --rm -it --env MODEL_NAME=${model_name} --env DEVICE=${device} -v ${input_dir}:/input -v ${output_dir}:/output -v ${model_cache_dir}:/root/.cache/whisper hungdoan/video-transciption:latest
```
Where:

- `MODEL_NAME`: the model to use (see the Available models section for the list). The default is `base`.
- `DEVICE`: either `cuda` or `cpu`; `cuda` is recommended if you have one or more GPUs with CUDA cores. The default is `cuda`.
- `-v ${input_dir}:/input`: mounts the input folder so the tool can scan it for input videos.
- `-v ${output_dir}:/output`: mounts the output folder where the generated files are saved.
- `-v ${model_cache_dir}:/root/.cache/whisper`: optionally mounts a model cache folder so downloaded models are kept and reused across runs.
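If no CUDA-capable GPU is available, the same command can presumably be run on the CPU by dropping `--gpus=all` and setting `DEVICE=cpu` (untested sketch; expect noticeably slower inference):

```powershell
$device = "cpu"
docker run --rm -it --env MODEL_NAME=${model_name} --env DEVICE=${device} -v ${input_dir}:/input -v ${output_dir}:/output -v ${model_cache_dir}:/root/.cache/whisper hungdoan/video-transciption:latest
```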
This project is licensed under the MIT License. See the LICENSE file for details.
- Whisper - A general-purpose speech recognition model by OpenAI
- WebRTC VAD - Python interface to the WebRTC Voice Activity Detector
- PyAV - Pythonic bindings for FFmpeg's libraries