The main idea is to search for text in videos, whether spoken or visible on screen, using Rust together with ffmpeg, Tesseract, and whisper-rs.
- Create a dump directory for the processed video.
- First, use ffmpeg to extract all keyframes of the .mp4 into the dump/frames directory.
- For each frame:
  - A timestamp is calculated from the frame number.
  - OCR (optical character recognition) is applied.
  - The indexer is updated with the recognized words and their corresponding timestamp.
- Apply ASR (automatic speech recognition) using whisper.cpp.
  - For each (word, timestamp) pair in the predicted text, the indexer is updated.
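The per-frame timestamp step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `fps` parameter is an assumption here, standing in for the frame rate the real pipeline would obtain by probing the video with ffmpeg.

```rust
/// Timestamp (in seconds) of the n-th extracted frame, assuming frames
/// were written out at a known, constant rate of `fps` frames per second.
/// In the real pipeline the rate would come from probing the video.
fn frame_timestamp(frame_index: u32, fps: f64) -> f64 {
    frame_index as f64 / fps
}

fn main() {
    // Frame 120 of a 24 fps extraction lands 5 seconds into the video.
    println!("{}", frame_timestamp(120, 24.0));
}
```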
For the indexer, two data structures were tested: a Trie and a HashMap.
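A minimal sketch of the HashMap variant (the names here are illustrative, not the crate's actual API): each word maps to the list of timestamps at which it was seen on screen or spoken.

```rust
use std::collections::HashMap;

/// Sketch of a HashMap-based indexer: word -> timestamps (in seconds).
#[derive(Default)]
struct Index {
    map: HashMap<String, Vec<f64>>,
}

impl Index {
    /// Record one occurrence of `word` at `ts` seconds.
    fn insert(&mut self, word: &str, ts: f64) {
        self.map.entry(word.to_lowercase()).or_default().push(ts);
    }

    /// All timestamps at which `word` occurred (empty slice if never).
    fn search(&self, word: &str) -> &[f64] {
        self.map
            .get(&word.to_lowercase())
            .map(|v| v.as_slice())
            .unwrap_or(&[])
    }
}

fn main() {
    let mut idx = Index::default();
    idx.insert("Hello", 1.5); // e.g. an OCR hit on a frame at 1.5 s
    idx.insert("hello", 7.0); // e.g. an ASR hit at 7.0 s
    println!("{:?}", idx.search("HELLO"));
}
```

A Trie additionally supports prefix lookups, which a plain HashMap does not, at the cost of a more involved structure.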
A Models directory should exist with the following structure:
|-- Models
| |-- cpp_whisper
| `-- traineddata
| |-- traineddata_base
| | `-- eng.traineddata
| |-- traineddata_fast
| | `-- eng.traineddata
| `-- traineddata_best
| `-- eng.traineddata
Currently, whisper-rs is used to bind to whisper.cpp. For more details, check the whisper.cpp repository.
- Currently, only .mp4 videos are supported.
- No sufficiently mature grammatical error correction (GEC) crates are available yet.
- No reliable stop-word removal crate is available.
- Only English is currently supported.
- Fix incorrect timestamps.
- Solve the indexer bottleneck issue.
- Implement image preprocessing.
- Implement key-frame filtering.