Add a standalone long-form transcription demo #48

Closed
15 changes: 10 additions & 5 deletions README.md
@@ -51,6 +51,7 @@ This repo hosts the inference code for Moonshine.
- [Examples](#examples)
- [Onnx standalone](#onnx-standalone)
- [Live Captions](#live-captions)
- [Long-form transcription](#long-form-transcription)
- [CTranslate2](#ctranslate2)
- [TODO](#todo)
- [Citation](#citation)
@@ -131,17 +132,21 @@ The latest versions of the Onnx Moonshine models are available on HuggingFace at

You can try the Moonshine models with live input from a microphone on many platforms with the [live captions demo](/moonshine/demo/README.md#demo-live-captioning-from-microphone-input).

### Long-form transcription

A common approach to "long-form" transcription is to split the speech into segments, transcribe each segment with the model, and assemble a single transcription from the results. One way to segment speech is to locate pauses. You can try the Moonshine models with this segmentation method on long-form WAV files in the [file transcription demo](/moonshine/demo/README.md#demo-standalone-long-form-file-transcription).
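
The segment, transcribe, and assemble flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the demo's implementation: `split_on_pauses` is a hypothetical energy-threshold segmenter standing in for a real voice activity detector, and `transcribe` is a placeholder callable standing in for the model.

```python
import numpy as np


def split_on_pauses(audio, rate=16000, frame_ms=20, threshold=0.01, min_pause_ms=200):
    """Hypothetical segmenter: split where frame energy stays low long enough."""
    frame = rate * frame_ms // 1000
    n_frames = len(audio) // frame
    min_pause = max(1, min_pause_ms // frame_ms)
    # RMS energy per frame; frames below the threshold count as silence.
    frames = audio[: n_frames * frame].reshape(n_frames, frame)
    voiced = np.sqrt((frames ** 2).mean(axis=1)) >= threshold

    segments, start, silent_run = [], None, 0
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause:
                # Close the segment at the first silent frame of the pause.
                segments.append((start * frame, (i - silent_run + 1) * frame))
                start, silent_run = None, 0
    if start is not None:
        # Audio ended mid-segment: drop any trailing silent frames.
        segments.append((start * frame, (n_frames - silent_run) * frame))
    return segments


def transcribe_long_form(audio, transcribe, rate=16000):
    """Transcribe each segment and join the results into one string."""
    texts = [transcribe(audio[s:e]) for s, e in split_on_pauses(audio, rate)]
    return " ".join(t.strip() for t in texts if t.strip())
```

Passing the model as a callable keeps the segmentation logic independent of any particular model API; the thresholds here are illustrative values, not tuned ones.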

### CTranslate2

The files for the CTranslate2 versions of Moonshine are available at [huggingface.co/UsefulSensors/moonshine/tree/main/ctranslate2](https://huggingface.co/UsefulSensors/moonshine/tree/main/ctranslate2), but they require [a pull request to be merged](https://github.com/OpenNMT/CTranslate2/pull/1808) before they can be used with the mainline version of the framework. Until then, you should be able to try them with [our branch](https://github.com/njeffrie/CTranslate2/tree/master), with [this example script](https://github.com/OpenNMT/CTranslate2/pull/1808#issuecomment-2439725339).

## TODO
* [x] Live transcription demo

* [x] ONNX model

* [ ] CTranslate2 support

* [ ] MLX support

* [ ] Fine-tuning code
@@ -152,12 +157,12 @@ The files for the CTranslate2 versions of Moonshine are available at [huggingfac
If you benefit from our work, please cite us:
```
@misc{jeffries2024moonshinespeechrecognitionlive,
title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
year={2024},
eprint={2410.15608},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2410.15608},
}
```
Binary file added moonshine/assets/a_tale_of_two_cities.wav
Binary file not shown.
54 changes: 53 additions & 1 deletion moonshine/demo/README.md
@@ -6,14 +6,19 @@ Moonshine ASR models.
- [Moonshine Demos](#moonshine-demos)
- [Demo: Standalone file transcription with ONNX](#demo-standalone-file-transcription-with-onnx)
- [Demo: Live captioning from microphone input](#demo-live-captioning-from-microphone-input)
- [Installation](#installation)
- [0. Setup environment](#0-setup-environment)
- [1. Clone the repo and install extra dependencies](#1-clone-the-repo-and-install-extra-dependencies)
- [Ubuntu: Install PortAudio](#ubuntu-install-portaudio)
- [Running the demo](#running-the-demo)
- [Script notes](#script-notes)
- [Speech truncation and hallucination](#speech-truncation-and-hallucination)
- [Running on a slower processor](#running-on-a-slower-processor)
- [Metrics](#metrics)
- [Demo: Standalone long-form file transcription](#demo-standalone-long-form-file-transcription)
- [Installation](#installation-1)
- [Running the demo](#running-the-demo-1)
- [Script notes](#script-notes-1)
- [Citation](#citation)


@@ -176,6 +181,53 @@ The value of `MIN_REFRESH_SECS` will be ineffective when the model inference tim

The metrics shown on program exit will vary based on the talker's speaking style. If the talker speaks with more frequent pauses, the speech segments are shorter and the mean inference time will be lower. This is a feature of the Moonshine model described in [our paper](https://arxiv.org/abs/2410.15608). When benchmarking, use the same speech, e.g., a recording of someone talking.

# Demo: Standalone long-form file transcription

The script [`file_transcription.py`](/moonshine/demo/file_transcription.py)
demonstrates "long-form" transcription using a WAV file as input to the
Moonshine ONNX model. By default, the demo loads a 1.5-minute WAV file,
`a_tale_of_two_cities.wav`.

## Installation

Follow the [same installation steps](#installation) used for the live captions demo.

## Running the demo

``` shell
python3 moonshine/moonshine/demo/file_transcription.py
```

An example run on an Ubuntu 22.04 VM on an x86 MacBook Pro with the Moonshine
base ONNX model:

```console
(env_moonshine_demo) parallels@ubuntu-linux-22-04-02-desktop:~$ python3 moonshine/moonshine/demo/file_transcription.py

It was the best of times, it was the worst of times. It was the age of wisdom, it was the age of foolishness. It was the epoch of belief, it was the epoch of incredulity. It was the season of light, it was the season of darkness. It was the spring of hope, it was the winter of despair. We had everything before us, we had nothing before us. We were all going direct to heaven, we were all going direct the other way. In short, the period was so far like the present period that some of its noisiest authorities insisted on its being received for good or for evil in the superlative degree of comparison only. There were a king with a large jaw and a queen with a plain face on the throne of England. There were a king with a large jaw and a queen with a fair face on the throne of France. In both countries it was clearer than crystal to the lords of the state preserves of loaves and fishes that things in general were settled forever. It was the year of our Lord 1775.

model realtime factor: 10.31x

(env_moonshine_demo) parallels@ubuntu-linux-22-04-02-desktop:~$
```
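
The realtime factor printed at the end of the run is computed in the script as the duration of the transcribed speech segments divided by the wall-clock time spent in inference. A minimal sketch of that calculation follows; the helper name `realtime_factor` is hypothetical and does not appear in the script.

```python
def realtime_factor(num_samples, elapsed_secs, sample_rate=16000):
    """Seconds of audio processed per second of wall-clock time."""
    if elapsed_secs <= 0:
        raise ValueError("elapsed_secs must be positive")
    return (num_samples / sample_rate) / elapsed_secs


# Example: 90 s of 16 kHz audio transcribed in 9 s of wall-clock time.
print(f"{realtime_factor(90 * 16000, 9.0):.2f}x")  # prints "10.00x"
```

A value above 1 means the model transcribes faster than the audio plays back.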

You can transcribe other WAV files with the `--wav_path` command-line argument. The script requires a single-channel, 16 kHz, 16-bit WAV file.
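
The script asserts that its input WAV file is single-channel, 16 kHz, and 16-bit. To check a file before passing it to the demo, something like the following should work. It is a sketch using only Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of the demo.

```python
import wave


def check_wav_format(path):
    """Return (ok, description) for the format the demo script asserts on."""
    with wave.open(path, "rb") as f:
        channels, rate, width = f.getnchannels(), f.getframerate(), f.getsampwidth()
    # The demo expects 1 channel, 16000 Hz, and 2 bytes (16 bits) per sample.
    ok = channels == 1 and rate == 16000 and width == 2
    return ok, f"{channels} channel(s), {rate} Hz, {8 * width}-bit"
```

Files in any other format need to be converted with an audio tool of your choice before running the demo.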

## Script notes

This demo script uses the
[`silero-vad`](https://github.com/snakers4/silero-vad) voice activity detector
to segment the speech based on talker pauses. The parameters used in our script
are the same as those used in faster-whisper's
[implementation](https://github.com/SYSTRAN/faster-whisper/blob/814472fdbf7faf5d77d65cdb81b1528c0dead02a/faster_whisper/vad.py#L14)
for silero-vad. We validated these parameters for the Moonshine base model by
measuring word error rate (WER) on several long-form datasets and saw WER
values similar to those of the OpenAI Whisper base.en and faster-whisper
base.en models.
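
For readers unfamiliar with WER, it is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below illustrates the metric only; it is not the evaluation code used for the validation mentioned here, and it assumes a very simple normalization (lowercasing and whitespace splitting), whereas real evaluations typically normalize punctuation and spelling as well.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far
    # (before the current one) and the first j hyp words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    curr[j - 1] + 1,  # insertion of a hypothesis word
                    prev[j - 1] + (r != h),  # substitution, or match at no cost
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "it was the best of times" with the hypothesis "it was best of time" gives one deletion and one substitution over six reference words.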

For this demo, we adopt the simple strategy of concatenating the predicted
texts. Other published methods exist, such as overlap and common sequence
matching, so we see room for improvement on our demo method. For instance,
other methods may generate more accurate transcriptions for talkers who rarely
pause when speaking for extended periods.
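
To illustrate the idea behind overlap-based assembly, the sketch below merges two transcripts whose audio chunks were cut with some overlap, dropping the words repeated at the seam. It is a simplified variant that requires an exact word match between the end of one transcript and the start of the next; published methods are more tolerant of recognition differences. The demo script does not produce overlapping chunks, and the function name and `max_overlap_words` parameter are hypothetical.

```python
def merge_overlapping(prev_text, next_text, max_overlap_words=20):
    """Join two transcripts, removing a word sequence repeated at the seam."""
    prev_words, next_words = prev_text.split(), next_text.split()
    limit = min(max_overlap_words, len(prev_words), len(next_words))
    # Try the longest candidate overlap first, then shrink.
    for size in range(limit, 0, -1):
        if prev_words[-size:] == next_words[:size]:
            return " ".join(prev_words + next_words[size:])
    # No shared words at the seam: fall back to plain concatenation.
    return " ".join(prev_words + next_words)
```

With no overlap between chunks, this reduces to the concatenation strategy the demo uses.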

# Citation

88 changes: 88 additions & 0 deletions moonshine/demo/file_transcription.py
@@ -0,0 +1,88 @@
"""WAV file long-form transcription with Moonshine ONNX models."""

import argparse
import os
import sys
import time
import wave

import numpy as np
import tokenizers

from silero_vad import get_speech_timestamps, load_silero_vad

MOONSHINE_DEMO_DIR = os.path.dirname(__file__)
sys.path.append(os.path.join(MOONSHINE_DEMO_DIR, ".."))

from onnx_model import MoonshineOnnxModel


def main(model_name, wav_path):
    model = MoonshineOnnxModel(model_name=model_name)

    tokenizer = tokenizers.Tokenizer.from_file(
        os.path.join(MOONSHINE_DEMO_DIR, "..", "assets", "tokenizer.json")
    )

    # Read the whole file; the demo only supports mono, 16 kHz, 16-bit audio.
    with wave.open(wav_path) as f:
        params = f.getparams()
        assert (
            params.nchannels == 1
            and params.framerate == 16000
            and params.sampwidth == 2
        ), "WAV file must have 1 channel, 16 kHz rate, and int16 precision."
        audio = f.readframes(params.nframes)
    # Scale int16 samples to float32 in roughly [-1.0, 1.0].
    audio = np.frombuffer(audio, np.int16) / np.iinfo(np.int16).max
    audio = audio.astype(np.float32)

    # Segment the speech on talker pauses with the silero-vad detector.
    vad_model = load_silero_vad()
    speech_timestamps = get_speech_timestamps(
        audio,
        vad_model,
        max_speech_duration_s=30,
        min_silence_duration_ms=2000,
        min_speech_duration_ms=250,
        speech_pad_ms=400,
    )
    chunks = [audio[ts["start"] : ts["end"]] for ts in speech_timestamps]

    chunks_length = 0  # Total number of samples passed to the model.
    transcription = ""

    start_time = time.time()

    # Transcribe each chunk and concatenate the predicted texts.
    for chunk in chunks:
        tokens = model.generate(chunk[None, ...])
        transcription += tokenizer.decode_batch(tokens)[0] + " "

        chunks_length += len(chunk)

    time_took = time.time() - start_time

    # Realtime factor: seconds of transcribed audio per second of inference.
    print(f"""
{transcription}

model realtime factor: {((chunks_length / 16000) / time_took):.2f}x
""")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog="file_transcription.py",
        description="Standalone file transcription with Moonshine ONNX models.",
    )
    parser.add_argument(
        "--model_name",
        help="Model to run the demo with.",
        default="moonshine/base",
        choices=["moonshine/base", "moonshine/tiny"],
    )
    parser.add_argument(
        "--wav_path",
        help="Path to speech WAV file.",
        default=os.path.join(
            MOONSHINE_DEMO_DIR, "..", "assets", "a_tale_of_two_cities.wav"
        ),
    )
    args = parser.parse_args()
    main(**vars(args))