[Feature Request]: Speaker labels (Diarization) #74
I don't know if it helps, but here is the most successful notebook I know of for this task; maybe it's adaptable to Rust? https://github.com/MahmoudAshraf97/whisper-diarization/tree/main
Thanks! Most of the Python implementations use pyannote. We'll probably use the ONNX Runtime via https://github.com/pykeio/ort for segmentation and https://github.com/nkeenan38/voice_activity_detector for VAD.
Or cheat a bit? https://rustpython.github.io (never used it myself, though)
Looks like a useful crate! But I hope we can continue to avoid Python for as long as possible to maintain top-notch performance and quality.
I'm not an expert in this area, so this might be naive, but I tried to create a minimal Python script using an ONNX model (heavily based on https://github.com/pengzhendong/pyannote-onnx) to gain more insight into the process of converting this to Rust (using ort). Converting this script to Rust doesn't seem like a big deal to me, but of course I might be missing something critical here (especially in the segmentation part), haha.

Warning: I tested this only with this file: https://github.com/pengzhendong/pyannote-onnx/blob/master/data/test_16k.wav, so it may not hold up as a general solution.

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from itertools import permutations


class MinimalSpeakerDiarization:
    def __init__(self, model_path):
        self.num_classes = 4               # 1 "no speech" class + 3 speaker classes
        self.vad_sr = 16000                # model expects 16 kHz audio
        self.duration = 10 * self.vad_sr   # 10-second windows
        self.session = ort.InferenceSession(model_path)

    def sample2frame(self, x):
        # Conversion between audio samples and model output frames
        # (constants come from the segmentation model's receptive field).
        return (x - 721) // 270

    def frame2sample(self, x):
        return (x * 270) + 721

    def sliding_window(self, waveform, window_size, step_size):
        start = 0
        num_samples = len(waveform)
        while start <= num_samples - window_size:
            yield waveform[start : start + window_size]
            start += step_size
        if start < num_samples:
            # Zero-pad the tail so the last window has the full size.
            last_window = np.pad(waveform[start:], (0, window_size - (num_samples - start)))
            yield last_window

    def reorder(self, x, y):
        # Pick the permutation of y's speaker columns that best matches x.
        # Speaker order is arbitrary per window, so the overlapping frames
        # are used to keep speaker labels consistent across windows.
        perms = [np.array(perm).T for perm in permutations(y.T)]
        diffs = np.sum(np.abs(np.sum(np.array(perms)[:, : x.shape[0], :] - x, axis=1)), axis=1)
        return perms[np.argmin(diffs)]

    def process_audio(self, audio_path):
        wav, sr = sf.read(audio_path)
        if sr != self.vad_sr:
            raise ValueError(f"Audio sample rate {sr} does not match required {self.vad_sr}")
        wav = wav.astype(np.float32)
        step = 5 * self.vad_sr
        step = max(min(step, int(0.9 * self.duration)), self.duration // 2)
        overlap = self.sample2frame(self.duration - step)
        overlap_chunk = np.zeros((overlap, self.num_classes), dtype=np.float32)
        results = []
        for window in self.sliding_window(wav, self.duration, step):
            window = window.astype(np.float32)
            ort_outs = np.exp(self.session.run(None, {"input": window[None, None, :]})[0][0])
            ort_outs = np.concatenate(
                (
                    1 - ort_outs[:, :1],  # speech probabilities
                    self.reorder(
                        overlap_chunk[:, 1 : self.num_classes],
                        ort_outs[:, 1 : self.num_classes],
                    ),  # speaker probabilities, aligned with the previous window
                ),
                axis=1,
            )
            if len(results) > 0:
                # Average the overlapping frames of consecutive windows.
                ort_outs[:overlap, :] = (ort_outs[:overlap, :] + overlap_chunk) / 2
            overlap_chunk = ort_outs[-overlap:, :]
            results.extend(ort_outs[:-overlap])
        return np.array(results)

    def get_speech_segments_with_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speech_prob = results[:, 0]
        speaker_probs = results[:, 1:]
        segments = []
        in_speech = False
        start = 0
        # First, determine active speakers (those with enough total speech).
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        active_speakers = np.where(speech_duration_ms > min_speech_duration_ms)[0]
        for i, speech in enumerate(speech_prob):
            if not in_speech and speech >= threshold:
                start = i
                in_speech = True
            elif in_speech and speech < threshold:
                speaker_index = np.argmax(np.mean(speaker_probs[start:i], axis=0))
                if speaker_index in active_speakers:
                    speaker = f"speaker{np.where(active_speakers == speaker_index)[0][0] + 1}"
                    segments.append({
                        "start": self.frame2sample(start) / self.vad_sr,
                        "end": self.frame2sample(i) / self.vad_sr,
                        "speaker": speaker,
                    })
                in_speech = False
        if in_speech:
            # Close a segment that runs to the end of the audio.
            speaker_index = np.argmax(np.mean(speaker_probs[start:], axis=0))
            if speaker_index in active_speakers:
                speaker = f"speaker{np.where(active_speakers == speaker_index)[0][0] + 1}"
                segments.append({
                    "start": self.frame2sample(start) / self.vad_sr,
                    "end": self.frame2sample(len(speech_prob)) / self.vad_sr,
                    "speaker": speaker,
                })
        return segments, len(active_speakers)

    def get_num_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speaker_probs = results[:, 1:]
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        return np.sum(speech_duration_ms > min_speech_duration_ms)


if __name__ == "__main__":
    model_path = "segmentation-3.0.onnx"
    audio_path = "test_16k.wav"
    diarizer = MinimalSpeakerDiarization(model_path)
    results = diarizer.process_audio(audio_path)
    speech_segments, num_speakers = diarizer.get_speech_segments_with_speakers(results)
    print("Speech segments with speakers:")
    for segment in speech_segments:
        print(f"Start: {segment['start']:.2f}s, End: {segment['end']:.2f}s, Speaker: {segment['speaker']}")
    print(f"Number of speakers detected: {num_speakers}")
```
Note: the model is https://github.com/pengzhendong/pyannote-onnx/blob/master/pyannote_onnx/segmentation-3.0.onnx
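Since the script raises an error on anything other than 16 kHz input, a small resampling helper can be run first. This is only a sketch; librosa is an assumption on my part and is not used by the script above:

```python
import librosa
import soundfile as sf

def to_16k_mono(src_path: str, dst_path: str) -> None:
    # librosa is an assumption here; any resampler producing 16 kHz mono works.
    wav, _ = librosa.load(src_path, sr=16000, mono=True)  # resample + downmix
    sf.write(dst_path, wav, 16000)
```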
Thanks for helping :)
Very creative!! This probably provides more accurate diarization across various languages and word lengths.
Great :)
I implemented diarization in sherpa-rs/examples/diarize.rs and added it to the Vibe source code.
Nice! And thank you for your contributions. Maybe we should continue the discussion in sherpa-rs...
It is fixed in |
@thewh1teagle any chance this will be added soon? I'm hoping to use this app for a project I'm working on, instead of my current workflow. Thanks!
Thanks for the interest :)
I think the best chance is to add pyannote support in k2-fsa/sherpa-onnx#1197
Ah, I see. Thank you! I appreciate the quick response.
Some updates: on macOS with the medium model, 40s of audio takes 7s normally and 15s with diarization. The diarization itself is fast: about 30s for 1 hour of audio. Todo: download models at runtime instead of embedding them into the exe, to keep the exe lightweight.
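For reference, the "download models at runtime" idea could look roughly like this minimal sketch; the URL and cache layout here are placeholders, not Vibe's actual implementation:

```python
import urllib.request
from pathlib import Path

MODEL_URL = "https://example.com/segmentation-3.0.onnx"  # placeholder, not a real endpoint

def ensure_model(cache_dir: str = "models") -> Path:
    # Download the model once and reuse the cached copy afterwards.
    path = Path(cache_dir) / "segmentation-3.0.onnx"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(MODEL_URL, str(path))
    return path
```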
Speaker diarization released! (Beta) A few things about it:
Exciting news!
Interesting, but FYI I can't run it on an Ubuntu 22.04-based OS; see #207.
I just released a stable release, including support for 22.04. By the way, on Linux I strongly recommend using the tiny model for speed.
The tiny model is nice, especially in terms of speed, but in terms of transcription accuracy I found the Sherpa models better, at least in my tests. :) Note: maybe I should play with the params more...
Thanks, now it works.
Same here, the tiny model gives results too inaccurate to be practical, even with a higher temperature than the default.
Fortunately, the medium model also works on Linux with diarization, even if slowly. Diarization works well! However, in the context of a podcast where the host talks for a while and then interviews other people (I tested with only 2 speakers in total), it splits the successive content of a single speaker into several separately labeled blocks (using the text output format).
To my mind it should ideally not split consecutive content from the same speaker into several labels, but rather group it as several paragraphs under a single label.
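A minimal sketch of that merging step, operating on segment dicts shaped like the ones produced by the script earlier in this thread (`start`, `end`, `speaker`); this is my own illustration, not Vibe's actual output code:

```python
def merge_consecutive(segments):
    """Collapse consecutive segments that share a speaker into one labeled block."""
    merged = []
    for seg in segments:
        if merged and merged[-1]["speaker"] == seg["speaker"]:
            merged[-1]["end"] = seg["end"]  # extend the running block instead of relabeling
        else:
            merged.append(dict(seg))
    return merged
```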
Could you please open a separate issue for this? Phew, that was a huge feature! Now that Vibe supports diarization, we can close this issue :)
And also in Rust!
Goal
Provide speaker labels along with the transcriptions (e.g. `Speaker1: ...`, `Speaker2: ...`). Do it at the same time as transcribing, efficiently and lightweight.
Research
https://github.com/wq2012/awesome-diarization
Possible ways:
- Use c/c++ diarization libs in Rust using bindgen
- Replicate pyannote-audio in Rust with tch-rs
- Use ONNX Runtime with ort
pykeio/ort#208
pyannote/pyannote-audio#1322
Best combination (a rough sketch of the labeling step follows below):
- pyannote-segmentation-30
- WespeakerVoxcelebResnet34LM
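A rough sketch of how the labeling half of that combination could work, assuming per-segment speaker embeddings have already been extracted (e.g. with the WeSpeaker model via ort); the greedy clustering and the 0.5 threshold are my assumptions, not a fixed recipe:

```python
import numpy as np

def assign_speakers(embeddings, threshold=0.5):
    """Greedily label segments by cosine similarity of unit-norm embeddings."""
    centroids, labels = [], []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)
        sims = [float(np.dot(emb, c)) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))  # reuse the closest existing speaker
        else:
            centroids.append(emb)                # start a new speaker
            labels.append(len(centroids) - 1)
    return labels  # segment index -> speaker id
```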