Standalone Faster-Whisper-XXL features #231
-
I really like the new parameter --vad_alt_method. The silero_v3, silero_v4, and pyannote_onnx_v3 methods are much better than the original VAD. For example, the original VAD leaves some gaps, and sentences starting with "So" often get a delayed start timestamp. Both issues are resolved with silero_v3/silero_v4/pyannote_onnx_v3. Finally, which of these three methods gives the best test results? Or what are their characteristics?
-
Any hope of doing something similar for Mac in the future?
-
A little annoyance: I'm running Faster-Whisper-XXL in a Nextcloud folder (a cron job checks whether new audio files have been synchronized, then runs faster-whisper-xxl). So far this has worked fine, but in r192.3.3 with MDX filtering enabled, it seems the *_mdx.wav file is created first and then moved to a temp folder (?). The move fails because Nextcloud is already trying to sync the mdx file, and faster-whisper-xxl then quits with an error saying the *_mdx.wav file is already in use. For now I have set the Nextcloud rules to ignore *_mdx.wav files, but would it be possible to create them in a temp folder from the start?
-
Do I need to use some kind of flag to improve recognition when there is a little noise or soft music in the audio?
-
Such a great tool, especially for those who aren't very savvy with Python or the command line! Thanks for creating it! Is it possible to perform speaker diarization with this standalone version?
-
Hey @Purfview, I was wondering if you have (or are willing to run) any benchmarks that compare
-
Hi @Purfview, I ran a test with the --ff_mdx_kim2 feature and it took a long time to complete: about 45 minutes for a 10-minute video. Does the voice extraction run on the GPU or the CPU?
-
Is there a set of parameters that works best for capturing very short audio clips? My clips containing just "Yes" or "Let's go" produce a blank transcription. I've adjusted --vad_min_speech_duration_ms and others, but nothing catches these short clips.
-
Is there any way to make auto dialogs work?
instead of
Thanks!
-
Since this Faster-Whisper build has been modified from the original version, could you please upload the source code so the community can contribute and add new features? Or am I missing something? Thanks!
-
Are |
-
Includes all Standalone Faster-Whisper features + the additional ones mentioned below.
Includes all needed libs.
Vocal extraction model:
--ff_mdx_kim2: Preprocess audio with MDX23 Kim vocal v2 model (thanks to Kimberley Jensen). [Better than HT Demucs v4 FT]
Alternative VAD (Voice activity detection) methods:
--vad_method choices:
silero_v3 - Generally less accurate than v4, but doesn't have some quirks of v4.
silero_v4 - Same as silero_v4_fw. Runs original Silero's code instead of adapted one.
silero_v5 - Same as silero_v5_fw. Runs original Silero's code instead of adapted one.
silero_v4_fw - Default model. Most accurate Silero version, has some non-fatal quirks.
silero_v5_fw - Bad accuracy. Not a VAD, it's Random Detector of Some Speech :), has various fatal quirks. Avoid!
pyannote_v3 - The best accuracy, supports CUDA.
pyannote_onnx_v3 - Lite version of pyannote_v3. Similar accuracy to Silero v4, maybe a bit better, supports CUDA.
webrtc - Low accuracy, outdated VAD. Takes only 'vad_min_speech_duration_ms' & 'vad_speech_pad_ms'.
auditok - Actually it's not VAD, it's AAD - Audio Activity Detection.
Speaker Diarization:
--diarize choices:
pyannote_v3.0 - Fastest for CPU
pyannote_v3.1 - Same as v3.0 but should be faster with CUDA
reverb_v1 - Allegedly better than pyannote v3
reverb_v2 - The slowest, allegedly the best
For more, read and post there -> Speaker Diarization
Legal notice: Reverb models are only for personal non-profit use.
Latest CTranslate2:
Up to ~26% faster on CPU with int8 quantization.
Flash attention support (CUDA only), but the benchmarks show no effect on performance.
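For anyone scripting these options (for example from a cron job), here is a minimal Python sketch that assembles a command line from the choices listed above and rejects values that are not in those lists. Several things are assumptions, not confirmed by this post: the executable name `faster-whisper-xxl` is taken from a comment in this thread, the audio path is assumed to be a positional argument, and `--ff_mdx_kim2` is assumed to be a bare switch that takes no value. The file name `interview.wav` is hypothetical. Nothing is executed; the script only builds and prints the command.

```python
import shlex

# Choices copied from the feature list in this post.
VAD_METHODS = {
    "silero_v3", "silero_v4", "silero_v5", "silero_v4_fw", "silero_v5_fw",
    "pyannote_v3", "pyannote_onnx_v3", "webrtc", "auditok",
}
DIARIZE_METHODS = {"pyannote_v3.0", "pyannote_v3.1", "reverb_v1", "reverb_v2"}


def build_command(audio_path, vad_method=None, diarize=None, mdx_kim2=False,
                  exe="faster-whisper-xxl"):
    """Return an argv list for the standalone tool; nothing is run here.

    Assumptions: `exe` is the executable name, the audio file is a
    positional argument, and --ff_mdx_kim2 is a switch without a value.
    """
    if vad_method is not None and vad_method not in VAD_METHODS:
        raise ValueError(f"unknown --vad_method choice: {vad_method}")
    if diarize is not None and diarize not in DIARIZE_METHODS:
        raise ValueError(f"unknown --diarize choice: {diarize}")

    cmd = [exe, audio_path]
    if mdx_kim2:
        cmd.append("--ff_mdx_kim2")
    if vad_method is not None:
        cmd += ["--vad_method", vad_method]
    if diarize is not None:
        cmd += ["--diarize", diarize]
    return cmd


# Hypothetical input file; adjust the path and choices to your setup.
cmd = build_command("interview.wav", vad_method="pyannote_onnx_v3",
                    diarize="pyannote_v3.1", mdx_kim2=True)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the assumptions above have been checked against the tool's own `--help` output.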