
libfvad classifies any noise as a human voice. #23

Open
ababo opened this issue Apr 30, 2020 · 4 comments

Comments

@ababo

ababo commented Apr 30, 2020

Any noise or intense sound is classified as a human voice.

@josharian

That's my experience, too.

@pahlevan

me too

@alamnasim

For me it classifies music as human voice. Can anyone confirm whether it is trained to detect music as human voice?

@jonnor
Contributor

jonnor commented Apr 20, 2024

I have looked into this, and in my opinion these problems stem from the nature of the WebRTC Voice Activity Detection algorithm. It performs online estimation that attempts to separate the "background" (slowly changing) from the "foreground" (rapidly changing) signal. This is done using a Gaussian Mixture Model over 6 frequency sub-bands, with coefficients tuned to favor speech bands. Conceptually, it is an energy-based VAD with an adaptive threshold.
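To make that last point concrete, here is a minimal conceptual sketch in C of an energy-based VAD with an adaptive threshold. This is not the actual WebRTC code (which works per sub-band with a GMM); the smoothing factor and threshold ratio below are made-up values:

```c
#include <stddef.h>
#include <stdint.h>

static double background = 0.0;  /* slowly adapting estimate of the noise floor */

/* Returns 1 if the frame is considered "foreground" (speech candidate). */
int energy_vad(const int16_t *frame, size_t n)
{
    double energy = 0.0;
    for (size_t i = 0; i < n; i++)
        energy += (double)frame[i] * (double)frame[i];
    energy /= (double)n;

    /* Slow update: the background estimate tracks the long-term level. */
    const double alpha = 0.995;  /* assumed smoothing factor */
    background = alpha * background + (1.0 - alpha) * energy;

    /* Flag "speech" when the frame stands well above the adapted floor.
     * Any sudden loud sound trips this, which is why impulsive noise
     * and music end up classified as voice. */
    const double margin = 4.0;   /* assumed threshold ratio */
    return energy > margin * (background + 1e-9);
}
```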

So in practice it acts more like a novelty detector: any (short) change to the acoustic signal is considered a likely candidate for "speech". This means that it is good for:

  • Separating silence from speech.
  • Separating slowly varying noise sources from speech, such as HVAC hum, distant car traffic, a PC fan, etc.

And that it is not good for:

  • Separating repeated impulsive or intermittent noises from speech, such as keyboard clicks.
  • Separating music from speech, both vocal and non-vocal musical content.
  • Separating speech from backgrounds with a lot of near-constant noise, where the SNR of the speech is low, such as standing next to a busy highway.

So if those capabilities are needed, one needs a more advanced algorithm, for example a model trained on large datasets to separate speech from other sounds. This could be run as a second stage after this VAD (see the sketch below).
Alternatively, the filterbank in WebRTC VAD (which is very computationally efficient) could be used as features for such a supervised model. I am considering doing the latter as an example/demo for the https://github.com/emlearn/emlearn project.
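A minimal sketch of the two-stage idea, using libfvad's public API (fvad_new, fvad_set_sample_rate, fvad_set_mode, fvad_process) as the cheap first stage. Note that second_stage_is_speech is a hypothetical function standing in for a trained speech/non-speech classifier, not part of libfvad:

```c
#include <fvad.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical second-stage classifier: a model trained to separate
 * speech from music/noise. Not provided by libfvad. */
int second_stage_is_speech(const int16_t *frame, size_t n);

int process_stream(const int16_t *samples, size_t n_samples)
{
    Fvad *vad = fvad_new();
    if (!vad)
        return -1;

    fvad_set_sample_rate(vad, 16000);  /* 16 kHz input */
    fvad_set_mode(vad, 3);             /* most aggressive built-in mode */

    const size_t frame_len = 480;      /* 30 ms at 16 kHz */
    for (size_t off = 0; off + frame_len <= n_samples; off += frame_len) {
        const int16_t *frame = samples + off;

        /* Stage 1: cheap novelty gate. Rejects silence and slowly
         * varying noise, but still passes music and impulsive sounds. */
        if (fvad_process(vad, frame, frame_len) != 1)
            continue;

        /* Stage 2: run the expensive classifier only on frames the
         * VAD flagged, to separate actual speech from other sounds. */
        if (second_stage_is_speech(frame, frame_len))
            printf("speech at sample %zu\n", off);
    }

    fvad_free(vad);
    return 0;
}
```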
