
libfvad classifies any noise as a human voice. #23

Open
ababo opened this issue Apr 30, 2020 · 4 comments

Comments

@ababo

ababo commented Apr 30, 2020

Any noise or intense sound is classified as a human voice.

@josharian

That's my experience, too.

@pahlevan

me too

@alamnasim

For me it classifies music as human voice. Can anyone confirm whether it is trained to detect music as human voice?

@jonnor
Contributor

jonnor commented Apr 20, 2024

I have looked into this, and in my opinion these problems stem from the nature of the WebRTC Voice Activity Detection algorithm. It performs online estimation that attempts to separate the "background" (slowly changing) from the "foreground" (rapidly changing) signal. This is done using a Gaussian Mixture Model over 6 frequency sub-bands, with coefficients tuned to favor speech bands. Conceptually, it is an energy-based VAD with an adaptive threshold.
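To make that last point concrete, here is a minimal conceptual sketch in C of an energy-based VAD with an adaptive threshold. This is not the actual WebRTC code (which works per sub-band with a GMM); the smoothing factor and threshold ratio below are made-up values:

```c
#include <stddef.h>
#include <stdint.h>

static double background = 0.0;  /* slowly adapting estimate of the noise floor */

/* Returns 1 if the frame is considered "foreground" (speech candidate). */
int energy_vad(const int16_t *frame, size_t n)
{
    double energy = 0.0;
    for (size_t i = 0; i < n; i++)
        energy += (double)frame[i] * (double)frame[i];
    energy /= (double)n;

    /* Slow update: the background estimate tracks the long-term level. */
    const double alpha = 0.995;  /* assumed smoothing factor */
    background = alpha * background + (1.0 - alpha) * energy;

    /* Flag "speech" when the frame stands well above the adapted floor.
     * Any sudden loud sound trips this, which is why impulsive noise
     * and music end up classified as voice. */
    const double margin = 4.0;   /* assumed threshold ratio */
    return energy > margin * (background + 1e-9);
}
```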

So in practice it acts more like a novelty detector: any (short) change to the acoustic signal is considered a likely candidate for "speech". This means that it is good for:

  • Separating silence from speech.
  • Separating slowly varying noise sources from speech, such as HVAC hum, distant car traffic, a PC fan, etc.

And that it is not good for:

  • Separating repeated impulsive or intermittent noises from speech, such as keyboard clicks.
  • Separating music from speech, both vocal and non-vocal musical content.
  • Separating speech from backgrounds with a lot of near-constant noise, where the SNR of the speech is low, such as standing next to a busy highway.

So if those capabilities are needed, one needs a more advanced algorithm, for example a model trained on large datasets to separate speech from other sounds. This could be run as a second stage after this VAD (see the sketch below).
Alternatively, the filterbank in WebRTC VAD (which is very computationally efficient) could be used as features for such a supervised model. I am considering doing the latter as an example/demo for the https://github.com/emlearn/emlearn project.
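A minimal sketch of the two-stage idea, using libfvad's public API (fvad_new, fvad_set_sample_rate, fvad_set_mode, fvad_process) as the cheap first stage. Note that second_stage_is_speech is a hypothetical function standing in for a trained speech/non-speech classifier, not part of libfvad:

```c
#include <fvad.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical second-stage classifier: a model trained to separate
 * speech from music/noise. Not provided by libfvad. */
int second_stage_is_speech(const int16_t *frame, size_t n);

int process_stream(const int16_t *samples, size_t n_samples)
{
    Fvad *vad = fvad_new();
    if (!vad)
        return -1;

    fvad_set_sample_rate(vad, 16000);  /* 16 kHz input */
    fvad_set_mode(vad, 3);             /* most aggressive built-in mode */

    const size_t frame_len = 480;      /* 30 ms at 16 kHz */
    for (size_t off = 0; off + frame_len <= n_samples; off += frame_len) {
        const int16_t *frame = samples + off;

        /* Stage 1: cheap novelty gate. Rejects silence and slowly
         * varying noise, but still passes music and impulsive sounds. */
        if (fvad_process(vad, frame, frame_len) != 1)
            continue;

        /* Stage 2: run the expensive classifier only on frames the
         * VAD flagged, to separate actual speech from other sounds. */
        if (second_stage_is_speech(frame, frame_len))
            printf("speech at sample %zu\n", off);
    }

    fvad_free(vad);
    return 0;
}
```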
