Final preparation for initial release
dscripka committed Dec 23, 2022
1 parent 7fd4757 commit bc3febc
Showing 16 changed files with 382 additions and 89 deletions.
84 changes: 43 additions & 41 deletions README.md

Large diffs are not rendered by default.

12 changes: 7 additions & 5 deletions docs/models/alexa.md
@@ -1,8 +1,8 @@
# Model Description

A model trained to detect the presence of the word "Alexa" in an audio recording of speech.

Other similar phrases such as "Hey Alexa" or "Alexa stop" may also work, but likely with higher false-reject rates. Similarly, a short pause after speaking the wakeword is recommended, but the model may also detect the presence of the wakeword in a continuous stream of speech in certain cases.

# Training Data

@@ -13,9 +13,9 @@ The model was trained on approximately ~100,000 synthetically generated clips of
1) [NVIDIA WAVEGLOW](https://github.com/NVIDIA/waveglow) with the LibriTTS multi-speaker model
2) [VITS](https://github.com/jaywalnut310/vits) with the VCTK multi-speaker model

Clips were generated both with the trained speaker embeddings and with mixtures of individual speaker embeddings to produce novel voices. See the [Synthetic Data Generation](../synthetic_data_generation.md) documentation page for more details.

The following phrases were included in the training data:

1) "Alexa"
2) "Alexa `<random words>`"
@@ -38,12 +38,14 @@ In addition to the above, the total negative dataset also includes reverberated

# Test Data

The positive test examples of the "Alexa" wakeword are those included in [Picovoice's](https://github.com/Picovoice/wake-word-benchmark) repository. These examples are mixed with the HOME recordings of background noise from the [DEMAND](https://zenodo.org/record/1227121#.Y3OSG77MJhE) dataset at an SNR of 10 dB, and have simulated reverberation applied using the real room-impulse-response functions from the [Room Impulse Response and Noise](https://www.openslr.org/28/) dataset.

# Performance

The false-accept/false-reject curve for the model on the test data is shown below. Decreasing the `threshold` parameter when using the model will increase the false-accept rate and decrease the false-reject rate.

![FPR/FRR curve for "alexa" pre-trained model](images/alexa_performance_plot.png)
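
As a rough illustration of this trade-off, the short sketch below (which is not part of the repository) applies two different `threshold` values to a made-up sequence of per-frame model scores:

```python
# Illustrative only: how the `threshold` parameter gates per-frame scores into
# detections. The scores below are invented values, not real model output.
def detections(scores, threshold=0.5):
    """Return the indices of frames whose score meets or exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

frame_scores = [0.02, 0.05, 0.41, 0.87, 0.93, 0.30, 0.04]

print(detections(frame_scores, threshold=0.5))  # stricter: fewer false accepts
print(detections(frame_scores, threshold=0.3))  # looser: fewer false rejects
```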

# Other Considerations

While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
51 changes: 51 additions & 0 deletions docs/models/hey_mycroft.md
@@ -0,0 +1,51 @@
# Model Description

A model trained to detect the presence of the phrase "hey mycroft" in an audio recording of speech.

Other similar phrases such as just "mycroft" may also work, but likely with higher false-reject rates. Similarly, a short pause after speaking the wakeword is recommended, but the model may also detect the presence of the wakeword in a continuous stream of speech in certain cases.

# Training Data

## Positive Data

The model was trained on approximately 100,000 synthetically generated clips of the "hey mycroft" wake phrase using two text-to-speech (TTS) models:

1) [NVIDIA WAVEGLOW](https://github.com/NVIDIA/waveglow) with the LibriTTS multi-speaker model
2) [VITS](https://github.com/jaywalnut310/vits) with the VCTK multi-speaker model

Clips were generated both with the trained speaker embeddings, and also mixtures of individual speaker embeddings to produce novel voices. See the [Synthetic Data Generation](../synthetic_data_generation.md) documentation page for more details.

The following phrases were included in the training data:

1) "hey mycroft"
2) "hey mycroft `<random words>`"

After generating the synthetic positive wakewords, they are augmented in two ways:

1) Mixing with clips from the ACAV100M dataset referenced below at ratios of 0 to 30 dB
2) Reverberated with simulated room impulse response functions from the [BIRD Impulse Response Dataset](https://github.com/FrancoisGrondin/BIRD)
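
The following numpy sketch shows one way these two augmentations can be implemented; it is illustrative only and is not the code used to build the actual training set (the arrays below are placeholders for real audio data):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so the result has approximately the requested SNR."""
    noise = np.resize(noise, speech.shape)                 # crude length matching
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Apply simulated reverberation by convolving with a room impulse response."""
    wet = np.convolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)             # normalize to avoid clipping

# Stand-in arrays; real usage would load 16 kHz audio clips and measured/simulated RIRs.
rng = np.random.default_rng(0)
clip = rng.standard_normal(16000)                          # 1 second of "wakeword" audio
noise = rng.standard_normal(16000)                         # background noise clip
rir = rng.standard_normal(4000) * np.exp(-np.linspace(0, 8, 4000))
augmented = add_reverb(mix_at_snr(clip, noise, snr_db=10), rir)
```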

## Negative Data

The model was trained on approximately 31,000 hours of negative data, with the approximate composition shown below:

1) ~10,000 hours of noise, music, and speech from the [ACAV100M dataset](https://acav100m.github.io/)
2) ~10,000 hours from the [Common Voice 11 dataset](https://commonvoice.mozilla.org/en/datasets), representing multiple languages
3) ~10,000 hours of podcasts downloaded from the [Podcastindex database](https://podcastindex.org/)
4) ~1,000 hours of music from the [Free Music Archive dataset](https://github.com/mdeff/fma)

In addition to the above, the total negative dataset also includes reverberated versions of the ACAV100M dataset (also using the simulated room impulse responses from the [BIRD Impulse Response Dataset](https://github.com/FrancoisGrondin/BIRD) dataset). Currently, adversarial STT generations were not added to the training data for this model.

# Test Data

The positive test examples of the "hey mycroft" wakeword were collected manually in a realistic home environment from both near- and far-field microphones, at distances ranging from ~3 to ~30 feet. The (male) speaker has a relatively neutral American English accent, and the recordings were captured with normal background noise, including fans/air conditioning and a running dishwasher in a kitchen. A total of 51 clips were recorded in this manner.

# Performance

The false-accept/false-reject curve for the model on the test data is shown below. Decreasing the `threshold` parameter when using the model will increase the false-accept rate and decrease the false-reject rate.

![FPR/FRR curve for "hey mycroft" pre-trained model](images/hey_mycroft_performance_plot.png)

# Other Considerations

While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
Binary file added docs/models/images/alexa_performance_plot.png
Binary file added docs/models/images/hey_mycroft_performance.png
55 changes: 55 additions & 0 deletions docs/models/timers.md
@@ -0,0 +1,55 @@
# Model Description

A model trained to detect the presence of several different phrases, all related to creating a timer or alarm for six common durations: 1 minute, 5 minutes, 10 minutes, 20 minutes, 30 minutes, and 1 hour. It is a multi-class model (i.e., each class has a score between 0 and 1) indicating how likely a given segment of speech is to contain a phrase setting a timer/alarm for the corresponding duration.
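
As a minimal sketch of how such multi-class output might be consumed, the snippet below picks the highest-scoring class and compares it to a detection threshold. The class labels are illustrative placeholders, not the model's actual output keys:

```python
# Hypothetical per-class scores for one segment of audio (labels are placeholders).
scores = {
    "1_minute_timer": 0.03,
    "5_minute_timer": 0.02,
    "10_minute_timer": 0.91,
    "20_minute_timer": 0.04,
    "30_minute_timer": 0.01,
    "1_hour_timer": 0.02,
}

threshold = 0.5
best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
if best_score >= threshold:
    print(f"Detected request: {best_label} (score={best_score:.2f})")
else:
    print("No timer/alarm phrase detected")
```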

As with other models, similar phrases beyond those included in the training data may also work, but likely with higher false-reject rates. Similarly, a short pause after speaking the wake phrase is recommended, but the model may also detect the presence of the wake phrase in a continuous stream of speech in certain cases.

# Training Data

## Positive Data

The model was trained on approximately 100,000 synthetically generated clips of the timer/alarm wake phrases using two text-to-speech (TTS) models:

1) [NVIDIA WAVEGLOW](https://github.com/NVIDIA/waveglow) with the LibriTTS multi-speaker model
2) [VITS](https://github.com/jaywalnut310/vits) with the VCTK multi-speaker model

Clips were generated both with the trained speaker embeddings, and also mixtures of individual speaker embeddings to produce novel voices. See the [Synthetic Data Generation](../synthetic_data_generation.md) documentation page for more details.

The following phrases were included in the training data (where x represents the duration, and words in brackets represent possible slot insertions):

- "[create/set/start] [a/NONE] x [minutes/hour] [alarm/timer]"
- "[create/set/start] [an/a/NONE] [alarm/timer] for x [minutes/hour]"

As an example, here are several of the permutations from the structure above that were included in the training data:

- "set an alarm for 10 minutes"
- "start a 1 hour timer"
- "create timer for 5 minutes"

After generating the synthetic positive wake phrases, they are augmented in two ways:

1) Mixing with clips from the ACAV100M dataset referenced below at ratios of 0 to 30 dB
2) Reverberated with simulated room impulse response functions from the [BIRD Impulse Response Dataset](https://github.com/FrancoisGrondin/BIRD)

## Negative Data

The model was trained on approximately 31,000 hours of negative data, with the approximate composition shown below:

1) ~10,000 hours of noise, music, and speech from the [ACAV100M dataset](https://acav100m.github.io/)
2) ~10,000 hours from the [Common Voice 11 dataset](https://commonvoice.mozilla.org/en/datasets), representing multiple languages
3) ~10,000 hours of podcasts downloaded from the [Podcastindex database](https://podcastindex.org/)
4) ~1,000 hours of music from the [Free Music Archive dataset](https://github.com/mdeff/fma)

In addition to the above, the total negative dataset also includes reverberated versions of the ACAV100M dataset (also using the simulated room impulse responses from the [BIRD Impulse Response Dataset](https://github.com/FrancoisGrondin/BIRD) dataset). Currently, adversarial STT generations were not added to the training data for this model.

# Test Data

Currently, there is not a test set available to evaluate this model.

# Performance

Due to the similar training datasets and methods, this model is assumed to have performance comparable to the other pre-trained models (e.g., <5% false-reject rates and <0.5 false-accepts per hour).

# Other Considerations

While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
51 changes: 51 additions & 0 deletions docs/models/weather.md
@@ -0,0 +1,51 @@
# Model Description

A model trained to detect the presence of several different phrases, all related to the current weather conditions. It is a binary model (i.e., scores are between 0 and 1), and it only indicates whether a weather-related phrase is present, not any other details about the phrase.

As with other models, similar phrases beyond those included in the training data may also work, but likely with higher false-reject rates. Similarly, a short pause after speaking the wake phrase is recommended, but the model may also detect the presence of the wake phrase in a continuous stream of speech in certain cases.

# Training Data

## Positive Data

The model was trained on approximately 100,000 synthetically generated clips of the weather-related wake phrases using two text-to-speech (TTS) models:

1) [NVIDIA WAVEGLOW](https://github.com/NVIDIA/waveglow) with the LibriTTS multi-speaker model
2) [VITS](https://github.com/jaywalnut310/vits) with the VCTK multi-speaker model

Clips were generated both with the trained speaker embeddings, and also mixtures of individual speaker embeddings to produce novel voices. See the [Synthetic Data Generation](../synthetic_data_generation.md) documentation page for more details.

The following phrases were included in the training data:
- "what is the weather"
- "what's the weather"
- "what's today's weather"
- "tell me the weather"
- "tell me today's weather"

After generating the synthetic positive wake phrases, they are augmented in two ways:

1) Mixing with clips from the ACAV100M dataset referenced below at ratios of 0 to 30 dB
2) Reverberated with simulated room impulse response functions from the [BIRD Impulse Response Dataset](https://github.com/FrancoisGrondin/BIRD)

## Negative Data

The model was trained on approximately 31,000 hours of negative data, with the approximate composition shown below:

1) ~10,000 hours of noise, music, and speech from the [ACAV100M dataset](https://acav100m.github.io/)
2) ~10,000 hours from the [Common Voice 11 dataset](https://commonvoice.mozilla.org/en/datasets), representing multiple languages
3) ~10,000 hours of podcasts downloaded from the [Podcastindex database](https://podcastindex.org/)
4) ~1,000 hours of music from the [Free Music Archive dataset](https://github.com/mdeff/fma)

In addition to the above, the total negative dataset also includes reverberated versions of the ACAV100M dataset (also using the simulated room impulse responses from the [BIRD Impulse Response Dataset](https://github.com/FrancoisGrondin/BIRD) dataset). Currently, adversarial STT generations were not added to the training data for this model.

# Test Data

Currently, there is not a test set available to evaluate this model.

# Performance

Due to the similar training datasets and methods, this model is assumed to have performance comparable to the other pre-trained models (e.g., <5% false-reject rates and <0.5 false-accepts per hour).

# Other Considerations

While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
23 changes: 16 additions & 7 deletions docs/synthetic_data_generation.md
@@ -1,16 +1,25 @@
# Synthetic Data Generation

The use of synthetic data for training STT or wakeword/phrase detection models is not a new concept, and in particular openWakeWord was inspired by two specific papers:

1) [Speech Model Pre-training for End-to-End Spoken Language Understanding](https://arxiv.org/abs/1904.03670)
2) [Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models](https://arxiv.org/abs/1910.09463)

In general, pre-training a model on large speech datasets and then fine-tuning a smaller model on top of this (typically frozen) backbone with use-case-specific data is a well-documented approach that works well for many different applications.
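
A generic PyTorch sketch of this pattern is shown below; it is not the openWakeWord architecture itself, just an illustration of training a small head on top of a frozen, pre-trained backbone:

```python
import torch
import torch.nn as nn

class FrozenBackboneClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, embedding_dim: int, n_classes: int = 1):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze the pre-trained backbone
            p.requires_grad = False
        self.head = nn.Sequential(             # small trainable head
            nn.Linear(embedding_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            emb = self.backbone(features)
        return torch.sigmoid(self.head(emb))

# Stand-in backbone; a real setup would load pre-trained weights instead.
backbone = nn.Sequential(nn.Linear(96, 128), nn.ReLU())
model = FrozenBackboneClassifier(backbone, embedding_dim=128)
scores = model(torch.randn(8, 96))             # shape (8, 1), values in [0, 1]
```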

# Choosing TTS Models

During the development of openWakeWord, much effort went into identifying TTS models that could produce high-quality synthetic speech to use as training data. In particular, two features are assumed to be important for producing robust wakeword models:

1) Random variability in the generated speech (in practice, models based on sampling work well)
2) Multi-speaker models

According to these criteria, the two models chosen as the foundation for openWakeWord model training are [NVIDIA WAVEGLOW](https://github.com/NVIDIA/waveglow) and [VITS](https://github.com/jaywalnut310/vits). The authors and publishers of these models deserve credit for releasing these high-quality models to the community.

# Increasing Diversity in Generated Speech

Beyond the inherent ability of Waveglow and VITS to produce variable speech, they both also have hyper-parameters that can be adjusted to control this effect to some extent. A forthcoming repository dedicated to dataset generation will provide more details on this, but in brief:

1) Relatively high values are used for sampling parameters (which results in more variation in the generated speech), even if this causes low-quality or incorrect generations some small percentage of the time.

2) To go beyond the original number of speakers used in multi-speaker datasets, [spherical interpolation](https://en.wikipedia.org/wiki/Slerp) of speaker embeddings is used to produce mixtures of different voices that extend beyond the original training set. While this occasionally results in lower-quality generations (in particular, a gravelly texture to the speech), the benefits of increased generation diversity seem to be more important for the trained openWakeWord models.
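
A small numpy sketch of spherical interpolation between two speaker embeddings is shown below; the embedding dimensionality and format are assumptions for illustration only:

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherically interpolate between unit-normalized embeddings a and b."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))   # angle between embeddings
    if np.isclose(omega, 0.0):
        return a                                          # nearly identical speakers
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Blend two stand-in speaker embeddings to produce a "new" voice.
rng = np.random.default_rng(0)
speaker_a, speaker_b = rng.standard_normal(256), rng.standard_normal(256)
new_speaker = slerp(speaker_a, speaker_b, t=0.5)
```
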
19 changes: 19 additions & 0 deletions examples/README.md
@@ -0,0 +1,19 @@
# Examples

Included are several example scripts demonstrating the usage of openWakeWord. Some of these examples have specific requirements, which are detailed below.

## Detect From Microphone

This is a simple example which allows you to test openWakeWord by using a locally connected microphone. To run the script, follow these steps:

1) Install the example-specific requirements: `pip install plotext pyaudio`

2) Run the script: `python detect_from_microphone.py`.

Note that if you have more than one microphone connected to your system, you may need to adjust the PyAudio configuration in the script to select the appropriate input device.
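
The sketch below shows one way to list the available input devices and open a specific one with PyAudio; the commented-out openWakeWord call is a placeholder assumption, and the actual `detect_from_microphone.py` script should be treated as the reference:

```python
import numpy as np
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):            # list available input devices
    info = pa.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(i, info["name"])

DEVICE_INDEX = 0                                  # choose an index printed above
CHUNK = 1280                                      # 80 ms of 16 kHz audio
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=CHUNK,
                 input_device_index=DEVICE_INDEX)

while True:
    frame = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
    # scores = model.predict(frame)               # placeholder openWakeWord call
    # ...compare scores against a threshold here...
```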

## Benchmark Efficiency

This script estimates how many openWakeWord models could be run on the specified number of cores of the current system. It can be useful for determining whether a given system has the resources required for a particular use case.

To run the script: `python benchmark_efficiency.py --ncores <desired integer number of cores>`
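
The idea behind the benchmark can be sketched as follows (this is not the actual script): time a single inference on one core and estimate how many models could keep up with real-time audio, where each 80 ms frame must be processed in under 80 ms. The `fake_predict` function is a placeholder for a real model inference:

```python
import time

def fake_predict(frame):        # placeholder standing in for a real model inference
    time.sleep(0.004)           # pretend one inference takes ~4 ms

FRAME_SECONDS = 0.08            # 1280 samples at 16 kHz
n_frames = 50
start = time.perf_counter()
for _ in range(n_frames):
    fake_predict(None)
per_frame = (time.perf_counter() - start) / n_frames

print(f"~{per_frame * 1000:.1f} ms per inference")
print(f"~{int(FRAME_SECONDS / per_frame)} models could run in real time on one core")
```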
