Commit

Typo fixes in documentation [skip ci]
dscripka committed Mar 5, 2023
1 parent 9f19d30 commit 89b6305
Showing 9 changed files with 34 additions and 34 deletions.
36 changes: 18 additions & 18 deletions README.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/custom_verifier_models.md
@@ -1,6 +1,6 @@
# Custom Verifier Models

- If the performance of a trained openWakeWord model is not sufficient in a production application, training a custom verifier model on a particular speaker or set of speakers can help significantly the performance of the system. A custom verifier model acts as a filter on top of the base openWakeWord model, determining whether a given activation was likely from a known target speaker. In particular, this can be a very effective way at reducing false activiations, as the model will be more focused on a the target speaker instead of attempting to activate for any speaker.
+ If the performance of a trained openWakeWord model is not sufficient in a production application, training a custom verifier model on a particular speaker or set of speakers can significantly improve the performance of the system. A custom verifier model acts as a filter on top of the base openWakeWord model, determining whether a given activation was likely from a known target speaker. In particular, this can be a very effective way to reduce false activations, as the model will be more focused on the target speaker instead of attempting to activate for any speaker.

There are trade-offs to this approach, however. In general, training a custom verifier model can be beneficial under two assumptions:

@@ -16,16 +16,16 @@ Note that while the verifier model is focused on a target speaker, it is not int

# Verifier Model Training

- Training a custom verifier model is conceptually simple, and only requires a very small amount of training data. Reccomendations for training data collection are listed below.
+ Training a custom verifier model is conceptually simple and requires only a very small amount of training data. Recommendations for training data collection are listed below.

- Positive data (examples of wakeword or phrase)
- Collect a minimum of 3 examples for each target speaker
- Positive examples should be as close as possible to the expected deployment scenario, including some level of background noise if that is appropriate
  - The capacity of the verifier model is small, so it's not advised to train on a large number of positive examples or for more than a few speakers

- Negative data collection
-   - Collect a minimum of ~10 seconds of speech from each target speaker that does not contain the wakword, trying to include as much variation as possible in the speech
-   - Optionally, collect ~5 seconds clips of typical background audio in the deployment evironment or use previously collected examples of false activations (this is one of the most effective ways to reduce false activations)
+   - Collect a minimum of ~10 seconds of speech from each target speaker that does not contain the wakeword, trying to include as much variation as possible in the speech
+   - Optionally, collect ~5-second clips of typical background audio in the deployment environment, or use previously collected examples of false activations (this is one of the most effective ways to reduce false activations)
  - The capacity of the verifier model is small, so it's not advised to train on a very large number of negative examples, as the verifier model should focus only on the deployment environment and user(s)

After collecting the positive and negative examples, a custom verifier model can be trained with the `openwakeword.train_custom_verifier` function:
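The concrete example is truncated in this diff view, so the sketch below is only an illustration of such a call; the argument names are assumptions based on the recommendations above, not a verbatim copy of the released API.

```python
import openwakeword

# A minimal sketch of verifier training; argument names are assumptions.
openwakeword.train_custom_verifier(
    positive_reference_clips=[           # >= 3 wakeword clips per target speaker
        "speaker1_wakeword_01.wav",
        "speaker1_wakeword_02.wav",
        "speaker1_wakeword_03.wav",
    ],
    negative_reference_clips=[           # ~10 seconds of non-wakeword speech, plus background audio
        "speaker1_other_speech.wav",
        "living_room_background.wav",
    ],
    output_path="my_verifier.pkl",       # where the trained verifier is saved
    model_name="hey_mycroft_v0.1.onnx",  # the base openWakeWord model the verifier filters
)
```

At inference time the verifier then acts as described above: the base wakeword model scores the audio first, and its activations are filtered by the speaker-specific verifier.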
4 changes: 2 additions & 2 deletions docs/models/alexa.md
@@ -6,7 +6,7 @@ Other similar phrases such as "Hey Alexa" or "Alexa stop" may also work, but lik

# Model Architecture

- The model is a simple 3-layer full-connected network, that takes the flattened input features from the frozen audio embedding mode. ReLU activations and layer norms are inserted between the layers. A representive (but not exact) example of this structure is shown below.
+ The model is a simple 3-layer fully-connected network that takes the flattened input features from the frozen audio embedding model. ReLU activations and layer norms are inserted between the layers. A representative (but not exact) example of this structure is shown below.
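In the original document this structure is rendered as a layer-by-layer summary table, which is collapsed in this diff view. As a rough sketch only, a network of this shape could be written in PyTorch as follows; the layer widths are illustrative assumptions, not the released configuration.

```python
import torch.nn as nn

# Illustrative sizes only; the released model's exact dimensions may differ.
n_frames, emb_dim, hidden = 16, 96, 128  # a flattened window of frozen embedding features

model = nn.Sequential(
    nn.Flatten(),                          # (batch, n_frames, emb_dim) -> (batch, n_frames * emb_dim)
    nn.Linear(n_frames * emb_dim, hidden),
    nn.LayerNorm(hidden),
    nn.ReLU(),
    nn.Linear(hidden, hidden),
    nn.LayerNorm(hidden),
    nn.ReLU(),
    nn.Linear(hidden, 1),
    nn.Sigmoid(),                          # single score: probability that the wakeword occurred
)
```

The same basic shape is described for the other single-wakeword models below (hey_mycroft and weather), while the hey_jarvis model is described as using two such networks.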

```
(layer summary table collapsed in diff view)
```

@@ -79,4 +79,4 @@ The false-accept/false-reject curve for the model on the test data is shown belo

# Other Considerations

- While the model was trained to be robust to background noise and reverberation, it will still perform the best when the audio is relativey clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
+ While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, playback of music or speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
4 changes: 2 additions & 2 deletions docs/models/hey_jarvis.md
@@ -6,7 +6,7 @@ Other similar phrases such as just "jarvis" may also work, but likely with hi

# Model Architecture

- The model is composed of two parts two 3-layer fully connected networks that takes the flattened input features from the frozen audio embedding mode. ReLU activations and layer norms are inserted between the layers. A representive (but not exact) example of this structure is shown below.
+ The model is composed of two parts: two 3-layer fully-connected networks that take the flattened input features from the frozen audio embedding model. ReLU activations and layer norms are inserted between the layers. A representative (but not exact) example of this structure is shown below.

```
(layer summary table collapsed in diff view)
```

@@ -80,4 +80,4 @@ Currently, there is not a test set available to evaluate this model.

# Other Considerations

- While the model was trained to be robust to background noise and reverberation, it will still perform the best when the audio is relativey clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
+ While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, playback of music or speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
4 changes: 2 additions & 2 deletions docs/models/hey_mycroft.md
@@ -6,7 +6,7 @@ Other similar phrases such as just "mycroft" may also work, but likely with h

# Model Architecture

- The model is a simple 3-layer full-connected network, that takes the flattened input features from the frozen audio embedding mode. ReLU activations and layer norms are inserted between the layers. A representive (but not exact) example of this structure is shown below.
+ The model is a simple 3-layer fully-connected network that takes the flattened input features from the frozen audio embedding model. ReLU activations and layer norms are inserted between the layers. A representative (but not exact) example of this structure is shown below.

```
(layer summary table collapsed in diff view)
```

@@ -79,4 +79,4 @@ The false-accept/false-reject curve for the model on the test data is shown belo

# Other Considerations

- While the model was trained to be robust to background noise and reverberation, it will still perform the best when the audio is relativey clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
+ While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, playback of music or speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
4 changes: 2 additions & 2 deletions docs/models/timers.md
@@ -6,7 +6,7 @@ As with other models, similar phrases beyond those included in the training data

# Model Architecture

- The model is a simple 3-layer full-connected network, that takes the flattened input features from the frozen audio embedding mode. As this model is multi-class, the final layer has the number of nodes equal to the number of classes. A softmax layer is added prior to saving the model to return scores that sum to one across the classes. A representive (but not exact) example of this structure is shown below.
+ The model is a simple 3-layer fully-connected network that takes the flattened input features from the frozen audio embedding model. As this model is multi-class, the final layer has a number of nodes equal to the number of classes. A softmax layer is added prior to saving the model so that the returned scores sum to one across the classes. A representative (but not exact) example of this structure is shown below.
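Relative to the single-output sketch shown earlier for the alexa model, only the final layer changes in the multi-class case. The layer widths and class count below are again illustrative assumptions, not the released configuration.

```python
import torch.nn as nn

# Illustrative sizes only; the released model's dimensions and class count may differ.
n_frames, emb_dim, hidden, n_classes = 16, 96, 128, 4

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(n_frames * emb_dim, hidden),
    nn.LayerNorm(hidden),
    nn.ReLU(),
    nn.Linear(hidden, hidden),
    nn.LayerNorm(hidden),
    nn.ReLU(),
    nn.Linear(hidden, n_classes),          # one output node per class
    nn.Softmax(dim=-1),                    # scores sum to one across the classes
)
```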

```
(layer summary table collapsed in diff view)
```

@@ -81,4 +81,4 @@ Due to similar training datasets and methods it is assumed to have similar perfo

# Other Considerations

- While the model was trained to be robust to background noise and reverberation, it will still perform the best when the audio is relativey clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
+ While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, playback of music or speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
4 changes: 2 additions & 2 deletions docs/models/weather.md
@@ -6,7 +6,7 @@ As with other models, similar phrases beyond those included in the training data

# Model Architecture

- The model is a simple 3-layer full-connected network, that takes the flattened input features from the frozen audio embedding mode. ReLU activations and layer norms are inserted between the layers. A representive (but not exact) example of this structure is shown below.
+ The model is a simple 3-layer fully-connected network that takes the flattened input features from the frozen audio embedding model. ReLU activations and layer norms are inserted between the layers. A representative (but not exact) example of this structure is shown below.

```
(layer summary table collapsed in diff view)
```

@@ -80,4 +80,4 @@ Due to similar training datasets and methods it is assumed to have similar perfo

# Other Considerations

- While the model was trained to be robust to background noise and reverberation, it will still perform the best when the audio is relativey clean and free of overly loud background noise. In particular, the presence of audio playback of music/speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
+ While the model was trained to be robust to background noise and reverberation, it will still perform best when the audio is relatively clean and free of overly loud background noise. In particular, playback of music or speech from the same device that is capturing the microphone stream may result in significantly higher false-reject rates unless acoustic echo cancellation (AEC) is performed via hardware or software.
2 changes: 1 addition & 1 deletion docs/synthetic_data_generation.md
@@ -22,4 +22,4 @@ Beyond the inherent ability of Waveglow and VITS to produce variable speech, the

1) Relatively high values are used for the sampling parameters (which results in more variation in the generated speech), even if this causes low-quality or incorrect generations some small percentage of the time.

- 2) To go beyond the original number of speakers used in multi-speaker datasets, [spherical interpolation](https://en.wikipedia.org/wiki/Slerp) of speaker embeddings is used to produce mixtures of different voices to extend beyond the original training set. While this occassionaly results in lower quality generations (in particular a gravely texture to the speech), again the benefits of increased generation diversity seem to be more important for the trained openWakeWord models.
+ 2) To go beyond the original number of speakers used in multi-speaker datasets, [spherical interpolation](https://en.wikipedia.org/wiki/Slerp) of speaker embeddings is used to produce mixtures of different voices that extend beyond the original training set. While this occasionally results in lower-quality generations (in particular, a gravelly texture to the speech), the benefits of increased generation diversity again seem to outweigh this drawback for the trained openWakeWord models.
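For reference, spherical interpolation of two speaker embeddings has a simple closed form. The sketch below is a generic implementation of the linked Slerp formula; the variable names are illustrative.

```python
import numpy as np

def slerp(t: float, v0: np.ndarray, v1: np.ndarray) -> np.ndarray:
    """Spherical linear interpolation between two speaker embeddings (t in [0, 1])."""
    u0 = v0 / np.linalg.norm(v0)
    u1 = v1 / np.linalg.norm(v1)
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))  # angle between the embeddings
    if np.isclose(omega, 0.0):
        return (1.0 - t) * v0 + t * v1  # nearly parallel: fall back to linear interpolation
    return (np.sin((1.0 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

# e.g., a synthetic "mixture" voice halfway between two training speakers:
# mixed_embedding = slerp(0.5, speaker_a_embedding, speaker_b_embedding)
```

Intermediate values of `t` give embeddings that a multi-speaker TTS model can decode as voices between the two original speakers, which is the source of the extra diversity.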
2 changes: 1 addition & 1 deletion examples/README.md
@@ -14,7 +14,7 @@ Note that if you have more than one microphone connected to your system, you may

## Capture Activations

- This script is designed to run silently in the background and capture activations for the inlcluded pre-trained models. You can specify the initialization arguments, activation threshold, and output directory for the saved audio files for each activation. To run the script, follow these steps:
+ This script is designed to run silently in the background and capture activations for the included pre-trained models. You can specify the initialization arguments, activation threshold, and output directory for the saved audio files for each activation. To run the script, follow these steps:

1) Install the example-specific requirements:

