Merge branch 'master' of github.com:NVIDIA/NeMo into dev-cv-image-classification
Showing 133 changed files with 8,189 additions and 256 deletions.
@@ -0,0 +1,39 @@
8kHz Models
===========

For applications based on telephony speech, models trained on narrowband audio sampled at 8 kHz may perform better than models built with
audio at a higher sampling rate. (Note that to use a model with audio at a sample rate different from that of your data, you need to resample your data to match the sampling rate in the
model's config file.) One approach to creating large datasets for training a model suited to your application is to convert all audio data
to the formats prevalent in your application. Here we detail one such approach, which we took to train a model on 8 kHz data.

To train a model suitable for recognizing telephony speech, we converted some of the datasets to G.711 :cite:`8kHz-mod-itu1988g711`. G.711 is a popular speech codec used in VoIP products; it encodes speech
at 64 kbps using PCM with u-law companding. We converted audio from the LibriSpeech, Mozilla Common Voice and WSJ datasets to G.711 format and combined it with the Fisher and Switchboard datasets to
train a :ref:`Quartznet15x5 <Quartznet_model>` model on about 4000 hours of data. To convert your audio to G.711 format, you can use the script ``convert_wav_to_g711wav.py`` found in the ``scripts`` sub-directory of the NeMo base directory.
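To illustrate what u-law companding does to a sample, here is a minimal Python sketch. It uses the continuous u-law formula with 8-bit quantization, which is only an approximation of the segmented, bit-exact encoding defined by G.711, and it is not the logic of ``convert_wav_to_g711wav.py``. All function names are hypothetical. At 8000 samples per second and 8 bits per sample, this gives the 64 kbps rate mentioned above.

```python
import math

MU = 255  # companding constant conventionally used for u-law


def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] to a companded value in [-1, 1]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


def quantize_8bit(y):
    """Quantize a companded value in [-1, 1] to an integer code in 0..255."""
    return int(round((y + 1.0) / 2.0 * 255))


def dequantize_8bit(code):
    """Map an integer code in 0..255 back to a value in [-1, 1]."""
    return code / 255 * 2.0 - 1.0


def roundtrip(x):
    """Compress, quantize to 8 bits, then decode back to a linear sample."""
    return mulaw_expand(dequantize_8bit(quantize_8bit(mulaw_compress(x))))


# Quiet samples keep a small absolute error because companding
# spends more of the 256 codes on low amplitudes.
for sample in (0.01, 0.1, 0.5):
    print(sample, roundtrip(sample))
```

For real data, prefer the conversion script named above or an established audio tool, so that the output matches the codec exactly.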
|
||
Among the experiments we ran, the best accuracy came from a model that used the weights of our 16 kHz Quartznet15x5 model as pre-trained weights. We then
trained the model for 250 epochs on the five datasets mentioned above. Here are some results for our best model so far (note that all the test sets
were converted to G.711 format for the results below):

====================== =====================
Test set               WER (%)
====================== =====================
LibriSpeech dev-clean  4.35
LibriSpeech dev-other  11.89
LibriSpeech test-clean 4.45
LibriSpeech test-other 12.02
Switchboard test       10.74
Switchboard dev        10.59
====================== =====================
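For readers unfamiliar with the metric in the table, word error rate (WER) is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The sketch below computes it for a single utterance. The function name is hypothetical, and this is not the scoring code used to produce the numbers above; corpus-level WER is usually computed by summing errors and reference words over all utterances.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER in percent for one reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```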
|
||
The model was first pre-trained on 8 kHz LibriSpeech data for 134 epochs and then trained for another 250 epochs using G.711 audio from all five datasets listed above. For best accuracy
in your application, you may choose to :ref:`fine-tune <fine-tune>` this model using data collected from your application.

..
    The pre-trained model is available for download `here <https://ngc.nvidia.com/models/nvidian:nemo:quartznet_15x5_8_khz_for_nemo>`_.

References
----------

.. bibliography:: asr_all.bib
    :style: plain
    :labelprefix: 8kHz-mod
    :keyprefix: 8kHz-mod-
@@ -10,6 +10,8 @@ Speech Recognition
tutorial
datasets
models
8kHz_models
@@ -12,3 +12,4 @@ Getting started
weightsharing
callbacks
complex_training
neural_graphs
@@ -0,0 +1,54 @@
Neural Graphs
=============

The Neural Graph is a high-level abstraction that lets the user build graphs consisting of many
interconnected Neural Modules.
Once the user defines a graph, its topology is “frozen”, i.e. connections between modules cannot change.
A user who wants to change the topology can build another graph, potentially spanning the same modules.
At the same time, the user can reuse graphs and nest one graph into another.
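The idea of a topology that is recorded inside a context and frozen afterwards can be sketched in plain Python. The ``ToyGraph`` class below is a hypothetical stand-in written for illustration only; it does not use or reproduce the NeMo Neural Graph API, and its "modules" are ordinary callables chained in sequence.

```python
class ToyGraph:
    """Hypothetical illustration of a graph with a frozen topology."""

    def __init__(self, name):
        self.name = name
        self.steps = []      # ordered (label, callable) pairs
        self.frozen = False

    def __enter__(self):
        if self.frozen:
            raise RuntimeError(f"graph '{self.name}' is already frozen")
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the context freezes the topology.
        self.frozen = True
        return False  # do not swallow exceptions

    def add(self, label, fn):
        if self.frozen:
            raise RuntimeError(f"graph '{self.name}' is frozen")
        self.steps.append((label, fn))

    def __call__(self, x):
        # Run the recorded steps in order.
        for _, fn in self.steps:
            x = fn(x)
        return x


def double(x):
    return 2 * x


def add_one(x):
    return x + 1


# Define a small "model" graph; it is frozen when the block ends.
with ToyGraph("model") as model:
    model.add("double", double)
    model.add("add_one", add_one)

# Nest the frozen model graph inside another graph.
with ToyGraph("training") as training:
    training.add("model", model)
    training.add("square", lambda x: x * x)

# A different topology over the same modules needs a new graph.
with ToyGraph("reordered") as reordered:
    reordered.add("add_one", add_one)
    reordered.add("double", double)

print(training(3), reordered(3))
```

Because a graph is itself callable here, nesting needs no special case: the outer graph treats the inner one like any other module.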
|
||
|
||
.. figure:: neural_graphs_general.png

The import/export/save/restore options, combined with the lightweight API, make Neural Graphs
well suited to rapid prototyping and experimentation.

There are two Jupyter Notebook tutorials focusing on different aspects of the Neural Graphs functionality.

Tutorial I: The basic functionality
-----------------------------------

In this first part of the Neural Graphs (NGs) tutorial, we focus on a simple example:
training a TaylorNet module to approximate a sine wave function.
We build a simple "model graph" and show how to nest it into other graphs.
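To give a feel for the learning problem itself, independent of NeMo, the sketch below fits the coefficients of a cubic polynomial to ``sin(x)`` on ``[-1, 1]`` with full-batch gradient descent on the mean squared error. It assumes that the model is a low-degree polynomial with trainable coefficients, which is how we read the TaylorNet name; the function names, grid, learning rate and step count are all choices made for this illustration, not values taken from the tutorial notebook.

```python
import math


def fit_polynomial_to_sine(degree=3, steps=2000, lr=0.5, n_points=41):
    """Fit p(x) = sum_k c[k] * x**k to sin(x) on an even grid over [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n_points - 1) for i in range(n_points)]
    targets = [math.sin(x) for x in xs]
    # The powers x**k are fixed features, so compute them once.
    features = [[x ** k for k in range(degree + 1)] for x in xs]

    coeffs = [0.0] * (degree + 1)
    for _ in range(steps):
        grads = [0.0] * (degree + 1)
        for feats, target in zip(features, targets):
            err = sum(c * f for c, f in zip(coeffs, feats)) - target
            for k, f in enumerate(feats):
                # Gradient of the mean squared error w.r.t. c[k].
                grads[k] += 2.0 * err * f / n_points
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


coeffs = fit_polynomial_to_sine()
max_err = max(
    abs(evaluate(coeffs, i / 20) - math.sin(i / 20)) for i in range(-20, 21)
)
print(coeffs, max_err)
```

The even-order coefficients stay close to zero because the sine function is odd, while the linear and cubic coefficients end up near those of the Taylor series.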
|
||
|
||
.. figure:: neural_graphs_nesting.png

This part covers the following:

* how to create a Neural Graph object
* how to activate/deactivate graph context (in various ways)
* how to bind NG inputs and outputs (in various ways)
* how to nest one graph (representing our "trainable model") into training and validation graphs

Tutorial II: The advanced functionality
---------------------------------------

In this second part of the Neural Graphs (NGs) tutorial, we focus on a more complex example:
training an end-to-end convolutional neural acoustic model called JASPER.
We build a "model graph" and show how to nest it into other graphs, how to freeze/unfreeze modules,
how to use graph configuration, and how to save/load graph checkpoints.

This part covers the following:

* how to nest one graph into another
* how to serialize and deserialize a graph
* how to export and import serialized graph configuration to/from YAML files
* how to save and load graph checkpoints (containing weights of the Trainable NMs)
* how to freeze/unfreeze modules in a graph

Additionally, we show how to use ``AppState`` to list all the modules and graphs created in the scope of
our application.

.. note::
    Both tutorial notebooks can be found in the ``nemo/examples/neural_graphs`` folder.