merge with master

Signed-off-by: Jason <[email protected]>
NVIDIA · May 18, 2020 · d62f021 · d62f021
2 parents b1df99d + 9d90a95
commit d62f021
Showing 88 changed files with 6,823 additions and 619 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -75,8 +75,12 @@ To release a new version, please update the changelog as followed:
 - Online Data Augmentation for ASR Collection. ([PR #565](https://github.com/NVIDIA/NeMo/pull/565)) - @titu1994
 - Speed augmentation on CPU, TimeStretch augmentation on CPU+GPU ([PR #594](https://github.com/NVIDIA/NeMo/pull/565)) - @titu1994
 - Added TarredAudioToTextDataLayer, which allows for loading ASR datasets with tarred audio. Existing datasets can be converted with the `convert_to_tarred_audio_dataset.py` script. ([PR #602](https://github.com/NVIDIA/NeMo/pull/602))
+- Online audio augmentation notebook in ASR examples ([PR #605](https://github.com/NVIDIA/NeMo/pull/605)) - @titu1994
+- ContextNet Encoder + Decoder Initial Support ([PR #630](https://github.com/NVIDIA/NeMo/pull/630)) - @titu1994
+- Added finetuning with Megatron-LM ([PR #601](https://github.com/NVIDIA/NeMo/pull/601)) - @ekmb
 
 ### Changed
+- Syncs across workers at each step to check for NaN or inf loss. Terminates all workers if stop\_on\_nan\_loss is set (as before), lets Apex deal with it if apex.amp optimization level is O1 or higher, and skips the step across workers otherwise. ([PR #637](https://github.com/NVIDIA/NeMo/pull/637)) - @redoctopus
 
 ### Dependencies Update
 

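The NaN/inf handling described in the changelog entry for PR #637 can be read as a small decision rule applied once per step. The sketch below is an illustration only, not the code from that PR: the function name `nan_loss_action`, the returned action strings, and the use of a plain list in place of a cross-worker reduction are all assumptions made for the example.

```python
import math
from typing import Optional, Sequence


def nan_loss_action(
    worker_losses: Sequence[float],
    stop_on_nan_loss: bool,
    amp_opt_level: Optional[str] = None,
) -> str:
    """Pick one action that every worker applies to the current step.

    `worker_losses` stands in for the per-worker loss values that a real
    distributed run would combine with a collective operation (assumption).
    """
    # A single non-finite loss on any worker affects the whole step.
    any_bad = any(not math.isfinite(loss) for loss in worker_losses)
    if not any_bad:
        return "step"
    if stop_on_nan_loss:
        # Same behaviour as before the change: stop all workers.
        return "terminate"
    if amp_opt_level in ("O1", "O2", "O3"):
        # Apex AMP at O1 or higher is left to deal with the bad step.
        return "defer_to_amp"
    # Otherwise the step is skipped on every worker.
    return "skip_step"
```

Because the decision depends only on the combined flag, all workers reach the same action and stay in lockstep.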
diff --git a/Jenkinsfile b/Jenkinsfile
@@ -184,6 +184,13 @@ pipeline {
             sh 'rm -rf examples/nlp/token_classification/token_classification_output'
           }
         }
+        stage('Megatron finetuning Token Classification Training/Inference Test') {
+          steps {
+            sh 'cd examples/nlp/token_classification && CUDA_VISIBLE_DEVICES=0 python token_classification.py --data_dir /home/TestData/nlp/token_classification_punctuation/ --batch_size 2 --num_epochs 1 --save_epoch_freq 1 --work_dir megatron_output --pretrained_model_name megatron-bert-345m-uncased'
+            sh 'cd examples/nlp/token_classification && DATE_F=$(ls megatron_output/) && CUDA_VISIBLE_DEVICES=0 python token_classification_infer.py --checkpoint_dir megatron_output/$DATE_F/checkpoints/ --labels_dict /home/TestData/nlp/token_classification_punctuation/label_ids.csv --pretrained_model_name megatron-bert-345m-uncased'
+            sh 'rm -rf examples/nlp/token_classification/megatron_output'
+          }
+        }
         stage ('Punctuation and Classification Training/Inference Test') {
           steps {
             sh 'cd examples/nlp/token_classification && CUDA_VISIBLE_DEVICES=1 python punctuation_capitalization.py --data_dir /home/TestData/nlp/token_classification_punctuation/ --work_dir punctuation_output --save_epoch_freq 1 --num_epochs 1 --save_step_freq -1 --batch_size 2'

diff --git a/README.rst b/README.rst
@@ -24,8 +24,8 @@
 
 
 
-NVIDIA Neural Modules: NeMo
-===========================
+NVIDIA NeMo
+===========
 
 NeMo is a toolkit for creating `Conversational AI <https://developer.nvidia.com/conversational-ai#started>`_ applications.
 
@@ -162,14 +162,33 @@ If you prefer to use NeMo's latest development version (from GitHub) follow the
     python setup.py style --fix  # Tries to fix error in-place.
     python setup.py style --scope=tests  # Operates within certain scope (dir of file).
 
-**Unittests**
+**NeMo Test Suite**
 
-This command runs unittests:
+NeMo contains a test suite divided into 5 subsets:
+ 1) ``unit``: unit tests, i.e. testing a single, well isolated functionality
+ 2) ``integration``: tests checking the elements when integrated into subsystems
+ 3) ``system``: tests working at the highest integration level
+ 4) ``acceptance``: tests checking whether the developed product/model passes the user defined acceptance criteria
+ 5) ``docs``: tests related to documentation (deselect with '-m "not docs"')
+
+The user can run all the tests locally by simply executing:
 
 .. code-block:: bash
 
-    ./reinstall.sh
-    pytest tests
+    pytest
+
+In order to run a subset of tests one can use the ``-m`` argument followed by the subset name, e.g. for ``system`` subset:
+
+.. code-block:: bash
+
+    pytest -m system
+
+By default, all the tests will be executed on GPU. There is also an option to run the test suite on CPU
+by passing the ``--cpu`` command line argument, e.g.:
+
+.. code-block:: bash
+
+    pytest -m unit --cpu
 
 
 Citation

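One way the five markers and the ``--cpu`` flag described in the README hunk above could be wired up is through a ``conftest.py`` that uses pytest's standard ``pytest_addoption`` and ``pytest_configure`` hooks. This is a hedged sketch, not the conftest shipped in this commit: the `MARKERS` dictionary and the `use_gpu` helper are hypothetical names, and the marker descriptions are copied from the README text.

```python
# conftest.py (illustrative sketch; names below are assumptions)

# Marker names and descriptions taken from the README list.
MARKERS = {
    "unit": "unit tests, i.e. testing a single, well isolated functionality",
    "integration": "tests checking the elements when integrated into subsystems",
    "system": "tests working at the highest integration level",
    "acceptance": "tests checking user defined acceptance criteria",
    "docs": "tests related to documentation",
}


def pytest_addoption(parser):
    """Register the --cpu command line flag with pytest."""
    parser.addoption(
        "--cpu",
        action="store_true",
        default=False,
        help="run the test suite on CPU instead of GPU",
    )


def pytest_configure(config):
    """Declare the markers so that `pytest -m <name>` selects a subset."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")


def use_gpu(config) -> bool:
    """Hypothetical helper: tests run on GPU unless --cpu was passed."""
    return not config.getoption("--cpu")
```

A test would then be tagged with, for example, ``@pytest.mark.unit`` and picked up by ``pytest -m unit --cpu``.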
diff --git a/docs/docs_zh/sources/source/index.rst b/docs/docs_zh/sources/source/index.rst
@@ -1,5 +1,5 @@
-NVIDIA Neural Modules 开发者指南(中文版)
-==========================================
+NVIDIA NeMo 开发者指南
+======================
 
 .. toctree::
    :hidden:
@@ -17,7 +17,7 @@ NVIDIA Neural Modules 开发者指南(中文版)
 
 
 
-Neural Modules (NeMo) 是一个用神经模块来构建 AI 应用的工具包，它与具体的框架无关。当前支持 PyTorch 框架。
+NeMo 是一个用神经模块来构建 AI 应用的工具包，它与具体的框架无关。当前支持 PyTorch 框架。
 
 一个“神经模块”指的是，根据一系列的输入来计算一系列输出的代码块。
 
@@ -27,7 +27,7 @@ Neural Modules (NeMo) 是一个用神经模块来构建 AI 应用的工具包，
 
 
 简介
------
+----
 
 我们可以通过以下这个视频有个概览：
 
@@ -39,7 +39,7 @@ Neural Modules (NeMo) 是一个用神经模块来构建 AI 应用的工具包，
 
 
 核心概念和特性
-------------------
+--------------
 
 * `NeuralModule` 类 - 表示以及执行一个神经模块。
 * `NmTensor` - 表示的是神经模块端口之间流动的激活元。
@@ -50,7 +50,7 @@ Neural Modules (NeMo) 是一个用神经模块来构建 AI 应用的工具包，
 
 
 安装依赖
-----------
+--------
 
 1) Python 3.6 or 3.7
 2) PyTorch >= 1.4 带GPU支持

diff --git a/docs/sources/source/api-docs/nemo.rst b/docs/sources/source/api-docs/nemo.rst
@@ -18,6 +18,14 @@ neural_modules
     :undoc-members:
     :show-inheritance:
 
+neural_graph
+--------------
+
+.. automodule:: nemo.core.neural_graph
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
 neural_factory
 --------------
 

diff --git a/docs/sources/source/asr/tutorial.rst b/docs/sources/source/asr/tutorial.rst
@@ -340,6 +340,40 @@ Perform the following steps:
 
         python <nemo_git_repo_root>/examples/asr/jasper_eval.py --model_config=<nemo_git_repo_root>/examples/asr/configs/quartznet15x5.yaml --eval_datasets "<path_to_data>/dev_clean.json" --load_dir=<directory_containing_checkpoints> --lm_path=<path_to_6gram.binary>
 
+
+Using and Converting to Tarred Datasets
+---------------------------------------
+
+If you are training on a distributed cluster, you may want to avoid a dataset consisting of many small files and instead perform batched reads from tarballs.
+In this case, you can use the ``TarredAudioToTextDataLayer`` to load your data.
+
+The ``TarredAudioToTextDataLayer`` takes in an ``audio_tar_filepaths`` argument, which specifies the path(s) to the tarballs that contain the audio files, and a ``manifest_filepath`` argument that should contain the transcripts and durations corresponding to those files (with a unique WAV basename per entry).
+The ``audio_tar_filepaths`` argument can be in the form of a string, either containing a path to a single tarball or braceexpand-able to multiple paths, or a list of paths.
+Note that the data layer's size (via ``len``) is set by the number of entries of the manifest, rather than the number of files across all tarballs.
+
+This DataLayer uses `WebDataset <https://github.com/tmbdev/webdataset>`_ to read the tarred audio files.
+Since reads are performed sequentially, shuffling is done with a buffer which can be specified by the argument ``shuffle_n``.
+
+Please see the ``TarredAudioToTextDataLayer`` `documentation <https://nvidia.github.io/NeMo/collections/nemo_asr.html#nemo.collections.asr.data_layer.TarredAudioToTextDataLayer>`_ and the WebDataset documentation for more details.
+
+.. note::
+
+  If using ``torch.distributed`` processes, the ``TarredAudioToTextDataLayer`` will automatically partition the audio tarballs across workers.
+  As such, if you are training on `n` workers, please make sure to divide your WAV files evenly across a number of tarballs that is divisible by `n`.
+
+Conversion from an Existing Dataset to Tarred Dataset
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you already have an ASR dataset that you would like to convert to one that is compatible with the ``TarredAudioToTextDataLayer``, you can use ``scripts/convert_to_tarred_audio_dataset.py``.
+
+This script takes a few arguments:
+
+* ``manifest_path`` (required): The path to your existing dataset's manifest file.
+* ``target_dir``: The directory where the tarballs and new manifest will be written. If none is given, defaults to ``./tarred``.
+* ``num_shards``: The number of shards (tarballs) to create. If using multiple workers for training, set this to be a multiple of the number of workers you have. Defaults to 1.
+* ``shuffle``: Setting this flag will shuffle the entries in your original manifest before creating the new dataset. You may want to do this if your original dataset is ordered, since the ``TarredAudioToTextDataLayer`` cannot shuffle the whole dataset (see ``shuffle_n``).
+
+
 Kaldi Compatibility
 -------------------
 
@@ -354,7 +388,7 @@ Of course, you will also need the .ark files that contain the audio data in the
 
 To load your Kaldi-formatted data, you can simply use the ``KaldiFeatureDataLayer`` instead of the ``AudioToTextDataLayer``.
 The ``KaldiFeatureDataLayer`` takes in an argument ``kaldi_dir`` instead of a ``manifest_filepath``, and this argument should be set to the directory that contains the files mentioned above.
-See `the documentation <https://nvidia.github.io/NeMo/collections/nemo_asr.html#nemo_asr.data_layer.KaldiFeatureDataLayer>`_ for more detailed information about the arguments to this data layer.
+See `the documentation <https://nvidia.github.io/NeMo/collections/nemo_asr.html#nemo.collections.asr.data_layer.KaldiFeatureDataLayer>`_ for more detailed information about the arguments to this data layer.
 
 .. note::
 

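The tarred-dataset section added to the ASR tutorial above makes two points that are easier to see in code: tarballs are partitioned across workers (hence a shard count divisible by the number of workers), and sequential reads are shuffled only through a buffer of size ``shuffle_n``. The sketch below illustrates both ideas in plain Python. It is not the ``TarredAudioToTextDataLayer`` or WebDataset implementation; `shards_for_worker`, `buffered_shuffle`, the round-robin assignment, and the file names are assumptions made for the example.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shards_for_worker(shard_paths: List[str], rank: int, world_size: int) -> List[str]:
    """Assign tarballs to one worker round-robin (assumed scheme)."""
    if len(shard_paths) % world_size != 0:
        raise ValueError("number of shards must be divisible by the number of workers")
    return shard_paths[rank::world_size]


def buffered_shuffle(items: Iterable[T], shuffle_n: int, seed: int = 0) -> Iterator[T]:
    """Shuffle a sequential stream using a buffer of at most `shuffle_n` items."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= shuffle_n:
            # Emit a random element of the buffer, keeping the rest for later.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain what is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical shard names; 8 shards split evenly over 4 workers.
paths = [f"audio_{i}.tar" for i in range(8)]
print(shards_for_worker(paths, rank=1, world_size=4))
```

With a small ``shuffle_n`` an item can only move a short distance from its original position, which is why shuffling the manifest before conversion (the ``shuffle`` flag) matters for ordered datasets.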
diff --git a/docs/sources/source/index.rst b/docs/sources/source/index.rst
@@ -1,5 +1,5 @@
-NVIDIA Neural Modules Developer Guide
-=====================================
+NVIDIA NeMo Developer Guide
+===========================
 
 .. toctree::
    :hidden:
@@ -9,6 +9,7 @@ NVIDIA Neural Modules Developer Guide
    tutorials/intro
    training
    asr/intro
+   speaker_recognition/intro
    speech_command/intro
    nlp/intro
    tts/intro
@@ -17,7 +18,7 @@ NVIDIA Neural Modules Developer Guide
    chinese/intro
 
 
-Neural Modules (NeMo) is a framework-agnostic toolkit for building AI applications powered by Neural Modules. Current support is for PyTorch framework.
+NeMo is a framework-agnostic toolkit for building AI applications powered by Neural Modules. Current support is for PyTorch framework.
 
 A "Neural Module" is a block of code that computes a set of outputs from a set of inputs.
 

diff --git a/docs/sources/source/nlp/intro.rst b/docs/sources/source/nlp/intro.rst
@@ -32,6 +32,12 @@ Pretraining BERT
 
    bert_pretraining
 
+Megatron-LM for Downstream tasks
+--------------------------------
+.. toctree::
+   :maxdepth: 8
+
+   megatron_finetuning
 
 Transformer Language Model
 --------------------------

diff --git a/docs/sources/source/nlp/joint_intent_slot_filling.rst b/docs/sources/source/nlp/joint_intent_slot_filling.rst
@@ -7,7 +7,7 @@ All the code introduced in this tutorial is based on ``examples/nlp/intent_detec
 
 There are a variety pre-trained BERT models that we can select as the base encoder for our model. We're currently
 using the script for loading pre-trained models from `transformers`. \
-See the list of available pre-trained models by calling `nemo.collections.nlp.nm.trainables.get_bert_models_list()`. \
+See the list of available pre-trained models by calling `nemo_nlp.nm.trainables.get_pretrained_lm_models_list()`. \
 The type of the encoder can get defined by the argument `--pretrained_model_name`.
 
 .. tip::
@@ -95,9 +95,8 @@ Next, we define all Neural Modules participating in our joint intent slot fillin
 
     .. code-block:: python
 
-        pretrained_bert_model = nemo_nlp.nm.trainables.get_huggingface_model(
-            bert_config=args.bert_config, pretrained_model_name=args.pretrained_model_name
-        )
+        bert_model = nemo_nlp.nm.trainables.get_pretrained_lm_model(
+            pretrained_model_name=args.pretrained_model_name)
 
     * Create the classifier heads for our task.
 

diff --git a/docs/sources/source/nlp/megatron_finetuning.rst b/docs/sources/source/nlp/megatron_finetuning.rst
@@ -0,0 +1,36 @@
+Megatron-LM for Downstream Tasks
+================================
+
+Megatron :cite:`nlp-megatron-lm-shoeybi2020megatron` is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.
+More details can be found in the `Megatron-LM GitHub repo <https://github.com/NVIDIA/Megatron-LM>`_.
+
+To finetune a pretrained Megatron BERT language model on the NLP downstream tasks from `examples/nlp <https://github.com/NVIDIA/NeMo/tree/master/examples/nlp>`_, specify ``--pretrained_model_name`` like this:
+
+.. code-block:: bash
+
+    --pretrained_model_name megatron-bert-345m-uncased
+
+For example, to finetune SQuAD v1.1 with Megatron-LM, run:
+
+.. code-block:: bash
+
+    python question_answering_squad.py  \
+    --train_file PATH_TO_DATA_DIR/squad/v1.1/train-v1.1.json  \
+    --eval_file PATH_TO_DATA_DIR/squad/v1.1/dev-v1.1.json \
+    --pretrained_model_name megatron-bert-345m-uncased
+
+
+If you have a different checkpoint or model configuration, use ``--pretrained_model_name megatron-bert-uncased`` or ``--pretrained_model_name megatron-bert-cased`` and specify ``--bert_config`` and ``--bert_checkpoint`` for your model.
+
+.. note::
+    Megatron-LM has its own set of training arguments (including tokenizer) that are ignored during finetuning in NeMo. Please use downstream task training scripts for all NeMo supported arguments.
+
+
+
+References
+----------
+
+.. bibliography:: nlp_all_refs.bib
+    :style: plain
+    :labelprefix: NLP-MEGATRON-LM
+    :keyprefix: nlp-megatron-lm-
diff --git a/docs/sources/source/nlp/ner.rst b/docs/sources/source/nlp/ner.rst
@@ -67,12 +67,15 @@ First, we need to create our neural factory with the supported backend. How you
 
 Next, we'll need to define our tokenizer and our BERT model. There are a couple of different ways you can do this. Keep in mind that NER benefits from casing ("New York City" is easier to identify than "new york city"), so we recommend you use cased models.
 
-If you're using a standard BERT model, you should do it as follows. To see the full list of BERT model names, check out ``nemo.collections.nlp.nm.trainables.get_bert_models_list()``
+If you're using a standard BERT model, you should do it as follows. To see the full list of BERT model names, check out ``nemo_nlp.nm.trainables.get_pretrained_lm_models_list()``
 
     .. code-block:: python
 
-        tokenizer = nemo.collections.nlp.data.NemoBertTokenizer(pretrained_model="bert-base-cased")
-        bert_model = nemo_nlp.nm.trainables.huggingface.BERT(
+        bert_model = nemo_nlp.nm.trainables.get_pretrained_lm_model(
+            pretrained_model_name="bert-base-cased")
+
+        tokenizer = nemo.collections.nlp.data.tokenizers.get_tokenizer(
+            tokenizer_name="nemobert",
             pretrained_model_name="bert-base-cased")
 
 See examples/nlp/token_classification/token_classification.py on how to use a BERT model that you pre-trained yourself.

diff --git a/docs/sources/source/nlp/nlp_all_refs.bib b/docs/sources/source/nlp/nlp_all_refs.bib
@@ -161,3 +161,11 @@ @article{henderson2015machine
   journal={research.google},
   year={2015}
 }
+
+@article{shoeybi2020megatron,
+  title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
+  author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
+  journal={arXiv preprint arXiv:1909.08053},
+  year={2020}
+}
+
diff --git a/docs/sources/source/nlp/punctuation.rst b/docs/sources/source/nlp/punctuation.rst
@@ -102,12 +102,15 @@ Next, we'll need to define our tokenizer and our BERT model. Currently, there ar
 BERT, ALBERT and RoBERTa. These are pretrained model checkpoints from `transformers <https://huggingface.co/transformers>`__ . Apart from these, the user can also do fine-tuning
 on a custom BERT checkpoint, specified by the `--bert_checkpoint` argument in the training script.
 The pretrained back-bone models can be specified `--pretrained_model_name`.
-See the list of available pre-trained models by calling `nemo.collections.nlp.nm.trainables.get_bert_models_list()`. \
+See the list of available pre-trained models by calling `nemo_nlp.nm.trainables.get_pretrained_lm_models_list()`. \
 
     .. code-block:: python
 
-        tokenizer = nemo.collections.nlp.data.NemoBertTokenizer(pretrained_model=PRETRAINED_BERT_MODEL)
-        bert_model = nemo_nlp.nm.trainables.huggingface.BERT(
+        bert_model = nemo_nlp.nm.trainables.get_pretrained_lm_model(
+            pretrained_model_name=PRETRAINED_BERT_MODEL)
+
+        tokenizer = nemo.collections.nlp.data.tokenizers.get_tokenizer(
+            tokenizer_name="nemobert",
             pretrained_model_name=PRETRAINED_BERT_MODEL)
 
 Now, create the train and evaluation data layers:

diff --git a/docs/sources/source/speaker_recognition/datasets.rst b/docs/sources/source/speaker_recognition/datasets.rst
@@ -0,0 +1,19 @@
+Datasets
+========
+
+HI-MIA
+--------
+
+Run the script below to download and process the HI-MIA dataset and generate files in the format supported by `nemo_asr`. Set the data folder for
+HI-MIA using `--data_root`. The script is located in <nemo_root>/scripts.
+
+.. code-block:: bash
+
+    python get_hi-mia_data.py --data_root=<data directory> 
+
+After download and conversion, your `data` folder should contain directories with the following set of files:
+
+* `data/<set>/train.json`
+* `data/<set>/dev.json` 
+* `data/<set>/{set}_all.json` 
+* `data/<set>/utt2spk`
diff --git a/docs/sources/source/speaker_recognition/installation_link.rst b/docs/sources/source/speaker_recognition/installation_link.rst
@@ -0,0 +1 @@
+.. include:: ../asr/installation.rst
diff --git a/docs/sources/source/speaker_recognition/intro.rst b/docs/sources/source/speaker_recognition/intro.rst
@@ -0,0 +1,17 @@
+.. _speaker-recognition-docs:
+
+
+Speaker Recognition
+===================
+
+.. toctree::
+   :maxdepth: 8
+
+   installation_link
+   tutorial
+   datasets
+   models
+
+
+
+
diff --git a/docs/sources/source/speaker_recognition/models.rst b/docs/sources/source/speaker_recognition/models.rst
@@ -0,0 +1,15 @@
+Models
+====================
+
+.. toctree::
+   :maxdepth: 8
+
+   quartznet
+
+References
+----------
+
+.. bibliography:: speaker.bib
+   :style: plain
+   :labelprefix: SPEAKER-TUT
+   :keyprefix: speaker-tut-