merge with master
Signed-off-by: Jason <[email protected]>
blisc committed May 18, 2020
2 parents b1df99d + 9d90a95 commit d62f021
Showing 88 changed files with 6,823 additions and 619 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,8 +75,12 @@ To release a new version, please update the changelog as followed:
- Online Data Augmentation for ASR Collection. ([PR #565](https://github.com/NVIDIA/NeMo/pull/565)) - @titu1994
- Speed augmentation on CPU, TimeStretch augmentation on CPU+GPU ([PR #594](https://github.com/NVIDIA/NeMo/pull/594)) - @titu1994
- Added TarredAudioToTextDataLayer, which allows for loading ASR datasets with tarred audio. Existing datasets can be converted with the `convert_to_tarred_audio_dataset.py` script. ([PR #602](https://github.com/NVIDIA/NeMo/pull/602))
- Online audio augmentation notebook in ASR examples ([PR #605](https://github.com/NVIDIA/NeMo/pull/605)) - @titu1994
- ContextNet Encoder + Decoder Initial Support ([PR #630](https://github.com/NVIDIA/NeMo/pull/630)) - @titu1994
- Added finetuning with Megatron-LM ([PR #601](https://github.com/NVIDIA/NeMo/pull/601)) - @ekmb

### Changed
- Syncs across workers at each step to check for NaN or inf loss. Terminates all workers if stop\_on\_nan\_loss is set (as before), lets Apex deal with it if apex.amp optimization level is O1 or higher, and skips the step across workers otherwise. ([PR #637](https://github.com/NVIDIA/NeMo/pull/637)) - @redoctopus

### Dependencies Update

7 changes: 7 additions & 0 deletions Jenkinsfile
Expand Up @@ -184,6 +184,13 @@ pipeline {
sh 'rm -rf examples/nlp/token_classification/token_classification_output'
}
}
stage('Megatron finetuning Token Classification Training/Inference Test') {
steps {
sh 'cd examples/nlp/token_classification && CUDA_VISIBLE_DEVICES=0 python token_classification.py --data_dir /home/TestData/nlp/token_classification_punctuation/ --batch_size 2 --num_epochs 1 --save_epoch_freq 1 --work_dir megatron_output --pretrained_model_name megatron-bert-345m-uncased'
sh 'cd examples/nlp/token_classification && DATE_F=$(ls megatron_output/) && CUDA_VISIBLE_DEVICES=0 python token_classification_infer.py --checkpoint_dir megatron_output/$DATE_F/checkpoints/ --labels_dict /home/TestData/nlp/token_classification_punctuation/label_ids.csv --pretrained_model_name megatron-bert-345m-uncased'
sh 'rm -rf examples/nlp/token_classification/megatron_output'
}
}
stage ('Punctuation and Classification Training/Inference Test') {
steps {
sh 'cd examples/nlp/token_classification && CUDA_VISIBLE_DEVICES=1 python punctuation_capitalization.py --data_dir /home/TestData/nlp/token_classification_punctuation/ --work_dir punctuation_output --save_epoch_freq 1 --num_epochs 1 --save_step_freq -1 --batch_size 2'
31 changes: 25 additions & 6 deletions README.rst
Expand Up @@ -24,8 +24,8 @@



NVIDIA Neural Modules: NeMo
===========================
NVIDIA NeMo
===========

NeMo is a toolkit for creating `Conversational AI <https://developer.nvidia.com/conversational-ai#started>`_ applications.

Expand Down Expand Up @@ -162,14 +162,33 @@ If you prefer to use NeMo's latest development version (from GitHub) follow the
python setup.py style --fix # Tries to fix errors in place.
python setup.py style --scope=tests # Operates within a certain scope (directory or file).
**Unittests**
**NeMo Test Suite**

This command runs unittests:
NeMo contains a test suite divided into 5 subsets:
1) ``unit``: unit tests, i.e. testing a single, well isolated functionality
2) ``integration``: tests checking the elements when integrated into subsystems
3) ``system``: tests working at the highest integration level
4) ``acceptance``: tests checking whether the developed product/model passes the user defined acceptance criteria
5) ``docs``: tests related to documentation (deselect with '-m "not docs"')

The user can run all the tests locally by simply executing:

.. code-block:: bash
./reinstall.sh
pytest tests
pytest
To run a subset of tests, pass the ``-m`` argument followed by the subset name, e.g. for the ``system`` subset:

.. code-block:: bash
pytest -m system
By default, all the tests will be executed on GPU. There is also an option to run the test suite on CPU
by passing the ``--cpu`` command line argument, e.g.:

.. code-block:: bash
pytest -m unit --cpu
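
For readers curious how the subset markers and the ``--cpu`` switch can be wired into pytest, here is a minimal sketch of a ``conftest.py``. It is an illustration under stated assumptions, not necessarily NeMo's actual configuration: ``pytest_addoption`` and ``pytest_configure`` are standard pytest hooks, while the ``MARKERS`` constant and the marker descriptions are made up for this example.

```python
# conftest.py (illustrative sketch; NeMo's real conftest.py may differ)

# Subset names come from the list above; the descriptions are paraphrased
# and the MARKERS constant itself is a hypothetical helper.
MARKERS = {
    "unit": "tests of a single, well isolated functionality",
    "integration": "tests of elements integrated into subsystems",
    "system": "tests at the highest integration level",
    "acceptance": "tests of user defined acceptance criteria",
    "docs": "tests related to documentation",
}


def pytest_addoption(parser):
    """Register the --cpu command line flag (off by default)."""
    parser.addoption(
        "--cpu",
        action="store_true",
        default=False,
        help="run the tests on CPU instead of GPU",
    )


def pytest_configure(config):
    """Declare the markers so that `pytest -m <subset>` recognizes them."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With such a file in place, a test or fixture could read the flag through ``request.config.getoption("--cpu")`` and pick the device accordingly.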
Citation
12 changes: 6 additions & 6 deletions docs/docs_zh/sources/source/index.rst
@@ -1,5 +1,5 @@
NVIDIA Neural Modules Developer Guide (Chinese Edition)
=======================================================
NVIDIA NeMo Developer Guide
===========================

.. toctree::
:hidden:
Expand All @@ -17,7 +17,7 @@ NVIDIA Neural Modules Developer Guide (Chinese Edition)



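Neural Modules (NeMo) is a toolkit for building AI applications with neural modules, and it is framework-agnostic. It currently supports the PyTorch framework.
NeMo is a toolkit for building AI applications with neural modules, and it is framework-agnostic. It currently supports the PyTorch framework.

A "neural module" is a block of code that computes a set of outputs from a set of inputs.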

Expand All @@ -27,7 +27,7 @@ Neural Modules (NeMo) is a toolkit for building AI applications with neural modules,


Introduction
--------------
------------

The following video gives an overview:

Expand All @@ -39,7 +39,7 @@ Neural Modules (NeMo) is a toolkit for building AI applications with neural modules,


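Core Concepts and Features
----------------------------
--------------------------

* `NeuralModule` class - represents and executes a neural module.
* `NmTensor` - represents the activations that flow between neural module ports.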
Expand All @@ -50,7 +50,7 @@ Neural Modules (NeMo) is a toolkit for building AI applications with neural modules,


Installation Dependencies
---------------------------
-------------------------

1) Python 3.6 or 3.7
2) PyTorch >= 1.4 with GPU support
8 changes: 8 additions & 0 deletions docs/sources/source/api-docs/nemo.rst
Expand Up @@ -18,6 +18,14 @@ neural_modules
:undoc-members:
:show-inheritance:

neural_graph
--------------

.. automodule:: nemo.core.neural_graph
:members:
:undoc-members:
:show-inheritance:

neural_factory
--------------

36 changes: 35 additions & 1 deletion docs/sources/source/asr/tutorial.rst
Expand Up @@ -340,6 +340,40 @@ Perform the following steps:
python <nemo_git_repo_root>/examples/asr/jasper_eval.py --model_config=<nemo_git_repo_root>/examples/asr/configs/quartznet15x5.yaml --eval_datasets "<path_to_data>/dev_clean.json" --load_dir=<directory_containing_checkpoints> --lm_path=<path_to_6gram.binary>
Using and Converting to Tarred Datasets
---------------------------------------

If you are training on a distributed cluster, you may want to avoid a dataset consisting of many small files and instead perform batched reads from tarballs.
In this case, you can use the ``TarredAudioToTextDataLayer`` to load your data.

The ``TarredAudioToTextDataLayer`` takes in an ``audio_tar_filepaths`` argument, which specifies the path(s) to the tarballs that contain the audio files, and a ``manifest_filepath`` argument that should contain the transcripts and durations corresponding to those files (with a unique WAV basename per entry).
The ``audio_tar_filepaths`` argument can be either a string (a path to a single tarball, or a brace-expandable pattern that covers multiple paths) or a list of paths.
Note that the data layer's size (via ``len``) is set by the number of entries of the manifest, rather than the number of files across all tarballs.

This DataLayer uses `WebDataset <https://github.com/tmbdev/webdataset>`_ to read the tarred audio files.
Since reads are performed sequentially, shuffling is done with a buffer which can be specified by the argument ``shuffle_n``.
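
To make the idea of buffer-based shuffling concrete, here is a minimal pure-Python sketch. It is not WebDataset's implementation; the function name ``buffered_shuffle`` and its ``seed`` parameter are assumptions made for this illustration, and only ``shuffle_n`` mirrors the argument named above.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    samples: Iterable[T], shuffle_n: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Approximately shuffle a sequential stream using a bounded buffer.

    Hypothetical helper: at most `shuffle_n` samples are held in memory,
    so a sample can only move a limited distance from its original position.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for sample in samples:
        buffer.append(sample)
        # Once the buffer is full, emit one randomly chosen element.
        if len(buffer) >= max(shuffle_n, 1):
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The stream is exhausted: emit the leftovers in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    print(list(buffered_shuffle(range(10), shuffle_n=4, seed=0)))
```

A larger buffer gives a shuffle closer to uniform at the cost of memory, which is why shuffling the manifest before creating the tarballs can still be worthwhile.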

Please see the ``TarredAudioToTextDataLayer`` `documentation <https://nvidia.github.io/NeMo/collections/nemo_asr.html#nemo.collections.asr.data_layer.TarredAudioToTextDataLayer>`_ and the WebDataset documentation for more details.

.. note::

If using ``torch.distributed`` processes, the ``TarredAudioToTextDataLayer`` will automatically partition the audio tarballs across workers.
As such, if you are training on `n` workers, please make sure to divide your WAV files evenly across a number of tarballs that is divisible by `n`.
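
The divisibility requirement in the note can be illustrated with a short sketch. This is a hedged example of one possible partitioning scheme (contiguous blocks per worker); the data layer may assign tarballs differently, and ``partition_shards`` is a hypothetical helper, not part of NeMo.

```python
from typing import List, Sequence


def partition_shards(tar_paths: Sequence[str], world_size: int, rank: int) -> List[str]:
    """Return the tarballs assigned to worker `rank` out of `world_size` workers.

    Hypothetical helper: splits the tarballs into equal contiguous blocks and
    rejects configurations where the split would be uneven.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    if len(tar_paths) % world_size != 0:
        raise ValueError(
            f"{len(tar_paths)} tarballs cannot be divided evenly across {world_size} workers"
        )
    per_worker = len(tar_paths) // world_size
    start = rank * per_worker
    return list(tar_paths[start:start + per_worker])


if __name__ == "__main__":
    shards = [f"audio_{i}.tar" for i in range(8)]
    for rank in range(4):
        print(rank, partition_shards(shards, world_size=4, rank=rank))
```

If the tarballs hold very different numbers of WAV files, some workers would finish an epoch earlier than others, which is why the files should be spread evenly.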

Conversion from an Existing Dataset to Tarred Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you already have an ASR dataset that you would like to convert to one that is compatible with the ``TarredAudioToTextDataLayer``, you can use ``scripts/convert_to_tarred_audio_dataset.py``.

This script takes a few arguments:

* ``manifest_path`` (required): The path to your existing dataset's manifest file.
* ``target_dir``: The directory where the tarballs and new manifest will be written. If none is given, defaults to ``./tarred``.
* ``num_shards``: The number of shards (tarballs) to create. If using multiple workers for training, set this to be a multiple of the number of workers you have. Defaults to 1.
* ``shuffle``: Setting this flag will shuffle the entries in your original manifest before creating the new dataset. You may want to do this if your original dataset is ordered, since the ``TarredAudioToTextDataLayer`` cannot shuffle the whole dataset (see ``shuffle_n``).
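
The effect of ``num_shards`` and ``shuffle`` can be sketched as follows. This is only an illustration of the idea, not the code of ``convert_to_tarred_audio_dataset.py``; the function ``assign_to_shards``, its ``seed`` parameter, and the ``"id"`` key in the sample entries are hypothetical.

```python
import random
from typing import Dict, List, Optional


def assign_to_shards(
    entries: List[Dict],
    num_shards: int = 1,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> List[List[Dict]]:
    """Split manifest entries into `num_shards` groups of near-equal size.

    Hypothetical helper: optionally shuffles a copy of the entries first,
    then cuts it into contiguous chunks, one chunk per tarball.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    entries = list(entries)  # work on a copy; the caller's list is untouched
    if shuffle:
        random.Random(seed).shuffle(entries)
    base, extra = divmod(len(entries), num_shards)
    shards: List[List[Dict]] = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards receive one additional entry.
        size = base + (1 if i < extra else 0)
        shards.append(entries[start:start + size])
        start += size
    return shards


if __name__ == "__main__":
    manifest = [{"id": i} for i in range(10)]
    for shard in assign_to_shards(manifest, num_shards=3, shuffle=True, seed=0):
        print([e["id"] for e in shard])
```

Shuffling before the split matters for ordered datasets: without it, each tarball would hold a contiguous slice of the original order, and the buffer-based shuffle at read time could not mix entries across distant slices.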


Kaldi Compatibility
-------------------

Expand All @@ -354,7 +388,7 @@ Of course, you will also need the .ark files that contain the audio data in the

To load your Kaldi-formatted data, you can simply use the ``KaldiFeatureDataLayer`` instead of the ``AudioToTextDataLayer``.
The ``KaldiFeatureDataLayer`` takes in an argument ``kaldi_dir`` instead of a ``manifest_filepath``, and this argument should be set to the directory that contains the files mentioned above.
See `the documentation <https://nvidia.github.io/NeMo/collections/nemo_asr.html#nemo_asr.data_layer.KaldiFeatureDataLayer>`_ for more detailed information about the arguments to this data layer.
See `the documentation <https://nvidia.github.io/NeMo/collections/nemo_asr.html#nemo.collections.asr.data_layer.KaldiFeatureDataLayer>`_ for more detailed information about the arguments to this data layer.

.. note::

7 changes: 4 additions & 3 deletions docs/sources/source/index.rst
@@ -1,5 +1,5 @@
NVIDIA Neural Modules Developer Guide
=====================================
NVIDIA NeMo Developer Guide
===========================

.. toctree::
:hidden:
Expand All @@ -9,6 +9,7 @@ NVIDIA Neural Modules Developer Guide
tutorials/intro
training
asr/intro
speaker_recognition/intro
speech_command/intro
nlp/intro
tts/intro
Expand All @@ -17,7 +18,7 @@ NVIDIA Neural Modules Developer Guide
chinese/intro


Neural Modules (NeMo) is a framework-agnostic toolkit for building AI applications powered by Neural Modules. Current support is for PyTorch framework.
NeMo is a framework-agnostic toolkit for building AI applications powered by Neural Modules. Current support is for PyTorch framework.

A "Neural Module" is a block of code that computes a set of outputs from a set of inputs.

6 changes: 6 additions & 0 deletions docs/sources/source/nlp/intro.rst
Expand Up @@ -32,6 +32,12 @@ Pretraining BERT

bert_pretraining

Megatron-LM for Downstream tasks
--------------------------------
.. toctree::
:maxdepth: 8

megatron_finetuning

Transformer Language Model
--------------------------
7 changes: 3 additions & 4 deletions docs/sources/source/nlp/joint_intent_slot_filling.rst
Expand Up @@ -7,7 +7,7 @@ All the code introduced in this tutorial is based on ``examples/nlp/intent_detec

There are a variety of pre-trained BERT models that we can select as the base encoder for our model. We're currently
using the script for loading pre-trained models from `transformers`. \
See the list of available pre-trained models by calling `nemo.collections.nlp.nm.trainables.get_bert_models_list()`. \
See the list of available pre-trained models by calling `nemo_nlp.nm.trainables.get_pretrained_lm_models_list()`. \
The type of the encoder is set by the argument `--pretrained_model_name`.

.. tip::
Expand Down Expand Up @@ -95,9 +95,8 @@ Next, we define all Neural Modules participating in our joint intent slot fillin

.. code-block:: python
pretrained_bert_model = nemo_nlp.nm.trainables.get_huggingface_model(
bert_config=args.bert_config, pretrained_model_name=args.pretrained_model_name
)
bert_model = nemo_nlp.nm.trainables.get_pretrained_lm_model(
pretrained_model_name=args.pretrained_model_name)
* Create the classifier heads for our task.

36 changes: 36 additions & 0 deletions docs/sources/source/nlp/megatron_finetuning.rst
@@ -0,0 +1,36 @@
Megatron-LM for Downstream Tasks
================================

Megatron :cite:`nlp-megatron-lm-shoeybi2020megatron` is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.
More details can be found in the `Megatron-LM GitHub repo <https://github.com/NVIDIA/Megatron-LM>`_.

In order to finetune a pretrained Megatron BERT language model on NLP downstream tasks from `examples/nlp <https://github.com/NVIDIA/NeMo/tree/master/examples/nlp>`_, specify the pretrained_model_name like this:

.. code-block:: bash
--pretrained_model_name megatron-bert-345m-uncased
For example, to finetune SQuAD v1.1 with Megatron-LM, run:

.. code-block:: bash
python question_answering_squad.py \
--train_file PATH_TO_DATA_DIR/squad/v1.1/train-v1.1.json \
--eval_file PATH_TO_DATA_DIR/squad/v1.1/dev-v1.1.json \
--pretrained_model_name megatron-bert-345m-uncased
If you have a different checkpoint or model configuration, use ``--pretrained_model_name megatron-bert-uncased`` or ``--pretrained_model_name megatron-bert-cased`` and specify ``--bert_config`` and ``--bert_checkpoint`` for your model.

.. note::
Megatron-LM has its own set of training arguments (including the tokenizer) that are ignored during finetuning in NeMo. Refer to the downstream task training scripts for the arguments that NeMo supports.



References
----------

.. bibliography:: nlp_all_refs.bib
:style: plain
:labelprefix: NLP-MEGATRON-LM
:keyprefix: nlp-megatron-lm-
9 changes: 6 additions & 3 deletions docs/sources/source/nlp/ner.rst
Expand Up @@ -67,12 +67,15 @@ First, we need to create our neural factory with the supported backend. How you
Next, we'll need to define our tokenizer and our BERT model. There are a couple of different ways you can do this. Keep in mind that NER benefits from casing ("New York City" is easier to identify than "new york city"), so we recommend you use cased models.

If you're using a standard BERT model, you should do it as follows. To see the full list of BERT model names, check out ``nemo.collections.nlp.nm.trainables.get_bert_models_list()``
If you're using a standard BERT model, you should do it as follows. To see the full list of BERT model names, check out ``nemo_nlp.nm.trainables.get_pretrained_lm_models_list()``

.. code-block:: python
tokenizer = nemo.collections.nlp.data.NemoBertTokenizer(pretrained_model="bert-base-cased")
bert_model = nemo_nlp.nm.trainables.huggingface.BERT(
bert_model = nemo_nlp.nm.trainables.get_pretrained_lm_model(
pretrained_model_name="bert-base-cased")
tokenizer = nemo.collections.nlp.data.tokenizers.get_tokenizer(
tokenizer_name="nemobert",
pretrained_model_name="bert-base-cased")
See ``examples/nlp/token_classification/token_classification.py`` for how to use a BERT model that you pre-trained yourself.
8 changes: 8 additions & 0 deletions docs/sources/source/nlp/nlp_all_refs.bib
Expand Up @@ -161,3 +161,11 @@ @article{henderson2015machine
journal={research.google},
year={2015}
}

@article{shoeybi2020megatron,
title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
journal={arXiv preprint arXiv:1909.08053},
year={2020}
}

9 changes: 6 additions & 3 deletions docs/sources/source/nlp/punctuation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -102,12 +102,15 @@ Next, we'll need to define our tokenizer and our BERT model. Currently, there ar
BERT, ALBERT and RoBERTa. These are pretrained model checkpoints from `transformers <https://huggingface.co/transformers>`__ . Apart from these, the user can also do fine-tuning
on a custom BERT checkpoint, specified by the `--bert_checkpoint` argument in the training script.
The pretrained backbone models can be specified with `--pretrained_model_name`.
See the list of available pre-trained models by calling `nemo.collections.nlp.nm.trainables.get_bert_models_list()`. \
See the list of available pre-trained models by calling `nemo_nlp.nm.trainables.get_pretrained_lm_models_list()`. \

.. code-block:: python
tokenizer = nemo.collections.nlp.data.NemoBertTokenizer(pretrained_model=PRETRAINED_BERT_MODEL)
bert_model = nemo_nlp.nm.trainables.huggingface.BERT(
bert_model = nemo_nlp.nm.trainables.get_pretrained_lm_model(
pretrained_model_name=PRETRAINED_BERT_MODEL)
tokenizer = nemo.collections.nlp.data.tokenizers.get_tokenizer(
tokenizer_name="nemobert",
pretrained_model_name=PRETRAINED_BERT_MODEL)
Now, create the train and evaluation data layers:
19 changes: 19 additions & 0 deletions docs/sources/source/speaker_recognition/datasets.rst
@@ -0,0 +1,19 @@
Datasets
========

HI-MIA
--------

Run the script below to download and process the HI-MIA dataset and generate files in the format supported by `nemo_asr`. Set the HI-MIA data folder
with `--data_root`. The scripts are located in <nemo_root>/scripts.

.. code-block:: bash
python get_hi-mia_data.py --data_root=<data directory>
After download and conversion, your `data` folder should contain directories with the following files:

* `data/<set>/train.json`
* `data/<set>/dev.json`
* `data/<set>/{set}_all.json`
* `data/<set>/utt2spk`
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
.. include:: ../asr/installation.rst
17 changes: 17 additions & 0 deletions docs/sources/source/speaker_recognition/intro.rst
@@ -0,0 +1,17 @@
.. _speaker-recognition-docs:


Speaker Recognition
===================

.. toctree::
:maxdepth: 8

installation_link
tutorial
datasets
models




15 changes: 15 additions & 0 deletions docs/sources/source/speaker_recognition/models.rst
@@ -0,0 +1,15 @@
Models
====================

.. toctree::
:maxdepth: 8

quartznet

References
----------

.. bibliography:: speaker.bib
:style: plain
:labelprefix: SPEAKER-TUT
:keyprefix: speaker-tut-