Merge branch 'master' of https://github.com/NVIDIA/NeMo into jenkins_test
ekmb committed Feb 13, 2020
2 parents f8bd75a + f072029 commit 880c730
Showing 120 changed files with 3,029 additions and 3,700 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -70,12 +70,16 @@ To release a new version, please update the changelog as follows:
## [Unreleased]

### Added
- New Neural Type System and its tests.
([PR #307](https://github.com/NVIDIA/NeMo/pull/307)) - @okuchaiev
- Named tensors tuple module's output for graph construction.
([PR #268](https://github.com/NVIDIA/NeMo/pull/268)) - @stasbel
- Introduced the `deprecated` decorator.
([PR #298](https://github.com/NVIDIA/NeMo/pull/298)) - @tkornuta-nvidia

### Changed
- All collections changed to use New Neural Type System.
([PR #307](https://github.com/NVIDIA/NeMo/pull/307)) - @okuchaiev
- Additional Collections Repositories merged into core `nemo_toolkit` package.
([PR #289](https://github.com/NVIDIA/NeMo/pull/289)) - @DEKHTIARJonathan
- Refactored manifest file parsing and processing for reuse.
188 changes: 131 additions & 57 deletions Jenkinsfile

Large diffs are not rendered by default.

127 changes: 80 additions & 47 deletions docs/docs_zh/sources/source/nlp/bert_pretraining.rst
@@ -5,7 +5,7 @@ Pretraining BERT

For some applications, it is more advantageous to create a domain-specific BERT model, for example a BERT specialized for the biomedical domain, such as BioBERT :cite:`nlp-bert-lee2019biobert` and SciBERT :cite:`nlp-bert-beltagy2019scibert`.

The code used in this tutorial can be found at ``examples/nlp/bert_pretraining.py``.
The code used in this tutorial can be found at ``examples/nlp/language_modeling/bert_pretraining.py``.

Download Corpus
---------------
@@ -51,10 +51,20 @@ Pretraining BERT
# If you're using a custom vocabulary, create your tokenizer like this
tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")
tokenizer.add_special_tokens(["[MASK]", "[CLS]", "[SEP]"])
special_tokens = {
"sep_token": "[SEP]",
"pad_token": "[PAD]",
"bos_token": "[CLS]",
"mask_token": "[MASK]",
"eos_token": "[SEP]",
"cls_token": "[CLS]",
}
tokenizer.add_special_tokens(special_tokens)
# Otherwise, create your tokenizer like this
tokenizer = NemoBertTokenizer(vocab_file="vocab.txt")
# or
tokenizer = NemoBertTokenizer(pretrained_model="bert-base-uncased")
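Whichever of the tokenizers above you create, it is used the same way in the rest of the tutorial. A minimal sanity-check sketch, assuming the common ``text_to_ids`` / ``ids_to_text`` tokenizer interface (these method names are an assumption and may differ between NeMo versions):

.. code-block:: python

    # hypothetical round trip: encode a sentence to vocabulary ids and decode it back
    ids = tokenizer.text_to_ids("hello world")
    restored = tokenizer.ids_to_text(ids)
    print(len(ids), restored)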
Create the model
----------------
@@ -78,76 +88,99 @@ Pretraining BERT

.. code-block:: python
bert_model = nemo_nlp.huggingface.BERT(
vocab_size=tokenizer.vocab_size,
num_layers=args.num_layers,
d_model=args.d_model,
num_heads=args.num_heads,
d_inner=args.d_inner,
max_seq_length=args.max_seq_length,
hidden_act="gelu")
bert_model = nemo_nlp.nm.trainables.huggingface.BERT(
vocab_size=args.vocab_size,
num_hidden_layers=args.num_hidden_layers,
hidden_size=args.hidden_size,
num_attention_heads=args.num_attention_heads,
intermediate_size=args.intermediate_size,
max_position_embeddings=args.max_seq_length,
hidden_act=args.hidden_act)
If you want to continue training from an existing BERT model file, simply set the model name. For the full list of pretrained BERT models, use `nemo_nlp.huggingface.BERT.list_pretrained_models()`.

.. code-block:: python
bert_model = nemo_nlp.huggingface.BERT(pretrained_model_name="bert-base-cased")
bert_model = nemo_nlp.nm.trainables.huggingface.BERT(pretrained_model_name="bert-base-cased")
Next, we need to define the classifiers and the loss functions. In this tutorial we use both the masked language model (MLM) and next sentence prediction (NSP) losses; if you pretrain with only the MLM loss, you may observe higher downstream accuracy.

.. code-block:: python
mlm_classifier = nemo_nlp.TokenClassifier(args.d_model,
mlm_classifier = nemo_nlp.nm.trainables.TokenClassifier(args.d_model,
num_classes=tokenizer.vocab_size,
num_layers=1,
log_softmax=True)
mlm_loss_fn = nemo_nlp.MaskedLanguageModelingLossNM()
mlm_loss_fn = nemo_nlp.nm.losses.MaskedLanguageModelingLossNM()
nsp_classifier = nemo_nlp.SequenceClassifier(args.d_model,
nsp_classifier = nemo_nlp.nm.trainables.SequenceClassifier(args.d_model,
num_classes=2,
num_layers=2,
log_softmax=True)
nsp_loss_fn = nemo.backends.pytorch.common.CrossEntropyLoss()
bert_loss = nemo_nlp.LossAggregatorNM(num_inputs=2)
bert_loss = nemo_nlp.nm.losses.LossAggregatorNM(num_inputs=2)
Then, we wrap the whole computation pipeline from input to output into a single function, which lets us conveniently create separate training and evaluation pipelines:

.. code-block:: python
def create_pipeline(**args):
dataset = nemo_nlp.BertPretrainingDataset(**params)
data_layer = nemo_nlp.BertPretrainingDataLayer(dataset)
steps_per_epoch = len(data_layer) // (batch_size * args.num_gpus)
input_ids, input_type_ids, input_mask, \
output_ids, output_mask, nsp_labels = data_layer()
hidden_states = bert_model(input_ids=input_ids,
token_type_ids=input_type_ids,
attention_mask=input_mask)
mlm_logits = mlm_classifier(hidden_states=hidden_states)
mlm_loss = mlm_loss_fn(logits=mlm_logits,
output_ids=output_ids,
output_mask=output_mask)
nsp_logits = nsp_classifier(hidden_states=hidden_states)
nsp_loss = nsp_loss_fn(logits=nsp_logits, labels=nsp_labels)
loss = bert_loss(loss_1=mlm_loss, loss_2=nsp_loss)
return loss, [mlm_loss, nsp_loss], steps_per_epoch
train_loss, _, steps_per_epoch = create_pipeline(data_desc.train_file,
args.max_seq_length,
args.mask_probability,
args.batch_size)
eval_loss, eval_tensors, _ = create_pipeline(data_desc.eval_file,
args.max_seq_length,
args.mask_probability,
args.eval_batch_size)
data_layer = nemo_nlp.nm.data_layers.BertPretrainingDataLayer(
tokenizer,
data_file,
max_seq_length,
mask_probability,
short_seq_prob,
batch_size)
# for preprocessed data
# data_layer = nemo_nlp.BertPretrainingPreprocessedDataLayer(
# data_file,
# max_predictions_per_seq,
# batch_size, is_training)
steps_per_epoch = len(data_layer) // (batch_size * args.num_gpus * args.batches_per_step)
input_data = data_layer()
hidden_states = bert_model(input_ids=input_data.input_ids,
token_type_ids=input_data.input_type_ids,
attention_mask=input_data.input_mask)
mlm_logits = mlm_classifier(hidden_states=hidden_states)
mlm_loss = mlm_loss_fn(logits=mlm_logits,
output_ids=input_data.output_ids,
output_mask=input_data.output_mask)
nsp_logits = nsp_classifier(hidden_states=hidden_states)
nsp_loss = nsp_loss_fn(logits=nsp_logits, labels=input_data.labels)
loss = bert_loss(loss_1=mlm_loss, loss_2=nsp_loss)
return loss, mlm_loss, nsp_loss, steps_per_epoch
train_loss, _, _, steps_per_epoch = create_pipeline(
data_file=data_desc.train_file,
preprocessed_data=False,
max_seq_length=args.max_seq_length,
mask_probability=args.mask_probability,
short_seq_prob=args.short_seq_prob,
batch_size=args.batch_size,
batches_per_step=args.batches_per_step)
# for preprocessed data
# train_loss, _, _, steps_per_epoch = create_pipeline(
# data_file=args.data_dir,
# preprocessed_data=True,
# max_predictions_per_seq=args.max_predictions_per_seq,
# training=True,
# batch_size=args.batch_size,
# batches_per_step=args.batches_per_step)
eval_loss, eval_tensors, _ = create_pipeline(data_desc.eval_file,
args.max_seq_length,
13 changes: 6 additions & 7 deletions docs/sources/source/nlp/asr-improvement.rst
@@ -66,18 +66,17 @@ Then we define a tokenizer to convert tokens into indices. We will use ``bert-base

.. code-block:: python
tokenizer = NemoBertTokenizer(pretrained_model="bert-base-uncased")
tokenizer = nemo_nlp.data.NemoBertTokenizer(pretrained_model="bert-base-uncased")
The encoder block is a neural module corresponding to BERT language model from
``nemo_nlp.huggingface`` collection:
``nemo_nlp.nm.trainables.huggingface`` collection:

.. code-block:: python
zeros_transform = nemo.backends.pytorch.common.ZerosLikeNM()
encoder = nemo_nlp.huggingface.BERT(
pretrained_model_name=args.pretrained_model,
local_rank=args.local_rank)
encoder = nemo_nlp.nm.trainables.huggingface.BERT(
pretrained_model_name=args.pretrained_model)
.. tip::
Making embedding size (as well as all other tensor dimensions) divisible
@@ -100,7 +99,7 @@ learn positional encodings ``"learn_positional_encodings": True``:

.. code-block:: python
decoder = nemo_nlp.TransformerDecoderNM(
decoder = nemo_nlp.nm.trainables.TransformerDecoderNM(
d_model=args.d_model,
d_inner=args.d_inner,
num_layers=args.num_layers,
@@ -123,7 +122,7 @@ To load the pretrained parameters into the decoder, we use the ``restore_from`` attribute
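A minimal illustration of that ``restore_from`` call; the checkpoint path below is a hypothetical placeholder, not the name used by the example script:

.. code-block:: python

    # load previously saved decoder weights; replace the path with your own checkpoint
    decoder.restore_from("decoder_checkpoint.pt")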
Model training
--------------

To train the model run ``asr_postprocessor.py`` located in the ``examples/nlp`` directory. We train with the novograd optimizer :cite:`asr-imps-ginsburg2019stochastic`,
To train the model run ``asr_postprocessor.py`` located in the ``examples/nlp/asr_postprocessor`` directory. We train with the novograd optimizer :cite:`asr-imps-ginsburg2019stochastic`,
learning rate ``lr=0.001``, polynomial learning rate decay policy, ``1000`` warmup steps, per-gpu batch size of ``4096*8`` tokens, and ``0.25`` dropout probability.
We trained on 8 GPUs. To launch the training in multi-GPU mode, run the following command:

48 changes: 28 additions & 20 deletions docs/sources/source/nlp/bert_pretraining.rst
@@ -4,7 +4,13 @@ Pretraining BERT
In this tutorial, we will build and train a masked language model, either from scratch or from a pretrained BERT model, using the BERT architecture :cite:`nlp-bert-devlin2018bert`.
Make sure you have ``nemo`` and ``nemo_nlp`` installed before starting this tutorial. See the :ref:`installation` section for more details.

The code used in this tutorial can be found at ``examples/nlp/bert_pretraining.py``.
The code used in this tutorial can be found at ``examples/nlp/language_modeling/bert_pretraining.py``.

.. tip::
Pretrained BERT models can be found at
`https://ngc.nvidia.com/catalog/models/nvidia:bertlargeuncasedfornemo <https://ngc.nvidia.com/catalog/models/nvidia:bertlargeuncasedfornemo>`__
`https://ngc.nvidia.com/catalog/models/nvidia:bertbaseuncasedfornemo <https://ngc.nvidia.com/catalog/models/nvidia:bertbaseuncasedfornemo>`__
`https://ngc.nvidia.com/catalog/models/nvidia:bertbasecasedfornemo <https://ngc.nvidia.com/catalog/models/nvidia:bertbasecasedfornemo>`__

Introduction
------------
@@ -61,7 +67,7 @@ If you have an available vocab, say the ``vocab.txt`` file from any `pretrained BERT

.. code-block:: python
data_desc = BERTPretrainingDataDesc(args.dataset_name,
data_desc = nemo_nlp.data.BERTPretrainingDataDesc(args.dataset_name,
args.data_dir,
args.vocab_size,
args.sample_size,
@@ -76,11 +82,14 @@ To train on a Chinese dataset, you should use `NemoBertTokenizer`.
.. code-block:: python
# If you're using a custom vocabulary, create your tokenizer like this
tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")
tokenizer.add_special_tokens(["[MASK]", "[CLS]", "[SEP]"])
tokenizer = nemo_nlp.data.SentencePieceTokenizer(model_path="tokenizer.model")
special_tokens = nemo_nlp.utils.MODEL_SPECIAL_TOKENS['bert']
tokenizer.add_special_tokens(special_tokens)
# Otherwise, create your tokenizer like this
tokenizer = NemoBertTokenizer(vocab_file="vocab.txt")
tokenizer = nemo_nlp.data.NemoBertTokenizer(vocab_file="vocab.txt")
# or
tokenizer = nemo_nlp.data.NemoBertTokenizer(pretrained_model="bert-base-uncased")
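For reference, the ``MODEL_SPECIAL_TOKENS['bert']`` mapping used above is expected to look roughly like the dictionary spelled out in the Chinese version of this tutorial; the sketch below is illustrative and the actual constant may contain additional entries (for example an unknown token):

.. code-block:: python

    # approximate contents of nemo_nlp.utils.MODEL_SPECIAL_TOKENS['bert']
    bert_special_tokens = {
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "bos_token": "[CLS]",
        "mask_token": "[MASK]",
        "eos_token": "[SEP]",
        "cls_token": "[CLS]",
    }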
Create the model
----------------
@@ -105,7 +114,7 @@ We also need to define the BERT model that we will be pre-training. Here, you ca

.. code-block:: python
bert_model = nemo_nlp.huggingface.BERT(
bert_model = nemo_nlp.nm.trainables.huggingface.BERT(
vocab_size=args.vocab_size,
num_hidden_layers=args.num_hidden_layers,
hidden_size=args.hidden_size,
@@ -126,22 +135,22 @@ For the full list of BERT model names, check out `nemo_nlp.huggingface.BERT.list

.. code-block:: python
bert_model = nemo_nlp.huggingface.BERT(pretrained_model_name="bert-base-cased")
bert_model = nemo_nlp.nm.trainables.huggingface.BERT(pretrained_model_name="bert-base-cased")
Next, we will define our classifier and loss functions. We will demonstrate how to pre-train with both MLM (masked language model) and NSP (next sentence prediction) losses,
but you may observe higher downstream accuracy by only pre-training with MLM loss.

.. code-block:: python
mlm_classifier = nemo_nlp.BertTokenClassifier(
mlm_classifier = nemo_nlp.nm.trainables.BertTokenClassifier(
args.hidden_size,
num_classes=args.vocab_size,
activation=ACT2FN[args.hidden_act],
log_softmax=True)
mlm_loss_fn = nemo_nlp.MaskedLanguageModelingLossNM()
mlm_loss_fn = nemo_nlp.nm.losses.MaskedLanguageModelingLossNM()
nsp_classifier = nemo_nlp.SequenceClassifier(
nsp_classifier = nemo_nlp.nm.trainables.SequenceClassifier(
args.hidden_size,
num_classes=2,
num_layers=2,
Expand All @@ -150,7 +159,7 @@ but you may observe higher downstream accuracy by only pre-training with MLM los
nsp_loss_fn = nemo.backends.pytorch.common.CrossEntropyLoss()
bert_loss = nemo_nlp.LossAggregatorNM(num_inputs=2)
bert_loss = nemo_nlp.nm.losses.LossAggregatorNM(num_inputs=2)
Then, we create the pipeline from input to output that can be used for both training and evaluation:

@@ -159,7 +168,7 @@ For training from raw text use nemo_nlp.BertPretrainingDataLayer, for preprocess
.. code-block:: python
def create_pipeline(**args):
data_layer = nemo_nlp.BertPretrainingDataLayer(
data_layer = nemo_nlp.nm.data_layers.BertPretrainingDataLayer(
tokenizer,
data_file,
max_seq_length,
Expand All @@ -174,20 +183,19 @@ For training from raw text use nemo_nlp.BertPretrainingDataLayer, for preprocess
steps_per_epoch = len(data_layer) // (batch_size * args.num_gpus * args.batches_per_step)
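# For example (hypothetical numbers): with 1,000,000 training pairs, batch_size=64,
# num_gpus=8 and batches_per_step=1, each optimizer step consumes 64 * 8 * 1 = 512 pairs,
# so steps_per_epoch = 1,000,000 // 512 = 1953.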
input_ids, input_type_ids, input_mask, \
output_ids, output_mask, nsp_labels = data_layer()
input_data = data_layer()
hidden_states = bert_model(input_ids=input_ids,
token_type_ids=input_type_ids,
attention_mask=input_mask)
hidden_states = bert_model(input_ids=input_data.input_ids,
token_type_ids=input_data.input_type_ids,
attention_mask=input_data.input_mask)
mlm_logits = mlm_classifier(hidden_states=hidden_states)
mlm_loss = mlm_loss_fn(logits=mlm_logits,
output_ids=output_ids,
output_mask=output_mask)
output_ids=input_data.output_ids,
output_mask=input_data.output_mask)
nsp_logits = nsp_classifier(hidden_states=hidden_states)
nsp_loss = nsp_loss_fn(logits=nsp_logits, labels=nsp_labels)
nsp_loss = nsp_loss_fn(logits=nsp_logits, labels=input_data.labels)
loss = bert_loss(loss_1=mlm_loss, loss_2=nsp_loss)
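The diff of this file is truncated here. For orientation only, a minimal sketch of how the loss tensors produced by ``create_pipeline`` are typically handed to the NeMo trainer; the factory object ``nf``, the callback arguments, and the optimizer settings below are illustrative assumptions rather than the exact code in ``bert_pretraining.py``:

.. code-block:: python

    # assumes `nf = nemo.core.NeuralModuleFactory(...)` was created earlier in the script
    train_callback = nemo.core.SimpleLossLoggerCallback(
        tensors=[train_loss],
        print_func=lambda x: print("Loss: {:.3f}".format(x[0].item())))

    nf.train(tensors_to_optimize=[train_loss],
             callbacks=[train_callback],
             optimizer="novograd",  # illustrative choice; see the example script for the real flags
             optimization_params={"lr": 0.01, "num_epochs": 2})  # hypothetical values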
4 changes: 4 additions & 0 deletions docs/sources/source/nlp/joint_intent_slot_filling.rst
@@ -9,6 +9,10 @@ There are four pre-trained BERT models that we can select from using the argumen
using the script for loading pre-trained models from `pytorch_transformers`. See the list of available pre-trained models
`here <https://huggingface.co/pytorch-transformers/pretrained_models.html>`__.

.. tip::

For pretraining BERT in NeMo and pretrained model checkpoints go to `BERT pretraining <https://nvidia.github.io/NeMo/nlp/bert_pretraining.html>`__.


Preliminaries
-------------
6 changes: 6 additions & 0 deletions docs/sources/source/nlp/ner.rst
@@ -4,6 +4,12 @@ Tutorial
Make sure you have ``nemo`` and ``nemo_nlp`` installed before starting this
tutorial. See the :ref:`installation` section for more details.

.. tip::

For pretraining BERT in NeMo and pretrained model checkpoints go to `BERT pretraining <https://nvidia.github.io/NeMo/nlp/bert_pretraining.html>`__.



Introduction
------------
