Merge r1.7.1 to main (#3824)
* Tn bug 1.7.0 (#3730)

* fix es and fr bug

Signed-off-by: Yang Zhang <[email protected]>

* add file

Signed-off-by: Yang Zhang <[email protected]>

* [TTS] Fix bugs in E2E TTS, Mixer-TTS and FastPitch (#3740)

* fix bugs

Signed-off-by: Oktai Tatanov <[email protected]>

* fix bug in e2e tts and mixer tts

Signed-off-by: Oktai Tatanov <[email protected]>

* Mirror AN4 data while servers are down (#3743)

Signed-off-by: smajumdar <[email protected]>

* Bugfix for GPT eval  (#3744)

* use tokens_cut not tokens

Signed-off-by: ericharper <[email protected]>

* remove precision conversion and comment jit for bias gelu

Signed-off-by: ericharper <[email protected]>

* revert comment update mbs in config

Signed-off-by: ericharper <[email protected]>

* calculate micro_batch_size during complete and compute_logprobs

Signed-off-by: ericharper <[email protected]>

* ASR SSL update (#3746)

* ssl update

Signed-off-by: sam1373 <[email protected]>

* tutorial update

Signed-off-by: sam1373 <[email protected]>

* Fix SSL configs for 1.7 (#3748)

* ssl update

Signed-off-by: sam1373 <[email protected]>

* tutorial update

Signed-off-by: sam1373 <[email protected]>

* revert configs

Signed-off-by: sam1373 <[email protected]>

* revert configs

Signed-off-by: sam1373 <[email protected]>

* punct process bug fix (#3747)

Signed-off-by: ekmb <[email protected]>

Co-authored-by: Somshubra Majumdar <[email protected]>

* updated conformer models. (#3741)

Signed-off-by: Vahid <[email protected]>

Co-authored-by: Somshubra Majumdar <[email protected]>

* Yuya/megatron t5 glue eval (#3751)

* Add megatron t5 glue eval-only script

Signed-off-by: Yu Yao <[email protected]>

* Update megatron t5 glue eval default configs

Signed-off-by: Yu Yao <[email protected]>

* Update megatron t5 glue eval configs

Signed-off-by: Yu Yao <[email protected]>

* Update config comments

Signed-off-by: Yu Yao <[email protected]>

Co-authored-by: Yu Yao <[email protected]>

* Specify gpus in SSL notebook (#3753)

* ssl update

Signed-off-by: sam1373 <[email protected]>

* tutorial update

Signed-off-by: sam1373 <[email protected]>

* revert configs

Signed-off-by: sam1373 <[email protected]>

* revert configs

Signed-off-by: sam1373 <[email protected]>

* specify gpus

Signed-off-by: sam1373 <[email protected]>

* Duplex model inference fix, money encoder fix (#3754)

Signed-off-by: ekmb <[email protected]>

* Update docs for RNNT and overriding fused batch size (#3755)

Signed-off-by: smajumdar <[email protected]>

* fix consumed samples calculation + PTune Model bugs (#3738)

* fix the way computing consumed samples

Signed-off-by: Yi Dong <[email protected]>

* fixed ptune model

Signed-off-by: Yi Dong <[email protected]>

* make sure notebook is working

Signed-off-by: Yi Dong <[email protected]>

* added try-catch

Signed-off-by: Yi Dong <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* fix directories in ssl notebook (#3758)

* ssl update

Signed-off-by: sam1373 <[email protected]>

* tutorial update

Signed-off-by: sam1373 <[email protected]>

* revert configs

Signed-off-by: sam1373 <[email protected]>

* revert configs

Signed-off-by: sam1373 <[email protected]>

* specify gpus

Signed-off-by: sam1373 <[email protected]>

* update dirs

Signed-off-by: sam1373 <[email protected]>

* TN docs update (#3735)

* TN docs update: audio based docs added, quick start, ref fixed, etc

Signed-off-by: ekmb <[email protected]>

* add deployment script dir and Sp TN

Signed-off-by: ekmb <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>

* Update Tacotron2_Training.ipynb (#3769)

Signed-off-by: Jason <[email protected]>

* fix dockerfile (#3778)

Signed-off-by: Yang Zhang <[email protected]>

* Prompt-Tuning-Documentation (#3777)

* Update megatron.rst

* Updated example prompt tuning script's doc string

* Update megatron.rst

* Update megatron.rst

Co-authored-by: Eric Harper <[email protected]>

* Prompt tuning bug fix (#3780)

* Making updated code backwards compatible with previous prompt tuned models

Signed-off-by: Virginia Adams <[email protected]>

* Fixed backward compatiablity bug

Signed-off-by: Virginia Adams <[email protected]>

* Removed random import

Signed-off-by: Virginia Adams <[email protected]>

Co-authored-by: Eric Harper <[email protected]>

* update branch

Signed-off-by: ericharper <[email protected]>

* revert changes (#3785)

Signed-off-by: Yang Zhang <[email protected]>

* Fixed soft prompt eval loading bug (#3805)

Signed-off-by: Virginia Adams <[email protected]>

* mT5 whole word masking and T5 finetuning config fixes (#3776)

* O2 and whole word masking changes

Signed-off-by: MaximumEntropy <[email protected]>

* Style

Signed-off-by: MaximumEntropy <[email protected]>

* Update yaml

Signed-off-by: MaximumEntropy <[email protected]>

* Tok and O2 fix

Signed-off-by: MaximumEntropy <[email protected]>

* Fix arg passing

Signed-off-by: MaximumEntropy <[email protected]>

* Fix checkpoint path

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Raise error if FP16 training is tried with O2 recipe. (#3806)

* raise error

Signed-off-by: ericharper <[email protected]>

* update assert

Signed-off-by: ericharper <[email protected]>

* update error message

Signed-off-by: ericharper <[email protected]>

* update error message

Signed-off-by: ericharper <[email protected]>

* update branch

Signed-off-by: ericharper <[email protected]>

* remove test

Signed-off-by: ericharper <[email protected]>

* revert bad merges

Signed-off-by: ericharper <[email protected]>

* revert change partitions

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: Oktai Tatanov <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Samuel Kriman <[email protected]>
Co-authored-by: Evelina <[email protected]>
Co-authored-by: Vahid Noroozi <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Yu Yao <[email protected]>
Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Jason <[email protected]>
Co-authored-by: Virginia Adams <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
14 people authored Mar 14, 2022
1 parent b31d8aa commit bdb561c
Showing 12 changed files with 88 additions and 73 deletions.
55 changes: 28 additions & 27 deletions Jenkinsfile
@@ -2181,33 +2181,34 @@ pipeline {
}


stage('L2: Megatron GPT Convert from Megatron-LM checkpoing and Eval') {
when {
anyOf {
branch 'main'
changeRequest target: 'main'
}
}
failFast true
steps {
sh "python -m torch.distributed.launch --nproc_per_node=2 \
examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
--checkpoint_folder=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700 \
--checkpoint_name=model_optim_rng.pt \
--hparams_file=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700/hparams.yaml \
--nemo_file_path=examples/nlp/language_modeling/small_gpt.nemo \
--model_type=gpt \
--pipeline_model_parallel_size=1 \
--gpus_per_node=2 \
--tensor_model_parallel_size=2"
sh "python examples/nlp/language_modeling/megatron_gpt_eval.py \
--model_file=examples/nlp/language_modeling/small_gpt.nemo \
--tokens_to_generate=32 \
--tensor_model_parallel_size=2 \
--prompt='This is a test.'"
sh "rm examples/nlp/language_modeling/small_gpt.nemo"
}
}
// TODO: Add this test back. Test was failing on CI machines due to HW error
// stage('L2: Megatron GPT Convert from Megatron-LM checkpoing and Eval') {
// when {
// anyOf {
// branch 'main'
// changeRequest target: 'main'
// }
// }
// failFast true
// steps {
// sh "python -m torch.distributed.launch --nproc_per_node=2 \
// examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
// --checkpoint_folder=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700 \
// --checkpoint_name=model_optim_rng.pt \
// --hparams_file=/home/TestData/nlp/megatron_gpt/data/gpt/iter_0008700/hparams.yaml \
// --nemo_file_path=examples/nlp/language_modeling/small_gpt.nemo \
// --model_type=gpt \
// --pipeline_model_parallel_size=1 \
// --gpus_per_node=2 \
// --tensor_model_parallel_size=2"
// sh "python examples/nlp/language_modeling/megatron_gpt_eval.py \
// --model_file=examples/nlp/language_modeling/small_gpt.nemo \
// --tokens_to_generate=32 \
// --tensor_model_parallel_size=2 \
// --prompt='This is a test.'"
// sh "rm examples/nlp/language_modeling/small_gpt.nemo"
// }
// }
stage('L2: Megatron Change Partitions') {
when {
anyOf {
16 changes: 11 additions & 5 deletions docs/source/nlp/megatron.rst
@@ -173,18 +173,20 @@ Prompt tuning is a continuous or soft prompt approach to finding the optimal pro
Implementation Overview
^^^^^^^^^^

Our current prompt tuning implementation adapt’s Lester et. al’s EMNLP 2021 "`The Power of Scale for Parameter-Efficient Prompt Tuning <https://arxiv.org/abs/2104.08691>`_" to prompt tuning for GPT style models. In this implementation, a number of soft tokens specified by the user are prepended to the beginning of the discrete token input embeddings during the forward pass. During training, all model parameters are frozen except for those corresponding to the soft tokens. Only the soft prompt parameters are updated via gradient decent in the backward pass. Each soft token has the same dimensionality as a regular token embedding from the model’s vocabulary corresponding to the ``hidden_size`` hyperparameter. Soft token embeddings can be initialized randomly or with selected existing embeddings from the pretrained model.
Our current prompt tuning implementation adapt’s Lester et. al’s EMNLP 2021 "`The Power of Scale for Parameter-Efficient Prompt Tuning <https://arxiv.org/abs/2104.08691>`_" to prompt tuning for GPT style models. In this implementation, a number of soft tokens specified by the user are prepended to the beginning of the discrete token input embeddings during the forward pass. During training, all model parameters are frozen except for those corresponding to the soft tokens. Only the soft prompt parameters are updated via gradient decent in the backward pass. Each soft token has the same dimensionality as a regular token embedding from the model’s vocabulary corresponding to the ``hidden_size`` hyperparameter. Soft token embeddings can be initialized randomly or with selected existing embeddings from the pretrained model.

As of NeMo 1.7 prompt tuning now works with tensor parallel > 1.

Data Formatting
^^^^^^^^^^

The dataset should be a .json file where each json object has 2 fields: ``prompt_tag`` and ``text``.
The dataset should be a .jsonl file where each json object has 3 fields: ``prompt_tag``, ``text``, and ``answer``.

.. code::
{"prompt_tag": [tag1], "text": [text1]}
{"prompt_tag": [tag1], "text": [text2]}
{"prompt_tag": [tag1], "text": [text3]}
{"prompt_tag": [tag1], "text": [text1], "answer": [answer1]}
{"prompt_tag": [tag1], "text": [text2], "answer": [answer2]}
{"prompt_tag": [tag1], "text": [text3], "answer": [answer3]}
.. _data-example-label:

@@ -218,6 +220,9 @@ Prompt Tuning Specific Config Values
* - **model.new_prompt_init_text**
- list of strings
- The text you want to use for soft prompt initalization if ``model.new_prompt_init_methods`` is set to ['text']. The text is tokenized and clipped or tiled to match ``model.num_prompt_tokens``. The vocab embeddings associated with each token are copied and use to initialize the soft prompts.
* - **model.calc_loss_on_answer_only**
- bool
- Whether to calculate cross entropy loss on the full text input or only the answer portion of the input during prompt tuning.
* - **model.data.train_ds**
- string
- path to training dataset .json or .jsonl file. See `Data Formatting`_ for an example
@@ -228,6 +233,7 @@ Prompt Tuning Specific Config Values

Example Prompt Tuning Command for the First Task
^^^^^^^^^^

.. code::
EXPR_NAME='winogrande_prompt_tuning'
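
Note on the Data Formatting change in this file: the three-field .jsonl layout (``prompt_tag``, ``text``, ``answer``) could be produced and checked with a short script along the following lines. This is only a sketch. The helper names ``write_jsonl`` and ``validate_jsonl``, the file name, and the sample records are hypothetical and are not part of NeMo.

```python
import json
import os
import tempfile

# Field names taken from the updated documentation above.
REQUIRED_FIELDS = ("prompt_tag", "text", "answer")


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper, not a NeMo API)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def validate_jsonl(path):
    """Parse the file and raise if any line lacks a required field."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            records.append(record)
    return records


# Made-up sample records, for illustration only.
samples = [
    {"prompt_tag": "tag1", "text": "text1", "answer": "answer1"},
    {"prompt_tag": "tag1", "text": "text2", "answer": "answer2"},
]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(path, samples)
loaded = validate_jsonl(path)
```

A file in the older two-field form would fail this check on its first line, since ``answer`` would be reported as missing.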

@@ -10,9 +10,9 @@ trainer:
enable_checkpointing: False
replace_sampler_ddp: False
max_epochs: null
max_steps: 1000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
max_steps: 3000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
log_every_n_steps: 10
val_check_interval: 50
val_check_interval: 250
limit_val_batches: 50
limit_test_batches: 500
accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models
@@ -43,7 +43,7 @@ model:
# specify micro_batch_size, global_batch_size, and model parallelism
# gradient accumulation will be done automatically based on data_parallel_size
micro_batch_size: 4 # limited by GPU memory
global_batch_size: 16 # will use more micro batches to reach global batch size
global_batch_size: 8 # will use more micro batches to reach global batch size
tensor_model_parallel_size: 1 # intra-layer model parallelism
pipeline_model_parallel_size: 1 # inter-layer model parallelism

@@ -117,7 +117,7 @@ model:

optim:
name: fused_adam
lr: 2e-4
lr: 1e-5
weight_decay: 0.01
betas:
- 0.9
@@ -126,4 +126,4 @@ model:
name: CosineAnnealing
warmup_steps: 50
constant_steps: 10
min_lr: 2e-5
min_lr: 1e-6
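
Note on the batch-size change in this config: the comments say that gradient accumulation is derived automatically and that more micro batches are used to reach the global batch size. A minimal sketch of that relationship follows. It assumes the usual Megatron-style rule that the global batch is split evenly across data-parallel ranks and micro batches. The function name ``num_micro_batches`` and the ``data_parallel_size=1`` value are assumptions for illustration, not taken from this commit.

```python
def num_micro_batches(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro batches each data-parallel rank accumulates per optimizer step.

    Assumed rule: global_batch_size = micro_batch_size * data_parallel_size * n.
    """
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by "
            "micro_batch_size * data_parallel_size"
        )
    return global_batch_size // per_pass


# Values from the config diff above, with data_parallel_size assumed to be 1.
old_count = num_micro_batches(global_batch_size=16, micro_batch_size=4, data_parallel_size=1)
new_count = num_micro_batches(global_batch_size=8, micro_batch_size=4, data_parallel_size=1)
```

Under these assumptions the change from 16 to 8 halves the number of accumulated micro batches per step, from 4 to 2.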

@@ -27,11 +27,11 @@ exp_manager:
resume_ignore_no_checkpoint: True
create_checkpoint_callback: True
checkpoint_callback_params:
monitor: val_acc
monitor: validation_acc
save_top_k: 10
mode: max
always_save_nemo: False # TODO: add support
filename: 'megatron_t5--{val_acc:.3f}-{step}'
filename: 'megatron_t5--{validation_acc:.3f}-{step}'
model_parallel_size: ${model.tensor_model_parallel_size}
save_best_model: True


@@ -27,11 +27,11 @@ exp_manager:
resume_ignore_no_checkpoint: True
create_checkpoint_callback: True
checkpoint_callback_params:
monitor: val_acc
monitor: validation_acc
save_top_k: 10
mode: max
always_save_nemo: False # TODO: add support
filename: 'megatron_t5--{val_acc:.3f}-{step}'
filename: 'megatron_t5--{validation_acc:.3f}-{step}'
model_parallel_size: ${model.tensor_model_parallel_size}
save_best_model: True


@@ -162,18 +162,13 @@ def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id):
MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"])


def is_start_piece(piece, tokenizer_type='wordpiece'):
def is_start_piece(piece):
"""Check if the current word piece is the starting piece. (BERT)"""
# When a word has been split into
# WordPieces, the first token does not have any marker and any subsequence
# tokens are prefixed with ##. So whenever we see the ## token, we
# append it to the previous set of word indexes.
if tokenizer_type == 'wordpiece':
return not piece.startswith("##")
elif tokenizer_type == 'sentencepiece':
return piece.startswith('▁')
else:
raise ValueError(f"Tokenizer type {tokenizer_type} is not supported.")
return not piece.startswith("##")


def create_masked_lm_predictions(
@@ -217,15 +212,11 @@ def create_masked_lm_predictions(
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
if (
whole_word_masking
and len(cand_indexes) >= 1
and not is_start_piece(vocab_id_to_token_dict[token], tokenizer_type=tokenizer_type)
):
if whole_word_masking and len(cand_indexes) >= 1 and not is_start_piece(vocab_id_to_token_dict[token]):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])
if is_start_piece(vocab_id_to_token_dict[token], tokenizer_type=tokenizer_type):
if is_start_piece(vocab_id_to_token_dict[token]):
token_boundary[i] = 1

output_tokens = list(tokens)
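
Note on the whole word masking logic in this hunk: after the change, ``is_start_piece`` again relies only on the WordPiece ``##`` continuation marker. The grouping it drives can be sketched in isolation as below. The helper ``group_whole_words`` and the sample pieces are hypothetical, and the sketch leaves out the special-token and ``token_boundary`` handling that the real function performs.

```python
def is_start_piece(piece):
    """True if a WordPiece token starts a new word (no '##' prefix)."""
    return not piece.startswith("##")


def group_whole_words(pieces):
    """Group token positions so that each group covers one whole word.

    Hypothetical helper mirroring the cand_indexes loop in the diff above.
    """
    cand_indexes = []
    for i, piece in enumerate(pieces):
        if cand_indexes and not is_start_piece(piece):
            # Continuation piece: attach it to the current word.
            cand_indexes[-1].append(i)
        else:
            # Start piece: open a new candidate word.
            cand_indexes.append([i])
    return cand_indexes


pieces = ["un", "##afford", "##able", "price"]
groups = group_whole_words(pieces)
```

Masking then selects whole groups rather than single positions, so all pieces of a word are masked together.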

@@ -91,6 +91,10 @@ def __init__(
if not self.tokenizer.legacy:
raise ValueError("Sentencepiece Tokenizer must have legacy = False to add special tokens.")
self.tokenizer_type = 'sentencepiece'
if whole_word_masking:
raise ValueError(
"Whole word masking is not supported with sentencepiece tokenizers and only with wordpiece tokenizers. Please set it to False."
)

self.cls_id = tokenizer.cls_id
self.sep_id = tokenizer.sep_id

@@ -139,10 +139,6 @@ def __init__(self, cfg: DictConfig, trainer: Trainer):
tensor_model_parallel_size=cfg.get('tensor_model_parallel_size', 1),
)

# TODO: Not sure how to use lists of modules with PTL.
# This means we can only use pipeline parallelism without the interleaved schedule.
self.model = build_model(model_provider_func=self.model_provider_func, wrap_with_ddp=False)[0]

# Prompt tuning initialization
self.use_soft_prompts = self.cfg.get('use_soft_prompts', False)

Expand All @@ -156,12 +152,27 @@ def __init__(self, cfg: DictConfig, trainer: Trainer):
self.num_prompt_tokens = cfg.get('num_prompt_tokens', 100)

if self.cfg.get('existing_prompt_tags', None):
# Assign prompt tag ids if none were present in the config
if type(self.cfg.existing_prompt_tags[0]) == str:
existing_prompt_tags = self.cfg.existing_prompt_tags
num_prompt_tags = len(existing_prompt_tags)
existing_prompt_tags = [
(existing_prompt_tags[tag_id], tag_id + 1) for tag_id in range(num_prompt_tags)
]

with open_dict(self.cfg):
self.cfg.existing_prompt_tags = existing_prompt_tags

# Fill table with prev tuned prompt tags and their ids
self.prompt_table = set(self.cfg.existing_prompt_tags)

# Get max prompt id from table for starting point of new prompt ids
self.next_prompt_id = max(self.prompt_table, key=lambda x: x[1])[1]

# TODO: Not sure how to use lists of modules with PTL.
# This means we can only use pipeline parallelism without the interleaved schedule.
self.model = build_model(model_provider_func=self.model_provider_func, wrap_with_ddp=False)[0]

self.setup_optimizer_param_groups()

self.megatron_amp_o2 = cfg.get('megatron_amp_O2', False)
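
Note on the backward-compatibility branch in this hunk: when ``existing_prompt_tags`` holds plain strings (as in older prompt-tuned models), each tag is given an id starting at 1, and the largest id becomes the starting point for new prompts. The same logic can be sketched as a standalone function. The name ``build_prompt_table`` and the sample tags are hypothetical, and the config handling (``open_dict``) is left out.

```python
def build_prompt_table(existing_prompt_tags):
    """Return (prompt_table, next_prompt_id) from a list of tags.

    Accepts either plain tag strings (older format) or (tag, id) pairs.
    Hypothetical helper mirroring the logic in the diff above.
    """
    if isinstance(existing_prompt_tags[0], str):
        # Older format: assign ids 1..N in list order.
        existing_prompt_tags = [
            (tag, tag_id + 1) for tag_id, tag in enumerate(existing_prompt_tags)
        ]
    # Tuples are hashable, so the pairs can be stored in a set.
    prompt_table = {tuple(pair) for pair in existing_prompt_tags}
    # New prompt ids continue from the largest id already in use.
    next_prompt_id = max(prompt_table, key=lambda x: x[1])[1]
    return prompt_table, next_prompt_id


# Made-up tags, for illustration only.
table_from_strings, next_id_a = build_prompt_table(["winogrande", "boolq"])
table_from_pairs, next_id_b = build_prompt_table([("winogrande", 1), ("boolq", 2)])
```

Both input forms yield the same table, which is the point of the compatibility branch.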
@@ -662,13 +673,13 @@ def setup(self, stage=None):
init_consumed_samples = 0
self.init_consumed_samples = init_consumed_samples

# Initalize soft prompts before loading datasets and training
if self.use_soft_prompts:
self.init_new_prompts()

if stage == 'predict':
return
else:
# Initalize soft prompts before loading datasets and training
if self.use_soft_prompts:
self.init_new_prompts()

# TODO: consider adding a ModelPT guard to check if model is being restored.
# allowing restored models to optionally setup datasets
self.build_train_valid_test_datasets()
@@ -737,6 +748,9 @@ def configure_optimizers(self):
fp32_grad_accum = False
# TODO: contiguous grad bucket for fp16 is also planned to be supported
contiguous_grad_bucket = False
raise ValueError(
"fp16 training is not yet supported with O2. Please set megatron_amp_O2 to False in the model config."
)

# TODO: this should be true when not using pipeline parallelism
# we will support that for bf16 when we have async handler from apex

@@ -325,7 +325,7 @@ def build_pretraining_data_loader(self, dataset, consumed_samples):
)

def setup(self, stage=None):
resume_checkpoint_path = self.trainer.checkpoint_connector.resume_checkpoint_path
resume_checkpoint_path = self.trainer.checkpoint_connector.resume_from_checkpoint_fit_path
if resume_checkpoint_path:
try:
init_consumed_samples = int(

@@ -125,12 +125,12 @@ def build_train_valid_test_datasets(self):
seed=self._cfg.seed,
skip_warmup=self._cfg.data.skip_warmup,
dataset_type=self._cfg.data.get('dataset_type', 't5'),
max_ngram_size=self._cfg.get('max_ngram_size', 10),
mean_ngram_size=self._cfg.get('mean_ngram_size', None),
geometric_dist=self._cfg.get('geometric_dist', True),
permutation=self._cfg.get('permutation', False),
whole_word_masking=self._cfg.get('whole_word_masking', True),
favor_long_ngrams=self._cfg.get('favor_long_ngrams', False),
max_ngram_size=self._cfg.data.get('max_ngram_size', 10),
mean_ngram_size=self._cfg.data.get('mean_ngram_size', None),
geometric_dist=self._cfg.data.get('geometric_dist', True),
permutation=self._cfg.data.get('permutation', False),
whole_word_masking=self._cfg.data.get('whole_word_masking', True),
favor_long_ngrams=self._cfg.data.get('favor_long_ngrams', False),
)
logging.info(f'Length of train dataset: {len(self._train_ds)}')
logging.info(f'Length of val dataset: {len(self._validation_ds)}')
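
Note on the ``self._cfg.get`` to ``self._cfg.data.get`` change in this hunk: a lookup with a default at the wrong nesting level does not fail, it silently returns the default. The sketch below shows the effect with plain dictionaries standing in for the model config. The config values are made up, and it is an assumption (implied by the fix) that these options live under the ``data`` section.

```python
# Stand-in for the model config; values are made up for illustration.
cfg = {
    "seed": 1234,
    "data": {
        "max_ngram_size": 3,
        "whole_word_masking": False,
    },
}

# Before the fix: the key is looked up at the top level, is not found,
# and the default is returned without any error.
ngram_before = cfg.get("max_ngram_size", 10)
wwm_before = cfg.get("whole_word_masking", True)

# After the fix: the key is looked up under "data", so the configured
# value is used.
ngram_after = cfg["data"].get("max_ngram_size", 10)
wwm_after = cfg["data"].get("whole_word_masking", True)
```

In this example the pre-fix lookups would yield the defaults (10 and True) even though the config sets 3 and False.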
2 changes: 1 addition & 1 deletion nemo/core/optim/optimizer_with_main_params.py
@@ -126,7 +126,7 @@ def __init__(
'which is supposed to be accumulated after grad op.'
)
assert contiguous_grad_bucket, (
'currently async_grad_allreduce is supported only ' 'with async_grad_allreduce.'
'currently async_grad_allreduce is supported only ' 'with contiguous_grad_bucket.'
)

self._fp32_grad_accum = fp32_grad_accum
5 changes: 2 additions & 3 deletions tools/text_processing_deployment/Dockerfile
@@ -31,9 +31,8 @@ RUN apt-get install build-essential -y && apt-get install wget -y
RUN wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
RUN tar xzvf protobuf-2.5.0.tar.gz
RUN cd protobuf-2.5.0 && ./configure && make && make install && ldconfig
COPY ../../nemo_text_processing/ /tmp/nemo/nemo_text_processing/
RUN bash /tmp/nemo/nemo_text_processing/setup.sh
RUN conda install -c conda-forge thrax=1.3.4 -y
RUN git clone https://github.com/yzhang123/sparrowhawk.git
RUN cd sparrowhawk && git checkout test && apt-get install -y autoconf && bash autoreconf && ./configure && make && make install && ldconfig
RUN git clone https://github.com/kward/shunit2.git
RUN echo "DONE"
RUN echo "DONE"
