Raise error if FP16 training is tried with O2 recipe. #3806

ericharper · 2022-03-07T21:10:00Z

What does this PR do ?

Raise an error if fp16 training is attempted with the O2 recipe, which is not yet supported.

Collection: [NLP]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Signed-off-by: ericharper <[email protected]>

lgtm-com · 2022-03-07T21:25:43Z

This pull request introduces 2 alerts when merging a448013 into 6ee44c2 - view on LGTM.com

new alerts:

2 for Unused local variable

MaximumEntropy · 2022-03-07T22:28:33Z

nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py

@@ -783,6 +783,7 @@ def configure_optimizers(self):
                fp32_grad_accum = False
                # TODO: contiguous grad bucket for fp16 is also planned to be supported
                contiguous_grad_bucket = False
+                raise ValueError("fp16 trained is not yet supported with O2. Please use O1.")


I think you mean "training" and not "trained"? Also, from a user's point of view, we should probably say that it is not supported with megatron_amp_O2=True and ask them to set it to False, or something like that.
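To make the suggestion concrete, here is a minimal stand-alone sketch of such a guard. It is not the NeMo implementation: the function name `check_o2_precision`, its arguments, and the exact message wording are hypothetical, and it assumes fp16 is requested as a precision value of 16.

```python
def check_o2_precision(precision, megatron_amp_O2):
    """Hypothetical stand-alone guard; illustrative only, not the NeMo code."""
    # Only the O2 recipe is restricted; with megatron_amp_O2=False nothing is checked.
    if not megatron_amp_O2:
        return
    # Assumption: fp16 is requested as precision 16 (int or string).
    if precision in (16, "16"):
        raise ValueError(
            "fp16 training is not yet supported with megatron_amp_O2=True. "
            "Please set megatron_amp_O2 to False."
        )


# Usage: the O2 + fp16 combination is rejected with an actionable message.
try:
    check_o2_precision(16, megatron_amp_O2=True)
except ValueError as err:
    print(err)
```

Naming the config flag in the message tells the user which setting to change, which is the point of the review comment above.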

Signed-off-by: ericharper <[email protected]>

MaximumEntropy

LGTM!

lgtm-com · 2022-03-07T22:48:14Z

This pull request introduces 2 alerts when merging aa3d451 into 6ee44c2 - view on LGTM.com

new alerts:

2 for Unused local variable

lgtm-com · 2022-03-07T23:18:23Z

This pull request introduces 2 alerts when merging 7e95dc0 into 836d813 - view on LGTM.com

new alerts:

2 for Unused local variable

* Tn bug 1.7.0 (#3730)
* fix es and fr bug Signed-off-by: Yang Zhang <[email protected]>
* add file Signed-off-by: Yang Zhang <[email protected]>
* [TTS] Fix bugs in E2E TTS, Mixer-TTS and FastPitch (#3740)
* fix bugs Signed-off-by: Oktai Tatanov <[email protected]>
* fix bug in e2e tts and mixer tts Signed-off-by: Oktai Tatanov <[email protected]>
* Mirror AN4 data while servers are down (#3743) Signed-off-by: smajumdar <[email protected]>
* Bugfix for GPT eval (#3744)
* use tokens_cut not tokens Signed-off-by: ericharper <[email protected]>
* remove precision conversion and comment jit for bias gelu Signed-off-by: ericharper <[email protected]>
* revert comment update mbs in config Signed-off-by: ericharper <[email protected]>
* calculate micro_batch_size during complete and compute_logprobs Signed-off-by: ericharper <[email protected]>
* ASR SSL update (#3746)
* ssl update Signed-off-by: sam1373 <[email protected]>
* tutorial update Signed-off-by: sam1373 <[email protected]>
* Fix SSL configs for 1.7 (#3748)
* ssl update Signed-off-by: sam1373 <[email protected]>
* tutorial update Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* punct process bug fix (#3747) Signed-off-by: ekmb <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]>
* updated conformer models. (#3741) Signed-off-by: Vahid <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]>
* Yuya/megatron t5 glue eval (#3751)
* Add megatron t5 glue eval-only script Signed-off-by: Yu Yao <[email protected]>
* Update megatron t5 glue eval default configs Signed-off-by: Yu Yao <[email protected]>
* Update megatron t5 glue eval configs Signed-off-by: Yu Yao <[email protected]>
* Update config comments Signed-off-by: Yu Yao <[email protected]> Co-authored-by: Yu Yao <[email protected]>
* Specify gpus in SSL notebook (#3753)
* ssl update Signed-off-by: sam1373 <[email protected]>
* tutorial update Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* specify gpus Signed-off-by: sam1373 <[email protected]>
* Duplex model inference fix, money encoder fix (#3754) Signed-off-by: ekmb <[email protected]>
* Update docs for RNNT and overriding fused batch size (#3755) Signed-off-by: smajumdar <[email protected]>
* fix consumed samples calculation + PTune Model bugs (#3738)
* fix the way computing consumed samples Signed-off-by: Yi Dong <[email protected]>
* fixed ptune model Signed-off-by: Yi Dong <[email protected]>
* make sure notebook is working Signed-off-by: Yi Dong <[email protected]>
* added try-catch Signed-off-by: Yi Dong <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Eric Harper <[email protected]>
* fix directories in ssl notebook (#3758)
* ssl update Signed-off-by: sam1373 <[email protected]>
* tutorial update Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* specify gpus Signed-off-by: sam1373 <[email protected]>
* update dirs Signed-off-by: sam1373 <[email protected]>
* TN docs update (#3735)
* TN docs update: audio based docs added, quick start, ref fixed, etc Signed-off-by: ekmb <[email protected]>
* add deployment script dir and Sp TN Signed-off-by: ekmb <[email protected]> Co-authored-by: Yang Zhang <[email protected]>
* Update Tacotron2_Training.ipynb (#3769) Signed-off-by: Jason <[email protected]>
* fix dockerfile (#3778) Signed-off-by: Yang Zhang <[email protected]>
* Prompt-Tuning-Documentation (#3777)
* Update megatron.rst
* Updated example prompt tuning script's doc string
* Update megatron.rst
* Update megatron.rst Co-authored-by: Eric Harper <[email protected]>
* Prompt tuning bug fix (#3780)
* Making updated code backwards compatible with previous prompt tuned models Signed-off-by: Virginia Adams <[email protected]>
* Fixed backward compatiablity bug Signed-off-by: Virginia Adams <[email protected]>
* Removed random import Signed-off-by: Virginia Adams <[email protected]> Co-authored-by: Eric Harper <[email protected]>
* update branch Signed-off-by: ericharper <[email protected]>
* revert changes (#3785) Signed-off-by: Yang Zhang <[email protected]>
* Fixed soft prompt eval loading bug (#3805) Signed-off-by: Virginia Adams <[email protected]>
* mT5 whole word masking and T5 finetuning config fixes (#3776)
* O2 and whole word masking changes Signed-off-by: MaximumEntropy <[email protected]>
* Style Signed-off-by: MaximumEntropy <[email protected]>
* Update yaml Signed-off-by: MaximumEntropy <[email protected]>
* Tok and O2 fix Signed-off-by: MaximumEntropy <[email protected]>
* Fix arg passing Signed-off-by: MaximumEntropy <[email protected]>
* Fix checkpoint path Signed-off-by: MaximumEntropy <[email protected]>
* Style fixes Signed-off-by: MaximumEntropy <[email protected]>
* Raise error if FP16 training is tried with O2 recipe. (#3806)
* raise error Signed-off-by: ericharper <[email protected]>
* update assert Signed-off-by: ericharper <[email protected]>
* update error message Signed-off-by: ericharper <[email protected]>
* update error message Signed-off-by: ericharper <[email protected]>
* update branch Signed-off-by: ericharper <[email protected]>
* remove test Signed-off-by: ericharper <[email protected]>
* revert bad merges Signed-off-by: ericharper <[email protected]>
* revert change partitions Signed-off-by: ericharper <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: Oktai Tatanov <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Samuel Kriman <[email protected]>
Co-authored-by: Evelina <[email protected]>
Co-authored-by: Vahid Noroozi <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Yu Yao <[email protected]>
Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Jason <[email protected]>
Co-authored-by: Virginia Adams <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

ericharper added 3 commits March 7, 2022 14:06

raise error

93d1201

Signed-off-by: ericharper <[email protected]>

update assert

b8c2430

Signed-off-by: ericharper <[email protected]>

Merge branch 'r1.7.1' into gpt_fp16_o2

a448013

ericharper requested a review from MaximumEntropy March 7, 2022 21:27

MaximumEntropy suggested changes Mar 7, 2022


ericharper added 2 commits March 7, 2022 15:32

update error message

57a1128

Signed-off-by: ericharper <[email protected]>

update error message

aa3d451

Signed-off-by: ericharper <[email protected]>

MaximumEntropy self-requested a review March 7, 2022 22:36

MaximumEntropy approved these changes Mar 7, 2022


Merge branch 'r1.7.1' into gpt_fp16_o2

7e95dc0

ericharper merged commit d5ad011 into r1.7.1 Mar 7, 2022

ericharper deleted the gpt_fp16_o2 branch March 7, 2022 23:52
