Raise error if FP16 training is tried with O2 recipe. #3806
Signed-off-by: ericharper <[email protected]>
This pull request introduces 2 alerts when merging a448013 into 6ee44c2 - view on LGTM.com new alerts:
@@ -783,6 +783,7 @@ def configure_optimizers(self):
 fp32_grad_accum = False
 # TODO: contiguous grad bucket for fp16 is also planned to be supported
 contiguous_grad_bucket = False
+raise ValueError("fp16 trained is not yet supported with O2. Please use O1.")
I think you mean "training" and not "trained"? Also, from a user's point of view, the message should probably say something like "is not supported with megatron_amp_O2=True, please set it to False".
Done.
LGTM!
This pull request introduces 2 alerts when merging aa3d451 into 6ee44c2 - view on LGTM.com new alerts:
This pull request introduces 2 alerts when merging 7e95dc0 into 836d813 - view on LGTM.com new alerts:
* Tn bug 1.7.0 (#3730)
* fix es and fr bug Signed-off-by: Yang Zhang <[email protected]>
* add file Signed-off-by: Yang Zhang <[email protected]>
* [TTS] Fix bugs in E2E TTS, Mixer-TTS and FastPitch (#3740)
* fix bugs Signed-off-by: Oktai Tatanov <[email protected]>
* fix bug in e2e tts and mixer tts Signed-off-by: Oktai Tatanov <[email protected]>
* Mirror AN4 data while servers are down (#3743) Signed-off-by: smajumdar <[email protected]>
* Bugfix for GPT eval (#3744)
* use tokens_cut not tokens Signed-off-by: ericharper <[email protected]>
* remove precision conversion and comment jit for bias gelu Signed-off-by: ericharper <[email protected]>
* revert comment update mbs in config Signed-off-by: ericharper <[email protected]>
* calculate micro_batch_size during complete and compute_logprobs Signed-off-by: ericharper <[email protected]>
* ASR SSL update (#3746)
* ssl update Signed-off-by: sam1373 <[email protected]>
* tutorial update Signed-off-by: sam1373 <[email protected]>
* Fix SSL configs for 1.7 (#3748)
* ssl update Signed-off-by: sam1373 <[email protected]>
* tutorial update Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* punct process bug fix (#3747) Signed-off-by: ekmb <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]>
* updated conformer models. (#3741) Signed-off-by: Vahid <[email protected]> Co-authored-by: Somshubra Majumdar <[email protected]>
* Yuya/megatron t5 glue eval (#3751)
* Add megatron t5 glue eval-only script Signed-off-by: Yu Yao <[email protected]>
* Update megatron t5 glue eval default configs Signed-off-by: Yu Yao <[email protected]>
* Update megatron t5 glue eval configs Signed-off-by: Yu Yao <[email protected]>
* Update config comments Signed-off-by: Yu Yao <[email protected]> Co-authored-by: Yu Yao <[email protected]>
* Specify gpus in SSL notebook (#3753)
* ssl update Signed-off-by: sam1373 <[email protected]>
* tutorial update Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* specify gpus Signed-off-by: sam1373 <[email protected]>
* Duplex model inference fix, money encoder fix (#3754) Signed-off-by: ekmb <[email protected]>
* Update docs for RNNT and overriding fused batch size (#3755) Signed-off-by: smajumdar <[email protected]>
* fix consumed samples calculation + PTune Model bugs (#3738)
* fix the way computing consumed samples Signed-off-by: Yi Dong <[email protected]>
* fixed ptune model Signed-off-by: Yi Dong <[email protected]>
* make sure notebook is working Signed-off-by: Yi Dong <[email protected]>
* added try-catch Signed-off-by: Yi Dong <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Eric Harper <[email protected]>
* fix directories in ssl notebook (#3758)
* ssl update Signed-off-by: sam1373 <[email protected]>
* tutorial update Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* revert configs Signed-off-by: sam1373 <[email protected]>
* specify gpus Signed-off-by: sam1373 <[email protected]>
* update dirs Signed-off-by: sam1373 <[email protected]>
* TN docs update (#3735)
* TN docs update: audio based docs added, quick start, ref fixed, etc Signed-off-by: ekmb <[email protected]>
* add deployment script dir and Sp TN Signed-off-by: ekmb <[email protected]> Co-authored-by: Yang Zhang <[email protected]>
* Update Tacotron2_Training.ipynb (#3769) Signed-off-by: Jason <[email protected]>
* fix dockerfile (#3778) Signed-off-by: Yang Zhang <[email protected]>
* Prompt-Tuning-Documentation (#3777)
* Update megatron.rst
* Updated example prompt tuning script's doc string
* Update megatron.rst
* Update megatron.rst Co-authored-by: Eric Harper <[email protected]>
* Prompt tuning bug fix (#3780)
* Making updated code backwards compatible with previous prompt tuned models Signed-off-by: Virginia Adams <[email protected]>
* Fixed backward compatiablity bug Signed-off-by: Virginia Adams <[email protected]>
* Removed random import Signed-off-by: Virginia Adams <[email protected]> Co-authored-by: Eric Harper <[email protected]>
* update branch Signed-off-by: ericharper <[email protected]>
* revert changes (#3785) Signed-off-by: Yang Zhang <[email protected]>
* Fixed soft prompt eval loading bug (#3805) Signed-off-by: Virginia Adams <[email protected]>
* mT5 whole word masking and T5 finetuning config fixes (#3776)
* O2 and whole word masking changes Signed-off-by: MaximumEntropy <[email protected]>
* Style Signed-off-by: MaximumEntropy <[email protected]>
* Update yaml Signed-off-by: MaximumEntropy <[email protected]>
* Tok and O2 fix Signed-off-by: MaximumEntropy <[email protected]>
* Fix arg passing Signed-off-by: MaximumEntropy <[email protected]>
* Fix checkpoint path Signed-off-by: MaximumEntropy <[email protected]>
* Style fixes Signed-off-by: MaximumEntropy <[email protected]>
* Raise error if FP16 training is tried with O2 recipe. (#3806)
* raise error Signed-off-by: ericharper <[email protected]>
* update assert Signed-off-by: ericharper <[email protected]>
* update error message Signed-off-by: ericharper <[email protected]>
* update error message Signed-off-by: ericharper <[email protected]>
* update branch Signed-off-by: ericharper <[email protected]>
* remove test Signed-off-by: ericharper <[email protected]>
* revert bad merges Signed-off-by: ericharper <[email protected]>
* revert change partitions Signed-off-by: ericharper <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: Oktai Tatanov <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Samuel Kriman <[email protected]>
Co-authored-by: Evelina <[email protected]>
Co-authored-by: Vahid Noroozi <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Yu Yao <[email protected]>
Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Jason <[email protected]>
Co-authored-by: Virginia Adams <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
What does this PR do?
Raise an error if fp16 training is attempted with the O2 recipe (megatron_amp_O2=True). This combination is not yet supported.
Collection: [NLP]
Changelog
Usage
# Add a code snippet demonstrating how to use this
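A minimal sketch of the behavior this PR describes, written as a stand-alone function so it runs without NeMo. The function name `check_amp_recipe`, its `precision` parameter, and the exact error wording are assumptions made for illustration; only the `megatron_amp_O2` flag and the `ValueError` come from this PR. The sketch also assumes that bf16 remains usable with O2.

```python
def check_amp_recipe(precision, megatron_amp_O2):
    """Hypothetical stand-alone version of the guard added in this PR.

    Raises ValueError when fp16 is combined with the O2 recipe.
    """
    # Accept both 16 and "16" as fp16; this normalization is an assumption.
    is_fp16 = str(precision) == "16"
    if megatron_amp_O2 and is_fp16:
        raise ValueError(
            "fp16 training is not yet supported with megatron_amp_O2=True. "
            "Please set megatron_amp_O2 to False."
        )


# bf16 with O2 passes the guard (assumed to be the supported O2 path).
check_amp_recipe(precision="bf16", megatron_amp_O2=True)

# fp16 with O2 is rejected, so fall back to megatron_amp_O2=False.
try:
    check_amp_recipe(precision=16, megatron_amp_O2=True)
except ValueError as err:
    print(f"rejected: {err}")
    check_amp_recipe(precision=16, megatron_amp_O2=False)
```

In the actual change the check lives inside `configure_optimizers`, so the error would surface when the optimizer is set up rather than at config-parsing time.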
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information