Fix type checking of offload optimizer before checked class was imported #463

ollmer · 2020-10-07T13:40:07Z

This bug prevents the Megatron-LM 10B offload training example from running.
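The diff and the error log are not reproduced in this thread. As a rough illustration only, the kind of failure the title describes (a type check that runs before the checked class has been imported) can be sketched as below. Every name here (`get_offload_class`, `OffloadAdam`, `check_buggy`, `check_fixed`) is hypothetical and is not taken from DeepSpeed or Megatron-LM.

```python
# Minimal sketch with hypothetical names; not the code touched by this PR.

_OFFLOAD_CLASS = None  # cache so repeated "imports" return the same class


def get_offload_class():
    """Stand-in for a lazy import of an offload optimizer class."""
    global _OFFLOAD_CLASS
    if _OFFLOAD_CLASS is None:
        class OffloadAdam:  # hypothetical optimizer class
            pass
        _OFFLOAD_CLASS = OffloadAdam
    return _OFFLOAD_CLASS


def check_buggy(optimizer, offload_enabled):
    if offload_enabled:
        OffloadAdam = get_offload_class()
    # Bug: OffloadAdam is bound only when offload_enabled is True, so this
    # line raises UnboundLocalError (a NameError subclass) otherwise.
    return isinstance(optimizer, OffloadAdam)


def check_fixed(optimizer, offload_enabled):
    # Return early when the class is not needed, and bind the class
    # before using it in the type check.
    if not offload_enabled:
        return False
    OffloadAdam = get_offload_class()
    return isinstance(optimizer, OffloadAdam)
```

In this sketch the fix is simply an ordering change: the class is bound on every path that reaches the `isinstance` call. Whether the actual patch takes this shape is an assumption.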

tjruwase · 2020-10-07T18:38:18Z

@ollmer Thanks for creating this PR. Can you please share a log of the error that this fixes?

rocm-mici · 2022-06-09T20:20:14Z

Can one of the admins verify this patch?

tjruwase · 2022-07-29T17:46:19Z

@ollmer, is this needed?

* Merge chatgpt v2 to v3 - finalized (#484)
* [squash] staging chatgpt v1 (#463)
Co-authored-by: Reza Yazdani
Co-authored-by: yaozhewei
Co-authored-by: Tunji Ruwase
* [partial] formatting fixes
* quantizer fixes
* fix for bert tests
* formatting fixes
* re-enable _param_slice_mappings in z2
* Enable the QKV requires_grad when in training mode (#466)
Co-authored-by: Jeff Rasley
* fixes for attention enable_training flag
* commit to trigger CI
* fix for distil-bert param
* fixes for training context errors
* remove reza's qkv-optimization (#469)
Co-authored-by: Jeff Rasley
* Chatgpt - Fuse lora params at HybridEngine (#472)
Co-authored-by: Jeff Rasley
* add option to enable non-pin mode (#473)
* Chatgpt - fuse lora non pinned case (#474)
* Fix fuse/unfuse lora for Z3 and non-pinned parameter
* unfuse_lora_weight for non-pinned case
* fix the multiple issue for lora parameters
* formatting
* fuse lora only when available
---------
Co-authored-by: Jeff Rasley
* Chatgpt/release inference cache (#475)
* Fix fuse/unfuse lora for Z3 and non-pinned parameter
* unfuse_lora_weight for non-pinned case
* release/retake the inference cache after/before generate
* remove duplicated _fuse_lora function
* fix formatting
* fix hybrid-engine config issue
* update formatting
* Chatgpt - fuse qkv v2 (#478)
Co-authored-by: Jeff Rasley
* ChatGPT: Refactor Hybrid Engine Config (#477)
Co-authored-by: Lok Chand Koppaka
* Inference Workspace Tweaks (#481)
* Safety checks around inference workspace allocation, extra flushing
* Formatting fixes
* Merge fix
* Chatgpt/inference tp (#480)
* Update the merged-QKV weights only if there is difference with the model parameter
* remove the hard-coded size
* always reset qkv params to updated ones after running step
* Add the infernce-tp group and tensor sharding to run inference in model-parallel mode
* optimize the gather/mp-sharding part
* Add hybrid_engine changes
* fix config issue
* Formatting fixes. Reset_qkv duplicate removal.
* fix bloom container.
* fix format.
---------
Co-authored-by: Ammar Ahmad Awan
Co-authored-by: Lok Chand Koppaka
* fix formatting
* more clean-up
---------
Co-authored-by: Jeff Rasley
Co-authored-by: yaozhewei
Co-authored-by: Tunji Ruwase
Co-authored-by: Masahiro Tanaka
Co-authored-by: Michael Wyatt
Co-authored-by: Lok Chand Koppaka
Co-authored-by: Connor Holmes
Co-authored-by: Ammar Ahmad Awan
* fix a bug on lora-fusion (#487)
* Cholmes/v3 workspace bugfixes (#488)
* Miscellaneous workspace fixes, new config param
* Fix typo
---------
Co-authored-by: Reza Yazdani
Co-authored-by: Jeff Rasley
Co-authored-by: yaozhewei
Co-authored-by: Tunji Ruwase
Co-authored-by: Masahiro Tanaka
Co-authored-by: Michael Wyatt
Co-authored-by: Lok Chand Koppaka
Co-authored-by: Connor Holmes

awan-10 · 2023-08-25T16:54:19Z

Closing this very old PR as it is no longer relevant (Megatron-LM has been deprecated and Megatron-DeepSpeed has a different codebase altogether). Please reopen if needed.

Fix type checking of offload optimizer before checked class was imported

c1aec84

ollmer requested review from arashashari, awan-10, cli99, conglongli, eltonzheng, jeffra, minjiaz, niumanar, RezaYazdaniAminabadi, samyam, ShadenSmith and tjruwase as code owners October 7, 2020 13:40

Merge branch 'master' into master

d0dcc76

jeffra requested a review from mrwyattii as a code owner June 23, 2023 21:31

awan-10 self-assigned this Aug 18, 2023

awan-10 closed this Aug 25, 2023
