Can't reproduce 10B-parameter GPT-2 training, CUDA out of memory #477
@ollmer Can you clarify the GPU count and RAM size?
It runs on 32 V100 GPUs with 32 GB of RAM each.
Also worth mentioning that it's fp16 mode, so I expect the optimizer state to require 120 GB of RAM, which should be about 4 GB per GPU in ZeRO stage 2 mode.
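For readers following the arithmetic, a rough back-of-the-envelope consistent with these numbers (assuming Adam-style fp16 mixed precision, i.e. about 12 bytes of optimizer state per parameter: fp32 master weights plus the two Adam moments):

```latex
\underbrace{10^{10}\ \text{params} \times 12\ \text{B}}_{\text{optimizer state}} = 120\ \text{GB},
\qquad
\frac{120\ \text{GB}}{32\ \text{GPUs}} \approx 3.75\ \text{GB per GPU under ZeRO stage 2}
```

Note that the fp16 parameters themselves (about 20 GB for a 10B model) are still replicated on every GPU under ZeRO-2, so activations and temporary buffers have to fit in whatever remains of the 32 GB.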
Can you please share your training log?
@ollmer Just wanted to follow up on whether you could share your training log. Thanks.
Sorry for the delay. Here it is.
Thanks for sharing the log. Can you try the following?
I've tried it. Still got OOM.
Thanks for sharing the log. I will try to repro this issue on my side.
@ollmer the OOM you are seeing should be resolved by enabling activation_checkpointing in addition to enabling zero_offload.
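For later readers, a minimal sketch of a DeepSpeed config with both knobs turned on. Key names follow the DeepSpeed JSON schema of that era; the batch value is a placeholder rather than the reporter's actual config, and newer releases spell ZeRO-Offload as `offload_optimizer` inside `zero_optimization` rather than the `cpu_offload` boolean used here:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false
  }
}
```

The `activation_checkpointing` section only configures the behavior; the model code still has to route its layer forwards through DeepSpeed's checkpointing API (the Megatron-LM example scripts expose this via their activation-checkpointing flags), otherwise nothing is recomputed.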
Hi @samyam!
I wonder what the maximum model size is that can be trained without checkpointing, since it slows things down a lot. The tutorial seems to assume that a 10B model can be trained without checkpointing.
@ollmer I see now. Yes, on 32 V100 GPUs ZeRO-2 should work without offload, and yes, you need to turn on activation checkpointing. Models beyond a few billion parameters all need activation checkpointing enabled in addition to ZeRO-2 or ZeRO-Offload, whichever you might be using. We will clarify that in the tutorial. Activation checkpointing will have about 33% overhead, often lower. Are you seeing a much larger overhead?
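As a rough sanity check on the 33% figure (standard accounting for recomputation, not specific to DeepSpeed): a backward pass costs roughly twice the FLOPs of a forward pass, and full activation checkpointing re-runs the forward during backward, so

```latex
\text{overhead} \;\approx\; \frac{\text{recomputed forward}}{\text{forward} + \text{backward}}
             \;\approx\; \frac{1}{1 + 2} \;\approx\; 33\%
```

In practice it is often lower, since only a subset of activations needs to be recomputed.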
Thanks for the clarification! I'll measure the exact slowdown numbers and post them here later.
The software issue has been solved.
Hi!
I'm trying to run 10B-parameter model training on 32 V100 GPUs, following the tutorial https://www.deepspeed.ai/tutorials/zero/#training-a-10b-parameter-gpt-2-model
But I consistently get CUDA out of memory, even after reducing the model size to 6B parameters and the batch size to 1.
Here are my configs:
deepspeed config
The hardware setup is 2 DGX-2 nodes.
Do I need to turn on activation checkpointing to make it work, or are there some other caveats?
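As an aside, for readers wondering where a "10B parameter" GPT-2 comes from: the usual rough count for a decoder-only transformer is about 12·L·H² parameters in the transformer blocks; the layer and hidden-size values below are illustrative, not necessarily the tutorial's exact flags.

```latex
N \;\approx\; 12\,L\,H^{2}
\qquad\text{e.g. } L = 50,\ H = 4096:\quad
12 \times 50 \times 4096^{2} \;\approx\; 1.0\times10^{10}\ \text{parameters}
```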