Can't reproduce 10B-parameter GPT-2 training, CUDA out of memory #477
@ollmer Can you clarify the GPU count and RAM size?
It runs on 32 V100 GPUs with 32 GB of RAM each.
Also worth mentioning that it's fp16 mode, so I expect the optimizer state to require 120 GB of RAM, which should be about 4 GB per GPU in ZeRO stage 2 mode.
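For readers following the arithmetic, a rough back-of-the-envelope consistent with these numbers (assuming Adam-style fp16 mixed precision, i.e. about 12 bytes of optimizer state per parameter: fp32 master weights plus the two Adam moments):

```latex
\underbrace{10^{10}\ \text{params} \times 12\ \text{B}}_{\text{optimizer state}} = 120\ \text{GB},
\qquad
\frac{120\ \text{GB}}{32\ \text{GPUs}} \approx 3.75\ \text{GB per GPU under ZeRO stage 2}
```

Note that the fp16 parameters themselves (about 20 GB for a 10B model) are still replicated on every GPU under ZeRO-2, so activations and temporary buffers have to fit in whatever remains of the 32 GB.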
Can you please share your training log?
@ollmer Just wanted to follow up on whether you could share your training log. Thanks.
Sorry for the delay. Here it is.
Thanks for sharing the log. Can you try the following?
I've tried it. Still got OOM.
Thanks for sharing the log. I will try to repro this issue on my side.
@ollmer the OOM you are seeing should be resolved by enabling activation_checkpointing in addition to enabling zero_offload.
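For later readers, a minimal sketch of a DeepSpeed config with both knobs turned on. Key names follow the DeepSpeed JSON schema of that era; the batch value is a placeholder rather than the reporter's actual config, and newer releases spell ZeRO-Offload as `offload_optimizer` inside `zero_optimization` rather than the `cpu_offload` boolean used here:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false
  }
}
```

The `activation_checkpointing` section only configures the behavior; the model code still has to route its layer forwards through DeepSpeed's checkpointing API (the Megatron-LM example scripts expose this via their activation-checkpointing flags), otherwise nothing is recomputed.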
Hi @samyam!
I wonder what the maximum model size is that can be trained without checkpointing, since it slows things down a lot. The tutorial seems to assume that a 10B model can be trained without checkpointing.
@ollmer I see now. Yes, on 32 V100 GPUs ZeRO-2 should work without offload, and yes, you need to turn on activation checkpointing. Models beyond a few billion parameters all need activation checkpointing enabled in addition to ZeRO-2 or ZeRO-Offload, whichever you might be using. We will clarify that in the tutorial. Activation checkpointing will have about 33% overhead, often lower. Are you seeing a much larger overhead?
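As a rough sanity check on the 33% figure (standard accounting for recomputation, not specific to DeepSpeed): a backward pass costs roughly twice the FLOPs of a forward pass, and full activation checkpointing re-runs the forward during backward, so

```latex
\text{overhead} \;\approx\; \frac{\text{recomputed forward}}{\text{forward} + \text{backward}}
             \;\approx\; \frac{1}{1 + 2} \;\approx\; 33\%
```

In practice it is often lower, since only a subset of activations needs to be recomputed.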
Thanks for the clarification! I'll measure the exact slowdown numbers and post them here later.
The software issue has been solved.
Hi!
I'm trying to run 10B-parameter model training on 32 V100 GPUs, following the tutorial https://www.deepspeed.ai/tutorials/zero/#training-a-10b-parameter-gpt-2-model
But I consistently get CUDA out of memory, even after reducing the model size to 6B parameters and the batch size to 1.
Here are my configs:
deepspeed config
The hardware setup is 2 DGX-2 nodes.
Do I need to turn on activation checkpointing to make it work, or are there some other caveats?
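As an aside, for readers wondering where a "10B parameter" GPT-2 comes from: the usual rough count for a decoder-only transformer is about 12·L·H² parameters in the transformer blocks; the layer and hidden-size values below are illustrative, not necessarily the tutorial's exact flags.

```latex
N \;\approx\; 12\,L\,H^{2}
\qquad\text{e.g. } L = 50,\ H = 4096:\quad
12 \times 50 \times 4096^{2} \;\approx\; 1.0\times10^{10}\ \text{parameters}
```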