
Can't reproduce 10B parameters GPT-2 training, CUDA out of memory #477

Closed
ollmer opened this issue Oct 20, 2020 · 14 comments

@ollmer

ollmer commented Oct 20, 2020

Hi!
I'm trying to run 10B-parameter model training on 32 V100 GPUs, following the tutorial https://www.deepspeed.ai/tutorials/zero/#training-a-10b-parameter-gpt-2-model
But I consistently get CUDA out of memory, even after I've reduced the model size to 6B params and the batch size to 1.
Here are my configs:

       --model-parallel-size 1 \
       --num-layers 40 \
       --hidden-size 3072 \
       --num-attention-heads 24 \
       --batch-size 1 \
       --seq-length 1024 \
       --max-position-embeddings 2048 \

deepspeed config

{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 100,
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 0.0000001
  },
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 500000000
  }
}
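
For context, a config like this is typically saved as a JSON file and passed to the Megatron launch script through the deepspeed launcher. A minimal sketch, assuming the pretrain_gpt2.py script from the tutorial and a file named ds_config.json (both assumptions, not shown in the post):

       deepspeed --num_nodes 2 --num_gpus 16 pretrain_gpt2.py \
              <model args above> \
              --deepspeed \
              --deepspeed_config ds_config.json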

The hardware setup is 2 DGX-2 nodes.
Do I need to turn on activation checkpointing to make it work? Or are there some other caveats?
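
As a rough sanity check on the model size, here is a minimal sketch using the standard 12·L·h² transformer parameter approximation; the vocabulary size below is an assumption (GPT-2's default), since it is not shown in the post:

# Rough parameter-count estimate for the Megatron args above.
# Assumption: GPT-2-style vocabulary of 50257 tokens (not stated in the post).
num_layers = 40
hidden_size = 3072
vocab_size = 50257
max_position_embeddings = 2048

transformer = 12 * num_layers * hidden_size ** 2               # attention + MLP weights
embeddings = (vocab_size + max_position_embeddings) * hidden_size
print(f"~{(transformer + embeddings) / 1e9:.1f}B parameters")  # ~4.7B

Under this approximation the posted args come out near 4.7B rather than 6B; the exact count depends on biases, layer norms, and whether the embeddings are tied.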

@tjruwase
Contributor

@ollmer Can you clarify your GPU count and RAM size?

@ollmer
Author

ollmer commented Oct 21, 2020

It runs on 32 V100 GPUs with 32 GB of RAM each.

@ollmer
Author

ollmer commented Oct 21, 2020

Also worth mentioning that it's fp16 mode, so I expect the optimizer state to require 120 GB of RAM, which should be about 4 GB per GPU in ZeRO stage 2 mode.
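
As a sanity check on that 120 GB figure, a minimal sketch of the usual ZeRO accounting: with fp16 mixed-precision Adam, the optimizer keeps an fp32 master copy of the weights plus two fp32 moments, roughly 12 bytes of optimizer state per parameter, and ZeRO stage 2 partitions that state across the data-parallel GPUs:

# Optimizer-state accounting behind the 120 GB estimate above.
params = 10e9          # 10B-parameter model
bytes_per_param = 12   # fp32 master weights (4) + Adam momentum (4) + variance (4)
gpus = 32

total_state = params * bytes_per_param   # optimizer state for the whole model
per_gpu = total_state / gpus             # ZeRO-2 shards it across the GPUs
print(f"{total_state / 1e9:.0f} GB total, {per_gpu / 1e9:.2f} GB per GPU")
# -> 120 GB total, 3.75 GB per GPU

Note that fp16 weights and gradients, activations, and communication buffers come on top of this, which is where an OOM can still occur.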

@tjruwase
Contributor

Can you please share your training log?

@tjruwase
Contributor

@ollmer Just wanted to follow up whether you could share your training log? Thanks.

@ollmer
Author

ollmer commented Oct 26, 2020

Sorry for the delay. Here it is.
failed10b_train_2_cuda_oom.log

@tjruwase
Contributor

Thanks for sharing the log. Can you try the following?

  1. Disable overlap_comm
  2. Reduce allgather_bucket_size to 1e8 or 5e7 (a config sketch with both changes follows)
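
A minimal sketch of the suggested changes in the zero_optimization section of the config above, with overlap_comm disabled and allgather_bucket_size at 1e8 (5e7 being the alternative value); everything else is unchanged:

  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 50000000,
    "allgather_bucket_size": 100000000
  }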

@ollmer
Author

ollmer commented Oct 26, 2020

I've tried it. Still got OOM.
failed10b_train_3_cuda_oom.log

@tjruwase
Contributor

Thanks for sharing the log. I will try to repro this issue on my side.

@samyam
Contributor

samyam commented Oct 30, 2020

@ollmer the OOM you are seeing should be resolved by enabling activation_checkpointing in addition to enabling zero_offload.

@ollmer
Author

ollmer commented Oct 30, 2020

Hi @samyam!
I'm talking about ZeRO-2 mode, not about offload to CPU. And that was exactly my question about checkpointing in the original post:

Do I need to turn on activation checkpointing to make it work? Or are there some other caveats?

I wonder what the maximum model size is that can be trained without checkpointing, because it slows things down a lot. The tutorial reads as if a 10B model could be trained without checkpointing.

@samyam
Contributor

samyam commented Oct 30, 2020

@ollmer I see now. Yes, on 32 V100 GPUs ZeRO-2 should work without offload, and yes, you need to turn on activation checkpointing. Models beyond a few billion parameters all need activation checkpointing enabled in addition to ZeRO-2 or ZeRO-Offload, whichever you might be using. We will clarify that in the tutorial.

The activation checkpointing will have about 33% overhead (recomputing the forward pass during backward adds roughly a third of the compute), often lower. Are you seeing a much larger overhead?
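
For reference, a minimal sketch of enabling activation checkpointing in this setup, assuming the Megatron-style launch scripts from the tutorial; the section and flag names below exist in DeepSpeed/Megatron, but the specific values are illustrative assumptions.

In the DeepSpeed config:

  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "profile": false
  }

And in the Megatron launch args:

       --checkpoint-activations \
       --deepspeed-activation-checkpointing \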

@ollmer
Author

ollmer commented Oct 30, 2020

Thanks for the clarification! I'll measure the exact slowdown numbers and post them here later.

jeffra added a commit that referenced this issue Apr 11, 2023
@xiexbing xiexbing self-assigned this Sep 8, 2023
@xiexbing
Contributor

The software issue has been solved.
