Poor GPU utilization when using multi-node ZeRO-3 on small model #5488
-
After publishing Fietje, a continued pretraining of phi-2 (2.7B parameters), I got some questions about training efficiency. People rightly pointed out that GPU utilization was low, around 50%, despite good memory allocation of around 92%. The run used 4 nodes of 4x A100 80GB each. You can find these numbers in the wandb report here. The full run logs are here.

On the hardware side, that seems strange. Our nodes have dual HDR InfiniBand interconnect and NVLink intraconnect, so I would assume that is enough, or am I wrong?

On the model side, since the model is so small, we're thinking that might be the culprit in combination with using ZeRO-3. To be honest, I did not think too much about the ZeRO-3 config because I had used it before with a larger model and it worked well, but perhaps that was a mistake. Given that ZeRO-3 shards the model parameters (even if the model fits in memory?), that may lead to a lot of communication overhead. Or is ZeRO-3 smart enough not to partition the model if it fits fully on a single GPU? I used this accelerate-compatible config:
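(The exact file is not reproduced here; as a rough equivalent, a ZeRO-3 setup of this kind can be expressed as a DeepSpeed config dict and written to JSON for an `accelerate` YAML to point at via `deepspeed_config_file`. This is only a sketch — all values below are placeholders, not the settings from this run.)

```python
# Sketch of a ZeRO-3 DeepSpeed config, NOT the actual file from this run.
# Written to JSON so an accelerate YAML can reference it via `deepspeed_config_file`.
import json

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                      # shard params, grads, and optimizer state
        "overlap_comm": True,                            # overlap all-gather/reduce with compute
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": 4,                    # placeholder
    "gradient_clipping": 1.0,                            # placeholder
    "train_micro_batch_size_per_gpu": 8,                 # placeholder
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```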
Is our hunch correct, and is ZeRO-3 not needed in this case? Does that imply that trying ZeRO-1 first is the way to go, only moving to ZeRO-2 and then ZeRO-3 if you cannot fit even a small batch size in the previous stage? And what about the balance between a full-precision optimizer like AdamW and 8-bit Adam? What should we base that decision on?
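For concreteness, this is the kind of optimizer swap being weighed (a hypothetical sketch assuming bitsandbytes is installed, not the actual training code):

```python
import torch
import bitsandbytes as bnb  # assumption: bitsandbytes is installed

model = torch.nn.Linear(4096, 4096)  # tiny stand-in for the real model

# Full-precision AdamW keeps fp32 exp_avg and exp_avg_sq: roughly 8 extra bytes per parameter.
opt_fp32 = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)

# 8-bit AdamW quantizes those optimizer states: roughly 2 extra bytes per parameter,
# and is reported to converge similarly in practice.
opt_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5, weight_decay=0.1)

# With the Hugging Face Trainer the same switch is a single argument:
#   TrainingArguments(..., optim="adamw_bnb_8bit")
```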
-
@BramVanroy, yes, ZeRO-3 is inappropriate for this model size and is likely responsible for the poor performance.
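A minimal sketch of what that change would look like in a DeepSpeed config (placeholder values, mirroring the illustration in the question above): ZeRO-1 shards only the optimizer state and ZeRO-2 additionally shards gradients, while parameters stay replicated, so there is no per-layer all-gather of weights during forward and backward.

```python
# Sketch with placeholder values: same config as above, but with the stage lowered.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,                           # try 1 first; move to 2 if memory is tight
        "overlap_comm": True,                 # overlap gradient reduction with backward
    },
    "gradient_accumulation_steps": 4,         # placeholder
    "gradient_clipping": 1.0,                 # placeholder
    "train_micro_batch_size_per_gpu": 8,      # placeholder
}
```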
-
@BramVanroy, unfortunately, ZeRO-3 cannot do this automatically; it partitions the parameters even when the model would fit on a single GPU.
-
@BramVanroy, if you are unable to fit your batch size, the more appropriate solutions are gradient checkpointing or gradient accumulation.
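For concreteness, a hedged sketch of those two levers using the standard Transformers APIs (the batch numbers below are illustrative, not a recommendation):

```python
# Sketch: enable gradient checkpointing and gradient accumulation with Transformers.
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
model.gradient_checkpointing_enable()   # recompute activations in backward to save memory

args = TrainingArguments(
    output_dir="out",                   # placeholder
    per_device_train_batch_size=2,      # smaller micro-batch per GPU ...
    gradient_accumulation_steps=16,     # ... accumulated back up to the same effective batch size
    bf16=True,
    gradient_checkpointing=True,        # equivalent switch via the Trainer
)
```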
-
For whoever reads this: my disastrous performance was caused by something else. I don't really remember why, but I had `CUDA_LAUNCH_BLOCKING=1` in my launching bash script, which caused the significant slowdown.
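A tiny guard along these lines (a sketch, not part of the original script) would catch that kind of leftover debugging setting before a long run:

```python
# Fail fast if the debugging env var is still set: CUDA_LAUNCH_BLOCKING=1 forces
# synchronous kernel launches and badly hurts throughput.
import os

if os.environ.get("CUDA_LAUNCH_BLOCKING") == "1":
    raise RuntimeError(
        "CUDA_LAUNCH_BLOCKING=1 is set; unset it for training, "
        "it serializes every CUDA kernel launch."
    )
```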