Poor GPU utilization when using multi-node ZeRO-3 on small model #5488
-
After publishing Fietje, a continued pretraining of phi-2 (2.7B parameters), I got some questions about training efficiency. People rightly pointed out that GPU utilization was low, around 50%, despite good memory allocation of around 92%. The run used 4 nodes of 4x A100 80GB each. You can find these numbers in the wandb report here. The full run logs are here.

On the hardware side, that seems strange. Our nodes have dual HDR InfiniBand interconnect and NVLink intraconnect, so I would assume that is enough, or am I wrong?

On the model side, since the model is so small, we're thinking that might be the culprit in combination with using ZeRO-3. To be honest, I did not think too much about the ZeRO-3 config because I had used it before with a larger model and it worked well, but perhaps that was a mistake. Given that ZeRO-3 shards the model parameters (even if the model fits in memory?), that may lead to a lot of communication overhead. Or is ZeRO-3 smart enough not to partition the model if it fits fully on a single GPU? I used this accelerate-compatible config:
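(The exact file is not reproduced here; as a rough equivalent, a ZeRO-3 setup of this kind can be expressed as a DeepSpeed config dict and written to JSON for an `accelerate` YAML to point at via `deepspeed_config_file`. This is only a sketch — all values below are placeholders, not the settings from this run.)

```python
# Sketch of a ZeRO-3 DeepSpeed config, NOT the actual file from this run.
# Written to JSON so an accelerate YAML can reference it via `deepspeed_config_file`.
import json

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                                      # shard params, grads, and optimizer state
        "overlap_comm": True,                            # overlap all-gather/reduce with compute
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": 4,                    # placeholder
    "gradient_clipping": 1.0,                            # placeholder
    "train_micro_batch_size_per_gpu": 8,                 # placeholder
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```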
Is our hunch correct, and is ZeRO-3 not needed in this case? Does that imply that trying ZeRO-1 first is the way to go, only moving to ZeRO-2 and then ZeRO-3 if you cannot fit even a small batch size in the previous stage? And what about the balance between a full-precision optimizer like AdamW and 8-bit Adam? What should we base that decision on?
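For concreteness, this is the kind of optimizer swap being weighed (a hypothetical sketch assuming bitsandbytes is installed, not the actual training code):

```python
import torch
import bitsandbytes as bnb  # assumption: bitsandbytes is installed

model = torch.nn.Linear(4096, 4096)  # tiny stand-in for the real model

# Full-precision AdamW keeps fp32 exp_avg and exp_avg_sq: roughly 8 extra bytes per parameter.
opt_fp32 = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.1)

# 8-bit AdamW quantizes those optimizer states: roughly 2 extra bytes per parameter,
# and is reported to converge similarly in practice.
opt_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5, weight_decay=0.1)

# With the Hugging Face Trainer the same switch is a single argument:
#   TrainingArguments(..., optim="adamw_bnb_8bit")
```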
-
@BramVanroy, yes, ZeRO-3 is inappropriate for this model size and is likely responsible for the poor performance.
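A minimal sketch of what that change would look like in a DeepSpeed config (placeholder values, mirroring the illustration in the question above): ZeRO-1 shards only the optimizer state and ZeRO-2 additionally shards gradients, while parameters stay replicated, so there is no per-layer all-gather of weights during forward and backward.

```python
# Sketch with placeholder values: same config as above, but with the stage lowered.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,                           # try 1 first; move to 2 if memory is tight
        "overlap_comm": True,                 # overlap gradient reduction with backward
    },
    "gradient_accumulation_steps": 4,         # placeholder
    "gradient_clipping": 1.0,                 # placeholder
    "train_micro_batch_size_per_gpu": 8,      # placeholder
}
```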
-
@BramVanroy, unfortunately, ZeRO-3 cannot do this automatically; it partitions the parameters even when the model would fit on a single GPU.
-
@BramVanroy, if you are unable to fit your batch size, the more appropriate solutions are gradient checkpointing or gradient accumulation.
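For concreteness, a hedged sketch of those two levers using the standard Transformers APIs (the batch numbers below are illustrative, not a recommendation):

```python
# Sketch: enable gradient checkpointing and gradient accumulation with Transformers.
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
model.gradient_checkpointing_enable()   # recompute activations in backward to save memory

args = TrainingArguments(
    output_dir="out",                   # placeholder
    per_device_train_batch_size=2,      # smaller micro-batch per GPU ...
    gradient_accumulation_steps=16,     # ... accumulated back up to the same effective batch size
    bf16=True,
    gradient_checkpointing=True,        # equivalent switch via the Trainer
)
```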
-
For whoever reads this: my disastrous performance was caused by something else. I don't really remember why, but I had `CUDA_LAUNCH_BLOCKING=1` in my launching bash script, which caused the significant slowdown.
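A tiny guard along these lines (a sketch, not part of the original script) would catch that kind of leftover debugging setting before a long run:

```python
# Fail fast if the debugging env var is still set: CUDA_LAUNCH_BLOCKING=1 forces
# synchronous kernel launches and badly hurts throughput.
import os

if os.environ.get("CUDA_LAUNCH_BLOCKING") == "1":
    raise RuntimeError(
        "CUDA_LAUNCH_BLOCKING=1 is set; unset it for training, "
        "it serializes every CUDA kernel launch."
    )
```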