-
@tjruwase, can you help me with this?
-
The main difference is in how layers are processed. For ZeRO-3, each rank will:
For model parallelism (MP), each rank will
To summarize,
Hope that helps.
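For readers who prefer code, here is a toy single-process sketch of the distinction, not DeepSpeed or Megatron code. It assumes the commonly described behavior: under ZeRO-3 each rank stores only a shard of a layer's weights, gathers the full weights just before use, and runs the whole layer on its own batch; under tensor-slicing MP each rank keeps its shard and computes only a slice of the layer's output on the same batch. All names (`WORLD_SIZE`, `column_shard`, `all_gather_columns`, `zero3_forward`, `mp_forward`) are hypothetical, "ranks" are just list indices, and the column-wise sharding is a simplification (real ZeRO-3 partitions flattened parameters).

```python
# Toy simulation only: "ranks" are list indices, not real GPUs or processes.
WORLD_SIZE = 2

def matmul(x, w):
    """Multiply a (batch x in) matrix by an (in x out) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

# One linear layer: 2 inputs -> 4 outputs.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]

def column_shard(w, rank):
    """Return the block of output columns owned by `rank`."""
    cols = len(w[0]) // WORLD_SIZE
    return [row[rank * cols:(rank + 1) * cols] for row in w]

# In both schemes each rank permanently stores only its own shard.
shards = [column_shard(W, r) for r in range(WORLD_SIZE)]

def all_gather_columns(parts):
    """Concatenate per-rank column blocks row by row (stands in for an all-gather)."""
    return [sum((p[i] for p in parts), []) for i in range(len(parts[0]))]

def zero3_forward(local_batches):
    """ZeRO-3 style: gather the WEIGHTS, then run the full layer on each rank's own data."""
    outputs = []
    for rank in range(WORLD_SIZE):
        full_w = all_gather_columns(shards)                  # communicate parameters
        outputs.append(matmul(local_batches[rank], full_w))  # full layer, local batch
        del full_w                                           # drop the gathered copy
    return outputs

def mp_forward(batch):
    """MP style: each rank computes a slice of the output on the SAME data."""
    partial = [matmul(batch, shards[rank]) for rank in range(WORLD_SIZE)]
    return all_gather_columns(partial)                       # communicate activations

zero3_out = zero3_forward([[[1, 0]], [[0, 1]]])  # one different batch per rank
mp_out = mp_forward([[1, 1]])                    # one shared batch
```

In this sketch ZeRO-3 moves weights between ranks and still behaves like data parallelism (different inputs per rank, full computation), whereas MP moves activations and splits the computation itself.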
-
Also, MP and ZeRO-3 are complementary.
-
Hi @tjruwase! Are there any examples of using MP + ZeRO-2/3? I couldn't find any. I was looking at the Megatron-DeepSpeed code base, and it looks like ZeRO is not used in the examples. In the DeepSpeedExamples repo, ZeRO is used but mpu is not.
-
I went through the ZeRO KDD2020 presentation. Am I right to assume that ZeRO only optimizes memory in the data parallelism case?
Also, ZeRO-3 partitions the model params across GPUs, if I am not wrong. If so, isn't this model parallelism by itself (splitting weight params across GPUs)? How is it different from other model parallelism approaches like Megatron?