ZeRO and model parallelism #1911

base-y · 2022-04-25T20:26:39Z

base-y
Apr 25, 2022

I went through the ZeRO KDD2020 presentation. Am I right to assume that ZeRO only optimizes memory in the data parallelism case?

Also, if I am not wrong, ZeRO-3 partitions the model params across GPUs. If so, isn't this model parallelism by itself (splitting weight params across GPUs)? How is it different from other model parallelism approaches like Megatron?

base-y · 2022-04-30T19:38:08Z

base-y
Apr 30, 2022
Author

@tjruwase can you help me with this?

0 replies

tjruwase · 2022-05-01T13:48:41Z

tjruwase
May 1, 2022
Maintainer

The main difference is in how layers are processed.

For ZeRO-3, each rank will:

Assemble the full layer by fetching the remaining parameters from other ranks.
Process a full input using the full layer.
After processing the layer, free the memory used by the fetched parameters.

For model parallelism (MP), each rank will:

Fetch remote activations for the input from neighboring ranks.
Process the input partition using the layer partition.
Send remote activations for the next layer to the neighboring ranks.

To summarize,

MP communicates activations, while ZeRO-3 communicates parameters.
MP layers process input partitions, while ZeRO-3 layers process full inputs.

Hope that helps.

3 replies

base-y May 1, 2022
Author

Thank you for the response! So, am I right to say that in model parallelism each layer is processed on only one rank (the params for the layer are not shared with other ranks), while in ZeRO-3 every layer is reconstructed on every rank and processed in parallel on different ranks (on different minibatches)?

base-y May 1, 2022
Author

Also, is pipeline parallelism DeepSpeed's version of model parallelism? I don't see terms like model parallelism on the pipeline parallelism docs page, so I am confused.

tjruwase May 3, 2022
Maintainer

Pipeline parallelism is a different kind of model sharding, where different layers are placed on different GPUs. This is a good paper on the topic.

Also, the term 3D parallelism refers to Data Parallelism + Model Parallelism (a.k.a. Tensor Parallelism) + Pipeline Parallelism. See this paper.
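As a rough illustration of how the three dimensions compose, the sketch below maps each global rank to a (pipeline stage, data replica, tensor shard) coordinate. The rank ordering and the helper names `build_grid` and `data_parallel_group` are assumptions made for this example; DeepSpeed and Megatron define their own layouts.

```python
from itertools import product

def build_grid(pipe, data, tensor):
    """Map each global rank to (pipeline_stage, data_replica, tensor_shard)."""
    coords = product(range(pipe), range(data), range(tensor))
    return {rank: coord for rank, coord in enumerate(coords)}

def data_parallel_group(grid, rank):
    """Ranks holding the same model piece (same stage and shard) as `rank`."""
    stage, _, shard = grid[rank]
    return [r for r, (s, _, t) in grid.items() if (s, t) == (stage, shard)]

grid = build_grid(pipe=2, data=2, tensor=2)
world_size = len(grid)                 # pipe * data * tensor GPUs in total
dp_group_of_rank0 = data_parallel_group(grid, 0)
print(world_size, dp_group_of_rank0)
```

Ranks in the same data-parallel group hold identical model pieces, which is the redundancy that ZeRO removes.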

tjruwase · 2022-05-01T13:49:30Z

tjruwase
May 1, 2022
Maintainer

Also, MP and ZeRO-3 are complementary.

3 replies

base-y May 1, 2022
Author

ZeRO-3 saves memory and removes redundancy in the data parallelism case by splitting the weight params across ranks. But is ZeRO-3 helpful in the model parallelism case, where every rank's memory is already filled with its own layer's parameters and there is no redundancy (different ranks hold different parts of the model)?

tjruwase May 3, 2022
Maintainer

Yes, ZeRO-3 is still useful if data parallelism is used along with the model parallelism.
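A back-of-the-envelope estimate can make this concrete. The sketch below assumes mixed-precision Adam at 16 bytes of model state per parameter (the figure used in the ZeRO paper) and ignores activations, buffers, and fragmentation; the function name and the example sizes are hypothetical.

```python
def model_state_gb(num_params, mp_size, dp_size, zero3):
    """Approximate per-GPU model-state memory in GB (weights, grads, optimizer)."""
    bytes_per_param = 16          # fp16 weights (2) + fp16 grads (2) + fp32 Adam states (12)
    params_per_gpu = num_params / mp_size      # MP splits the model across mp_size ranks
    if zero3:
        params_per_gpu /= dp_size              # ZeRO-3 also splits across the DP replicas
    return params_per_gpu * bytes_per_param / 1e9

# Hypothetical setup: 10B parameters, 4-way MP, 8-way DP (32 GPUs).
without_zero = model_state_gb(10e9, mp_size=4, dp_size=8, zero3=False)
with_zero3 = model_state_gb(10e9, mp_size=4, dp_size=8, zero3=True)
print(without_zero, with_zero3)
```

With MP alone, every data-parallel replica repeats the same model partition; ZeRO-3 divides that repeated state across the replicas.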

base-y May 9, 2022
Author

Thanks! Can you help me with this (#1931) too?

adhithadias · 2023-06-27T16:47:28Z

adhithadias
Jun 27, 2023

Hi @tjruwase! Are there any examples of using MP + ZeRO 2/3? I couldn't find any. I was looking at the Megatron-DeepSpeed code base, and it looks like ZeRO is not used in the examples. In the DeepSpeedExamples repo, ZeRO is used but mpu is not.

3 replies

tjruwase Jun 27, 2023
Maintainer

@adhithadias, ZeRO is used by most of the Megatron-DeepSpeed examples, e.g., https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples/azure.

Keep in mind that ZeRO is enabled in the DeepSpeed engine through the ds_config settings and does not require code changes.
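For illustration, a minimal config that turns on ZeRO stage 3 might look like the sketch below. The batch size and fp16 values are placeholder assumptions, not settings taken from the Megatron-DeepSpeed examples.

```python
import json

# Minimal DeepSpeed config enabling ZeRO stage 3. The batch size and fp16
# values are illustrative assumptions, not taken from this thread.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)

# With DeepSpeed installed and a model defined, the config would typically be
# handed to the engine like this (not executed here):
#   engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
```

Changing the "stage" value is the only edit needed to switch between ZeRO stages; the training script itself stays the same.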

tjruwase Jun 27, 2023
Maintainer

Also, have you gone through the ZeRO tutorials? https://www.deepspeed.ai/tutorials/zero/

adhithadias Jun 27, 2023

@tjruwase, thank you so much for sharing these links with me. I have a few more questions for clarity.

Does that mean I only have to pass the "stage" in the DeepSpeed config alongside the mpu?

The documentation says we only need to define:
mpu – Optional: A model parallelism unit object that implements get_{model,data}_parallel_{rank,group,world_size}()
But the mpu module inside Megatron contains many other functions. Are these functions transformer-specific? Is there any example, other than the Megatron-DeepSpeed code base, that defines the mpu functions with minimal changes?

Thank you!
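For reference, an object exposing just the six methods named in that documentation line could be sketched as follows. This is a hypothetical illustration, not code from Megatron or DeepSpeed: the class name `SimpleMPU` and the layout (consecutive ranks form one model-parallel group) are assumptions, and groups are plain lists of ranks so the sketch runs without a distributed backend. A real mpu would return torch.distributed process groups instead.

```python
class SimpleMPU:
    """Hypothetical mpu sketch: consecutive ranks form one model-parallel group."""

    def __init__(self, rank, world_size, model_parallel_size):
        assert world_size % model_parallel_size == 0
        self.rank = rank
        self.world_size = world_size
        self.mp_size = model_parallel_size

    def get_model_parallel_world_size(self):
        return self.mp_size

    def get_model_parallel_rank(self):
        return self.rank % self.mp_size

    def get_model_parallel_group(self):
        # Placeholder: list of ranks; a real mpu returns a process group.
        start = self.rank - self.rank % self.mp_size
        return list(range(start, start + self.mp_size))

    def get_data_parallel_world_size(self):
        return self.world_size // self.mp_size

    def get_data_parallel_rank(self):
        return self.rank // self.mp_size

    def get_data_parallel_group(self):
        # Ranks that hold the same model partition as this rank.
        return list(range(self.rank % self.mp_size, self.world_size, self.mp_size))

# Example: rank 5 of 8 GPUs with 2-way model parallelism.
mpu = SimpleMPU(rank=5, world_size=8, model_parallel_size=2)
print(mpu.get_model_parallel_group(), mpu.get_data_parallel_group())
```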
