bp of shared parameters and experts #161
Comments
Yes, TutelDistributedOptimizer is a replacement for PyTorch DDP in that example (helloworld_ddp_tutel) that makes the whole model synchronization transparent. TutelDistributedOptimizer not only implements ZeRO optimization, but also leverages a built-in mask (_tutel_expert) to distinguish whether a parameter is shared or comes from the creation of tutel.moe.moe_layer. Note that TutelDistributedOptimizer only treats parameters created by tutel.moe.moe_layer as expert parameters. If the model never uses tutel.moe.moe_layer, there is no difference from PyTorch DDP (except that TutelDistributedOptimizer includes the ZeRO feature).
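For intuition, here is a minimal sketch of how such a mask can be used to partition parameters into the two groups. The attribute name `_tutel_expert` is taken from the comment above; the helper itself is hypothetical and only illustrates the idea:

```python
import torch

def split_parameters(model: torch.nn.Module):
    """Hypothetical helper: separate expert parameters (created by
    tutel.moe.moe_layer and tagged with the `_tutel_expert` mask)
    from ordinary shared parameters."""
    shared_params, expert_params = [], []
    for name, param in model.named_parameters():
        if getattr(param, '_tutel_expert', False):
            expert_params.append((name, param))   # kept rank-local, never all-reduced
        else:
            shared_params.append((name, param))   # synchronized / ZeRO-partitioned
    return shared_params, expert_params
```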
Thank you for your answer.
To use TutelDistributedOptimizer, which has parameter synchronization included, you should no longer wrap the model with PyTorch DDP.
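In outline, the intended usage looks something like the sketch below. The import location and constructor arguments of TutelDistributedOptimizer are assumptions for illustration only; the authoritative usage is in tutel/examples/helloworld_ddp_tutel.py:

```python
import torch
from tutel import moe as tutel_moe            # provides tutel_moe.moe_layer
# NOTE: TutelDistributedOptimizer is provided by tutel; its import path and the
# constructor call below are assumptions -- see helloworld_ddp_tutel.py.

model = build_moe_model().cuda()              # placeholder: a model containing tutel_moe.moe_layer

# Do NOT wrap `model` in torch.nn.parallel.DistributedDataParallel.
# The optimizer wrapper synchronizes the shared parameters (and applies ZeRO),
# while expert parameters created by tutel.moe.moe_layer stay rank-local.
optimizer = TutelDistributedOptimizer(model.parameters(), lr=1e-3)  # assumed signature

for x, y in data_loader:                      # placeholder data loader
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```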
I noticed the code in the Swin-Transformer repo (https://github.com/microsoft/Swin-Transformer/blob/main/main_moe.py), which uses a PyTorch optimizer and DDP to train these MoE models. Maybe there is something wrong there. Thanks a lot.
It is a version that manually distinguishes parameter types, which follows
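For reference, the general pattern of such a manual version looks roughly like the sketch below. This is not the exact Swin-Transformer code; the `skip_allreduce` attribute is the mask discussed in this thread, and everything else is an assumption:

```python
import torch
import torch.distributed as dist

def train_step(model, optimizer, x, y):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()

    # Manually average gradients of shared parameters across ranks;
    # expert parameters (marked by the mask) keep their local gradients.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if getattr(param, 'skip_allreduce', False):
            continue  # expert parameter: do not synchronize
        if param.grad is not None:
            dist.all_reduce(param.grad)
            param.grad.div_(world_size)

    optimizer.step()
    return loss
```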
Does it work by setting skip_allreduce to true in the scan function?
To use Tutel MoE with the PyTorch DDP backend, you need to not only set skip_allreduce to true in the MoE scan function, but also recollect the parameters marked by those masks and tell DDP to skip synchronizing them, as done here: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_ddp.py#L92. Otherwise, PyTorch DDP won't know they are expert parameters, so they'll be synchronized unexpectedly.
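Put together, that recipe looks roughly like the sketch below. It assumes torch.distributed is already initialized; the `scan_expert_func` argument and the `_ddp_params_and_buffers_to_ignore` attribute follow the linked helloworld_ddp.py example, but treat the exact arguments as assumptions and check that file for the real code:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from tutel import moe as tutel_moe

# 1) Mark expert parameters at construction time: the scan function tags every
#    parameter created inside the MoE layer with `skip_allreduce = True`.
model = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=1024,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 4096},
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
).cuda()

# 2) Collect the marked parameter names and tell DDP to ignore them, so only the
#    shared parameters get all-reduced (this mirrors helloworld_ddp.py#L92).
skip_names = [name for name, param in model.named_parameters()
              if getattr(param, 'skip_allreduce', False)]
model._ddp_params_and_buffers_to_ignore = skip_names

# 3) Wrap with DDP as usual; expert gradients now stay local to each rank.
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])
```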
DDP in PyTorch cannot distinguish experts from other shared parameters, so the experts may be updated with the shared (all-reduced) gradient.
TutelDistributedOptimizer seems to be an implementation of ZeRO, which does not affect the gradient. How does Tutel deal with this problem?