Error when doing deepcopy of the model #177
If you pickle the model for single GPU, everything will be fine because AllToAll is not included in Tutel's MoE layer in that case. Is that in line with your expectation? PyTorch's NCCL operations (e.g. AllToAll) don't support pickle, so I'm afraid any MoE model that has AllToAll in its forward pass will hit the same issue. You can either ask the PyTorch community to fix it, implement a workaround in your EMA that doesn't require deepcopy, or run the MoE model only in data-parallel mode, which doesn't have AllToAll in the forward pass, though the distributed performance will be very poor at large scale.
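Below is a minimal sketch of the second option, an EMA that never calls deepcopy: it keeps a shadow copy of the parameters and buffers as plain tensors, so the non-picklable process-group handles inside the MoE layer are never copied. The `TensorEMA` helper is hypothetical, not part of Tutel or PyTorch.

```python
import torch

class TensorEMA:
    """Hypothetical EMA helper that avoids copy.deepcopy of the module."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Clone parameters/buffers into detached tensors; no process groups are touched.
        self.shadow = {name: t.detach().clone() for name, t in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # Standard EMA update: shadow = decay * shadow + (1 - decay) * current.
        for name, t in model.state_dict().items():
            if t.is_floating_point():
                self.shadow[name].mul_(self.decay).add_(t, alpha=1.0 - self.decay)
            else:
                self.shadow[name].copy_(t)  # non-float buffers (e.g. counters) are just tracked

    def copy_to(self, model: torch.nn.Module):
        # Load the averaged weights into a separately constructed model of the same architecture.
        model.load_state_dict(self.shadow, strict=True)
```

For evaluation, you would build the transformer a second time (instead of deepcopying a live instance) and call `copy_to` on it, so the EMA weights end up in a fresh model that owns its own communication state.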
Thanks for your prompt reply!
You can go through these examples to convert training checkpoints between the distributed version and the single-device version: https://github.com/microsoft/tutel#how-to-convert-checkpoint-files-that-adapt-to-different-distributed-world-sizes
Thanks for your quick update on this feature! I notice you use mpiexec to launch the job and save the checkpoint. If I use torch.distributed.launch to train my MoE model, is it still valid to use tutel/checkpoint/gather.py to combine my checkpoints?
Yes, both are compatible.
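As an illustration, here is a hedged sketch of how per-rank checkpoints are typically written when training is started with `python -m torch.distributed.launch --nproc_per_node=8 train.py`; the output directory and the `rank{N}` filename suffix are assumptions for illustration, not a naming convention required by Tutel's gather script.

```python
import os
import torch
import torch.distributed as dist

def save_rank_checkpoint(model: torch.nn.Module, out_dir: str):
    # Each rank saves its own shard of the (expert) parameters; a gather/convert
    # step can later merge the per-rank files into a single-device checkpoint.
    rank = dist.get_rank()
    os.makedirs(out_dir, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(out_dir, f"model.pt.rank{rank}"))  # assumed filename pattern
```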
Hi, thanks for this awesome project!
I built my transformer model based on the MoeMlp layer, and I use EMA for better performance. However, when I try to initialize my EMA model with
ema_model = copy.deepcopy(my_transformer_model)
I encounter an error. Could you help me with that? How can I use EMA with Tutel? Thanks!
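In case it helps with diagnosing this, here is a small hypothetical sketch (not from Tutel) that walks the module tree and reports which submodules cannot be deep-copied; the innermost entries it returns usually point at the layer holding the non-picklable process-group handle.

```python
import copy
import torch

def find_uncopyable_modules(model: torch.nn.Module):
    # Try to deepcopy every submodule and collect the ones that fail.
    # Parents of a failing layer will also fail, so look at the deepest names.
    bad = []
    for name, module in model.named_modules():
        try:
            copy.deepcopy(module)
        except Exception as exc:
            bad.append((name or "<root>", repr(exc)))
    return bad
```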