NCCL Asynchronous update timeout crash with Tutel MoE #185
Comments
Can you upgrade NCCL and keep its version the same across all environments? Most NCCL timeout issues come from legacy libnccl bugs or from inconsistent NCCL versions. You can also run
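One way to check the "same NCCL version everywhere" condition is to collect the version each rank reports and compare them. The sketch below only does the comparison on plain Python data; the helper name `group_ranks_by_version` and the sample versions are hypothetical. In a PyTorch job the per-rank value would typically come from `torch.cuda.nccl.version()` and be gathered with `torch.distributed.all_gather_object`, but that wiring is an assumption and is not shown here.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, ...]


def group_ranks_by_version(versions_by_rank: Dict[int, Version]) -> Dict[Version, List[int]]:
    """Group rank ids by the NCCL version each rank reported."""
    groups: Dict[Version, List[int]] = {}
    for rank in sorted(versions_by_rank):
        groups.setdefault(tuple(versions_by_rank[rank]), []).append(rank)
    return groups


def versions_consistent(versions_by_rank: Dict[int, Version]) -> bool:
    """True when every rank reported exactly the same version."""
    return len(group_ranks_by_version(versions_by_rank)) <= 1


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    reported = {0: (2, 10, 3), 1: (2, 10, 3), 2: (2, 7, 8)}
    if not versions_consistent(reported):
        for version, ranks in group_ranks_by_version(reported).items():
            print(".".join(map(str, version)), "on ranks", ranks)
```

If the grouping has more than one entry, the ranks listed under the minority version are the environments to upgrade first.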
Thanks @ghostplant for your reply. The example I dug deeper and found that the execution freezes at this line in the code.
OK, since If the combination above doesn't work, you have to change how the application feeds data, so that all GPUs are guaranteed to always have the same number of forward passes and the same execution order.
It worked when I set Just for clarification: in this DDP setting, each GPU holds its own separate copy of the local experts, while the data batch is divided among the GPUs, right? Also, the common (non-expert) architecture is copied to each GPU and synced after each pass? Thank you @ghostplant for your help!
Since Case-1: where
    [GPU-0]            [GPU-1]            [...]
    epoch0-step0       epoch0-step0
    epoch0-step1       epoch0-step1
    ...                ...
    epoch0-step100     epoch0-step100
    epoch0-step101     epoch1-step0       <--
    epoch1-step0       epoch1-step1
    ...                ...
Case-2: where
    [GPU-0]            [GPU-1]
    step-0 (bs=16)     step-0 (bs=16)
    step-1 (bs=16)     step-1 (bs=16)
    ...                ...
    step-50 (bs=16)    step-50 (bs=16)
    step-51 (bs=3)     step-51 (bs=11)    <--
    ...                ...
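The two cases above can be reproduced without any GPU by comparing the per-rank batch schedules. The following is a minimal sketch in plain Python, not Tutel or MMAction code: the helper names (`batch_sizes`, `find_mismatches`, `aligned_schedules`) and the sample counts are hypothetical, chosen only to match the step numbers in the diagram. The fix it illustrates, dropping the incomplete last batch and truncating every rank to the shortest schedule, is one common approach. In PyTorch this is usually approximated with `drop_last=True` on the `DataLoader` and on `DistributedSampler`, which is an assumption about the application and not something this thread confirms.

```python
from typing import List, Tuple


def batch_sizes(num_samples: int, batch_size: int, drop_last: bool = False) -> List[int]:
    """Batch size at every step for one rank's shard of the data."""
    full, rest = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if rest and not drop_last:
        sizes.append(rest)  # incomplete final batch
    return sizes


def find_mismatches(per_rank: List[List[int]]) -> List[Tuple[int, str]]:
    """Return (step, reason) for every step at which the ranks disagree."""
    problems = []
    for step in range(max(len(s) for s in per_rank)):
        sizes = [s[step] if step < len(s) else None for s in per_rank]
        if None in sizes:
            problems.append((step, "step count differs"))  # Case-1
        elif len(set(sizes)) > 1:
            problems.append((step, "batch size differs"))  # Case-2
    return problems


def aligned_schedules(samples_per_rank: List[int], batch_size: int) -> List[List[int]]:
    """Drop incomplete batches, then cut every rank to the shortest schedule."""
    schedules = [batch_sizes(n, batch_size, drop_last=True) for n in samples_per_rank]
    steps = min(len(s) for s in schedules)
    return [s[:steps] for s in schedules]


if __name__ == "__main__":
    # Case-1: GPU-0 has one more step per epoch than GPU-1 (102 vs 101).
    case1 = [batch_sizes(102 * 16, 16), batch_sizes(101 * 16, 16)]
    print(find_mismatches(case1))

    # Case-2: same step count, but the last batch is 3 on GPU-0 and 11 on GPU-1.
    case2 = [batch_sizes(51 * 16 + 3, 16), batch_sizes(51 * 16 + 11, 16)]
    print(find_mismatches(case2))

    # After alignment both ranks run identical schedules.
    print(find_mismatches(aligned_schedules([51 * 16 + 3, 51 * 16 + 11], 16)))
```

Running the same check on the real per-rank sample counts before training gives an early warning of the step at which a collective would otherwise hang.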
Hi, I am using the Tutel library with the MMAction framework to replicate the Swin-v2 MoE performance described in the paper. However, I get this error when I try to train the MoE in a DDP setting.
Can someone please help me in resolving this error?
Alternatively, can you release the object detection code that was used in the Tutel paper?