NCCL Asynchronous update timeout crash with Tutel MoE #185

Open
jinga-lala opened this issue Sep 6, 2022 · 5 comments

jinga-lala commented Sep 6, 2022

Hi, I am using the Tutel library with the MMAction framework to replicate the Swin-v2 MoE performance described in the paper. However, I am facing this error when I try to train the MoE in a DDP setting.
Can someone please help me resolve this error?
Alternatively, could you release the object detection code that was used in the Tutel paper?

[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'

  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLTOALL_BASE, Timeout(ms)=300000) ran for 306666 milliseconds before timing out.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6056 closing signal SIGTERM
ghostplant (Contributor) commented:

Can you upgrade NCCL and keep the NCCL version the same across all environments? Most NCCL timeout issues come from legacy libnccl bugs or inconsistent NCCL versions.

You can also run tutel.examples.helloworld in the same distributed setting to test whether it has the same NCCL timeout error.
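
For reference, one way to launch that check across all GPUs on a single node (assuming a torch.distributed.run-style launcher; the process count and batch size below are placeholders):

    python3 -m torch.distributed.run --nproc_per_node=8 -m tutel.examples.helloworld --batch_size=16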

jinga-lala (Author) commented Sep 6, 2022

Thanks @ghostplant for your reply. The example tutel.examples.helloworld works perfectly fine in both single-GPU and multi-GPU, single-node settings. That's why I think it is not an NCCL version issue.
My code works fine in the single-GPU setting, but in the multi-GPU, single-node configuration it crashes with this error.

I dug deeper and found that the execution freezes at this line in the code.
https://github.com/microsoft/tutel/blob/17f4aab9b69cf50dcddd2b985907126379af1568/tutel/impls/moe_layer.py#L292

ghostplant (Contributor) commented Sep 6, 2022

OK, since tutel.examples.helloworld works well, the problem is likely related to inequivalent data sources stored on each GPU, which results in different planned iteration counts locally and thus a different number of model forward calls on each rank. Such a timeout has to be solved on the application side. But you can still try whether enabling both of the following options gets rid of the problem: (1) setting capacity_factor to a negative value when the moe_layer is created in the transformer initialization function; (2) always calling _moe_layer_0.forward(.., inequivalent_tokens=True) in the transformer forward function.
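
A minimal sketch of what those two options could look like, assuming Tutel's tutel.moe.moe_layer API; the class name, constructor arguments, and the exact place capacity_factor is passed (here inside gate_type) are illustrative and may differ across Tutel versions:

    import torch
    from tutel import moe as tutel_moe

    class MoEBlock(torch.nn.Module):
        """Hypothetical transformer block wrapping a Tutel MoE layer."""

        def __init__(self, model_dim=768, num_local_experts=2, hidden_size=3072):
            super().__init__()
            # (1) Create the moe_layer with a negative capacity_factor so the
            #     capacity is derived from the tokens actually present in each batch.
            self._moe_layer_0 = tutel_moe.moe_layer(
                gate_type={'type': 'top', 'k': 2, 'capacity_factor': -1.0},
                model_dim=model_dim,
                experts={'type': 'ffn',
                         'count_per_node': num_local_experts,
                         'hidden_size_per_expert': hidden_size},
            )

        def forward(self, x):
            # (2) Always forward with inequivalent_tokens=True so each rank may
            #     contribute a different number of tokens in the same step.
            return self._moe_layer_0(x, inequivalent_tokens=True)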

If the combination above doesn't work, you have to change the way data is fed on the application side to guarantee that all GPUs always have the same forward counts and execution order.

jinga-lala (Author) commented:

It worked when I set _moe_layer_0.forward(.., inequivalent_tokens=True) 🎉.
Is it because, in object detection, the image sizes are different and so is the number of patches per forward pass in each of the MoE models?

Just for clarification: in this DDP setting, each GPU has its own separate local experts, but the data batch is divided among the GPUs, right? Also, the common architecture is replicated on each GPU and synced after each pass?

Thank you @ghostplant for your help!

ghostplant (Contributor) commented Sep 7, 2022

Since inequivalent_tokens=True works, it means there is no issue with "inequivalent forwarding counts" (see Case-1).
The option is only helpful when, for a given iteration, the "tokens per batch" on each device is not the same as on the other devices (see Case-2).

Case-1: where inequivalent_tokens=True is NOT helpful

        [GPU-0]          [GPU-1]        [...]
     epoch0-step0      epoch0-step0
     epoch0-step1      epoch0-step1
         ...                ...
     epoch0-step100    epoch0-step100
     epoch0-step101    epoch1-step0     <--
     epoch1-step0      epoch1-step1
         ...                ...

Case-2: where inequivalent_tokens=True is helpful

        [GPU-0]          [GPU-1]
     step-0 (bs=16)    step-0 (bs=16)
     step-1 (bs=16)    step-1 (bs=16)
         ...                ...
     step-50 (bs=16)   step-50 (bs=16)
     step-51 (bs=3)    step-51 (bs=11)  <--
         ...                ...
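
For Case-1 (different step counts across ranks), the fix has to come from the data-feeding side, as noted above. A minimal sketch of one common way to keep every rank on the same step count, using torch's DistributedSampler with drop_last (the dataset, batch size, and epoch count are placeholders, and the process group is assumed to be initialized as in any DDP script):

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # Drop the trailing samples/batches so every rank runs the same number of
    # steps per epoch, and therefore the same number of MoE all-to-all calls.
    sampler = DistributedSampler(train_dataset, shuffle=True, drop_last=True)
    loader = DataLoader(train_dataset, batch_size=16, sampler=sampler, drop_last=True)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # keep per-epoch shuffling consistent across ranks
        for batch in loader:
            ...  # forward/backward: identical step counts and ordering on all ranks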
