My code seems to hang when skip_remainder_batch=False. #182
Comments
{"max_len_a": 1.2, "max_len_b": 10} means that max_len will differ in different GPUs(please see https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335-L576). |
Thanks for the information. This is a dup of #173. We'll update the fairseq patch to add
If I understand correctly, I respectfully disagree with this view. In #173, the tokens are inequivalent across devices. In this case, however, some devices perform no forward pass at all because of the difference in max_len (https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335).
That's interesting. If it's true that one GPU performs 5 forward passes and another performs 6, does traditional data parallelism even work? I think the application itself has to do something special to avoid that, right?
The number of forward passes may differ across GPUs during validation. In some codebases, validation forward passes are even performed only on GPU 0. In this code, two sentences of different lengths (e.g., "How are you" and "Thank you") are distributed to two GPUs. The BLEU score seems to be computed word by word. As a result, "How" and "Thank" are treated as data-parallel with each other, and so are "are" and "you". However, there is no word left to pair with the final "you" in "How are you".
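A hypothetical sketch (not project code) of why a mismatched number of forward passes hangs an MoE layer: every forward issues a collective all-to-all, so the rank that issues the extra call blocks forever waiting for peers that have already left the loop.

```python
import torch
import torch.distributed as dist

def validation_loop(local_batches):
    # len(local_batches) can differ per rank when data is split unevenly
    for batch in local_batches:
        out = torch.empty_like(batch)
        # collective call: every rank in the process group must reach it
        dist.all_to_all_single(out, batch)
    # a rank with fewer batches exits the loop here while the others are
    # still blocked inside all_to_all_single, so the job hangs
```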
OK, this root cause makes sense. The MoE layer within one process does not know whether the application intends the other processes to forward through the MoE layer together with it. Even if it knew that some processes were suspended and would not join the validation procedure, the expert parameters stored in those processes would still be inaccessible to the validation process. So it seems there is no solution other than a change on the application side, namely running evaluation in every process. Can you point to the code lines that perform the related validation procedure?
Yes, I pretty much agree that running evaluation in every process can fix this hang. In this code, we delete some BLEU args (i.e., max_len_a, max_len_b) to fix it. So we retrain our transformer by
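An alternative application-side workaround, sketched under my own assumptions rather than taken from the patch referenced above: synchronize max_len across ranks before generation starts, so every rank runs the same number of decoding steps even when source lengths differ.

```python
import torch
import torch.distributed as dist

def synced_max_len(src_len, max_len_a=1.2, max_len_b=10):
    # compute the local limit, then take the maximum over all ranks
    local = torch.tensor([int(max_len_a * src_len + max_len_b)], device="cuda")
    dist.all_reduce(local, op=dist.ReduceOp.MAX)  # all ranks agree on one value
    return int(local.item())
```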
Describe the bug
Hi, Authors. My code seems to hang when skip_remainder_batch=False.
To Reproduce
Steps to reproduce the behavior:
Logs
The likely problem is that not all devices receive data in the last validation iteration, so the all-to-all keeps waiting for the other processes. With SKIP_MOE=1, this phenomenon does not occur.
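My reading of this report, as a small illustrative sketch (not the library's implementation): skipping the remainder batch keeps the number of validation iterations, and hence the number of all-to-all calls, identical on every rank.

```python
def iterations_on_rank(rank, num_batches, world_size, skip_remainder_batch):
    # batches are dealt out round-robin; the last partial round only reaches
    # the first (num_batches % world_size) ranks
    full_rounds = num_batches // world_size
    if skip_remainder_batch:
        return full_rounds                      # identical on every rank
    extra = 1 if rank < num_batches % world_size else 0
    return full_rounds + extra                  # can differ across ranks -> hang
```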