My code seems to hang when skip_remainder_batch=False. #182
Comments
{"max_len_a": 1.2, "max_len_b": 10} means that max_len will differ in different GPUs(please see https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335-L576). |
Thanks for the information. This is a dup of #173. We'll update the fairseq patch to add
If I understand correctly, I respectfully disagree with this view. In #173, the tokens are inequivalent across devices. In this case, however, some devices perform no forward pass at all because of the difference in max_len (https://github.com/facebookresearch/fairseq/blob/main/fairseq/sequence_generator.py#L335).
That's interesting. If it's true that one GPU performs 5 forward passes and another performs 6, does traditional data parallelism even work? I think the application itself has to do something special to avoid that, right?
The number of forward passes may differ across GPUs during validation. In some codebases, validation forward passes are even performed only on GPU 0. In this code, two sentences of different lengths (e.g., "How are you" and "Thank you") are distributed to two GPUs. The BLEU score seems to be computed word by word. As a result, "How" and "Thank" are treated as data-parallel with each other, and so are "are" and "you". However, there is no word left to pair with the final "you" in "How are you".
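A hypothetical sketch (not project code) of why a mismatched number of forward passes hangs an MoE layer: every forward issues a collective all-to-all, so the rank that issues the extra call blocks forever waiting for peers that have already left the loop.

```python
import torch
import torch.distributed as dist

def validation_loop(local_batches):
    # len(local_batches) can differ per rank when data is split unevenly
    for batch in local_batches:
        out = torch.empty_like(batch)
        # collective call: every rank in the process group must reach it
        dist.all_to_all_single(out, batch)
    # a rank with fewer batches exits the loop here while the others are
    # still blocked inside all_to_all_single, so the job hangs
```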
OK, this root cause makes sense. The MoE layer within one process does not know whether the application intends the other processes to forward through the MoE layer together with it. Even if it knew that some processes were suspended and would not join the validation procedure, the expert parameters stored in those processes would still be inaccessible to the validation process. So it seems there is no solution other than a change on the application side, namely running evaluation in every process. Can you point to the code lines that perform the related validation procedure?
Yes, I pretty much agree that running evaluation in every process can fix this hang. In this code, we delete some BLEU args (i.e., max_len_a, max_len_b) to fix it. So we retrain our transformer by
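An alternative application-side workaround, sketched under my own assumptions rather than taken from the patch referenced above: synchronize max_len across ranks before generation starts, so every rank runs the same number of decoding steps even when source lengths differ.

```python
import torch
import torch.distributed as dist

def synced_max_len(src_len, max_len_a=1.2, max_len_b=10):
    # compute the local limit, then take the maximum over all ranks
    local = torch.tensor([int(max_len_a * src_len + max_len_b)], device="cuda")
    dist.all_reduce(local, op=dist.ReduceOp.MAX)  # all ranks agree on one value
    return int(local.item())
```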
Describe the bug
Hi, Authors. My code seems to hang when skip_remainder_batch=False.
To Reproduce
Steps to reproduce the behavior:
Logs
The likely problem is that not all devices receive data in the last validation iteration, so the all-to-all keeps waiting for the other processes. With SKIP_MOE=1, this phenomenon does not occur.
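My reading of this report, as a small illustrative sketch (not the library's implementation): skipping the remainder batch keeps the number of validation iterations, and hence the number of all-to-all calls, identical on every rank.

```python
def iterations_on_rank(rank, num_batches, world_size, skip_remainder_batch):
    # batches are dealt out round-robin; the last partial round only reaches
    # the first (num_batches % world_size) ranks
    full_rounds = num_batches // world_size
    if skip_remainder_batch:
        return full_rounds                      # identical on every rank
    extra = 1 if rank < num_batches % world_size else 0
    return full_rounds + extra                  # can differ across ranks -> hang
```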