How to reproduce your results on WMT'14 ENDE datasets of "Attention is All You Need"? #637
Comments
Yes, happy to. We are planning on posting these models and more details. It would be great if you could reproduce.
Thanks for your quick reply. I'll try to reproduce this result on OpenNMT-py.
If you want to compare, here is the link to the trained model: When preprocessing, make sure you use a sequence length of 100 and -share_vocab. Cheers.
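For reference, a minimal sketch of what such a preprocessing call might look like with the legacy OpenNMT-py preprocess.py. File paths are placeholders and flag names can differ between OpenNMT-py versions, so check preprocess.py --help before copying:

```
# Assumes train/valid files are already tokenized with the SentencePiece model.
python preprocess.py \
    -train_src data/train.en.sp -train_tgt data/train.de.sp \
    -valid_src data/valid.en.sp -valid_tgt data/valid.de.sp \
    -save_data data/wmt_ende \
    -src_seq_length 100 -tgt_seq_length 100 \
    -share_vocab
```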
Hi Vincent @vince62s, I wonder how you preprocessed the WMT14-EN-DE corpus? I downloaded the data from your provided link.
Thanks very much!
I used SentencePiece to tokenize the corpus (instead of BPE).
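For anyone unfamiliar with SentencePiece, here is a rough sketch of training one joint model on both language sides and encoding the corpus into pieces. The vocab size (32k) and file names are assumptions for illustration, not necessarily the exact settings used here:

```
# Train one shared SentencePiece model on both languages (joint vocab, matches -share_vocab).
spm_train --input=train.en,train.de --model_prefix=wmtende \
          --vocab_size=32000 --character_coverage=1.0

# Encode raw text into pieces before running preprocess.py.
spm_encode --model=wmtende.model --output_format=piece < train.en > train.en.sp
spm_encode --model=wmtende.model --output_format=piece < train.de > train.de.sp
```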
Hi @vince62s, I'm getting a slightly lower BLEU after following the suggestions in this thread. Could you take a look and see if we are using the same config? After 21 epochs, my 6-layer Transformer model gets 26.02 / 27.21 on the valid and test sets, and I guess you got ~26.4 / 27.8? Here are the commands I used for preprocessing, training and evaluation:
preprocessing:
training (bs=20k, warmup=16k):
translate (alpha=0.6):
evaluate:
I'm using the SentencePiece model. Thanks!
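Since the exact commands above did not survive formatting, here is a rough reconstruction of a Transformer training command in the spirit of the OpenNMT-py FAQ, using the values mentioned in this thread (batch 4096 tokens, accum_count 2, warmup 8000, 6 GPUs). Treat the remaining hyperparameters and flag names as assumptions and verify them against the FAQ for your OpenNMT-py version:

```
python train.py -data data/wmt_ende -save_model models/wmt_ende \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 \
    -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -dropout 0.1 \
    -train_steps 200000 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 6 -gpu_ranks 0 1 2 3 4 5
```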
I used warmup 8000 and optim sparseadam. However, as you can see in the issues/PRs, there is a bug: Adam states are reset during a train_from. It may have a very slight impact. I think if you let it go, you will end up with results similar to mine (go up to 40 and average the last 10 checkpoints).
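On the "average the last 10" point: OpenNMT-py ships a small checkpoint-averaging script (tools/average_models.py in the versions I have looked at). A sketch of its use; the flag names and file paths here are assumptions, so check the script's --help:

```
# Average the last N saved checkpoints into a single model file;
# list all the checkpoint files you want to include explicitly.
python tools/average_models.py \
    -models models/wmt_ende_step_190000.pt models/wmt_ende_step_200000.pt \
    -output models/wmt_ende_avg.pt
```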
Thanks @vince62s.
Thanks @srush @vince62s.
Read the SentencePiece doc: https://github.com/google/sentencepiece
Thanks for the detailed instructions. I was able to train the transformer model to get ~4 validation perplexity. However, when running the translate command, it is erroring out for me. Any pointers on what could be going wrong would be helpful.
Command I ran: python translate.py -model model_acc_70.55_ppl_4.00_e50.pt -src ../../wmt14-ende/test.en -tgt ../../wmt14-ende/test.de -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu
Log: Loading model parameters.
Update: when I don't specify -tgt, it runs to completion fine.
@srush I think in the original implementation (tensorflow_models and tensor2tensor), they use -share_embedding as well as -share_decoder_embeddings?
@vince62s dataset: my commands:
python train.py -data ./data/wmt_ende/processed -save_model ./models/wmt_ende
python translate.py -model models/wmt_ende_step_30000.pt -src data/wmt_ende/test.en -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu -report_bleu -tgt data/wmt_ende/test.de
I am new to translation; how can I do this? Thanks.
You are so nice, thanks.
You are scoring pieces, not words.
Hi @mjc14, I want to ask about your new BLEU result after scoring words with your listed params. Thanks a lot!
@vince62s Can I just detokenize the translation result by
No, you need to use spm_decode.
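In other words, decode the pieces back into plain words before scoring. A sketch of that step, assuming the SentencePiece model is named wmtende.model and using the Moses multi-bleu.perl script; the exact scorer and file names are assumptions, the important part is that both hypothesis and reference are scored at word level, not piece level:

```
# Turn SentencePiece pieces back into detokenized words.
spm_decode --model=wmtende.model --input_format=piece \
    < pred.de.sp > pred.de.detok

# Score against a reference that is in the same (word-level) form.
perl multi-bleu.perl ref.de < pred.de.detok
```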
You are so nice, many thanks.
Hi @vince62s, I tried spm_decode, but it did not solve my problem. I trained the Transformer model using the following params, but my BLEU is only around 21. The only difference is that I set train_steps to 200000, while you set epochs to 50. Maybe 1 epoch = 20000 steps, so 200000 steps only equals 10 epochs? I am not sure about this.
By the way, did you shuffle when you preprocessed the training dataset?
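For what it's worth, one common way to shuffle a parallel corpus while keeping the source and target lines aligned (file names here are placeholders):

```
# Pair up the two sides, shuffle the pairs, then split them again.
paste -d '\t' train.en train.de | shuf > train.shuf.tsv
cut -f1 train.shuf.tsv > train.shuf.en
cut -f2 train.shuf.tsv > train.shuf.de
```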
Hi @taoleicn, what does
Hi all, up to now I can only reach BLEU 25.00. Could someone help?
Hi @vince62s, I tried your uploaded pre-trained model (https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz) and got BLEU = 28.0.
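For others who want to try the released model, a sketch of downloading it and translating the test set with the beam settings used earlier in this thread. The model file name inside the tarball is a placeholder, and the source file is assumed to be already SentencePiece-encoded, so check the archive contents after extracting:

```
wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
tar -xzf transformer-ende-wmt-pyOnmt.tar.gz   # inspect the extracted .pt and SentencePiece model files

python translate.py -model <extracted_model>.pt \
    -src test.en.sp -output pred.de.sp \
    -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 \
    -length_penalty wu -coverage_penalty wu
```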
What test set are you scoring?
@vince62s I use
At 70000 steps it should be above 27 already. |
@vince62s I shuffled the training dataset before running preprocess.py. Besides, I set seq_length to 150.
But in #1093, using a seq_length of 150 instead of 100 leads to a much better result, which confuses me.
Maybe there are some bugs in the translate process. I observed that most translated sentences end with
Are you on master?
Yes, just not up to date.
I am currently running the following system: My translate command line is:
@vince62s many thanks for verifying the code.
NewsCommentary v11 or v13 are not so different; the impact should be minimal.
Hi @vince62s, I am not familiar with the multi-GPU scheduler. But why "6 GPUs with accum_count 2, batch_size 4096 tokens => 49152 tokens per true batch"? I thought the true batch should be 4096 * 2 tokens, determined by accum_count, and that 6 GPUs only speed up the computation without affecting the true batch size. Is my understanding right? Thanks.
No, in sync training we send 4096 tokens to each GPU, calculate the gradients, and gather everything before updating the parameters.
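So the effective number of tokens per optimizer update is the product of the number of GPUs, the accumulation count, and the per-GPU token batch size:

```
# 6 GPUs x accum_count 2 x 4096 tokens per GPU batch
echo $(( 6 * 2 * 4096 ))   # 49152 tokens per true batch (per parameter update)
```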
Hi @vince62s, how many hours does it take to train on 6 GPUs for 30,000 steps with the params you mentioned? Thanks. Does the process of distributing and gathering cost a lot of time?
About 2 1/2 hours per 10K steps |
So at roughly 0.9 s per step, 50 steps take only about 45 seconds. That's fast.
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz. |
I think so. |
Then we can do preprocessing to get the ready-to-use dataset.
Is that correct?
I did not set the param "-shard_size 200000000".
@Dhanasekar-S I answered you by email. Once you have tokenized data, you can preprocess it to prepare the .pt pickle files.
Hi, I want to reproduce the results on the WMT'14 EN-DE dataset from the "Attention is All You Need" paper. I have read the OpenNMT FAQ, and I would like to know the exact details of your experiments:
Thank you very much! @srush