How to reproduce your results on WMT'14 ENDE datasets of "Attention is All You Need"? #637
Comments
Yes, happy to. We are planning on posting these models and more details. It would be great if you could reproduce.
Thanks for your quick reply. I'll try to reproduce this result on OpenNMT-py.
If you want to compare, here is the link to the trained model: When preprocessing, make sure you use a sequence length of 100 and -share_vocab. Cheers.
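For reference, a minimal sketch of what such a preprocessing call might look like with the legacy OpenNMT-py preprocess.py. File paths are placeholders and flag names can differ between OpenNMT-py versions, so check preprocess.py --help before copying:

```
# Assumes train/valid files are already tokenized with the SentencePiece model.
python preprocess.py \
    -train_src data/train.en.sp -train_tgt data/train.de.sp \
    -valid_src data/valid.en.sp -valid_tgt data/valid.de.sp \
    -save_data data/wmt_ende \
    -src_seq_length 100 -tgt_seq_length 100 \
    -share_vocab
```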
Hi Vincent @vince62s, I wonder how you preprocessed the WMT14-EN-DE corpus? I downloaded the data from your provided link.
Thanks very much!
I used SentencePiece to tokenize the corpus (instead of BPE).
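For anyone unfamiliar with SentencePiece, here is a rough sketch of training one joint model on both language sides and encoding the corpus into pieces. The vocab size (32k) and file names are assumptions for illustration, not necessarily the exact settings used here:

```
# Train one shared SentencePiece model on both languages (joint vocab, matches -share_vocab).
spm_train --input=train.en,train.de --model_prefix=wmtende \
          --vocab_size=32000 --character_coverage=1.0

# Encode raw text into pieces before running preprocess.py.
spm_encode --model=wmtende.model --output_format=piece < train.en > train.en.sp
spm_encode --model=wmtende.model --output_format=piece < train.de > train.de.sp
```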
Hi @vince62s, I'm getting a slightly lower BLEU after following the suggestions in this thread. Could you take a look and see if we are using the same config? After 21 epochs, my 6-layer Transformer model gets 26.02 / 27.21 on the valid and test sets, and I guess you got ~26.4 / 27.8? Here are the commands I used for preprocessing, training and evaluation:
preprocessing:
training (bs=20k, warmup=16k):
translate (alpha=0.6):
evaluate:
I'm using the SentencePiece model. Thanks!
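Since the exact commands above did not survive formatting, here is a rough reconstruction of a Transformer training command in the spirit of the OpenNMT-py FAQ, using the values mentioned in this thread (batch 4096 tokens, accum_count 2, warmup 8000, 6 GPUs). Treat the remaining hyperparameters and flag names as assumptions and verify them against the FAQ for your OpenNMT-py version:

```
python train.py -data data/wmt_ende -save_model models/wmt_ende \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 \
    -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -dropout 0.1 \
    -train_steps 200000 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 6 -gpu_ranks 0 1 2 3 4 5
```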
I used warmup 8000 and optim sparseadam. However, as you can see in the issues/PRs, there is a bug: Adam states are reset during a train_from. It may have a very slight impact. I think if you let it go, you will end up with results similar to mine (go up to 40 and average the last 10 checkpoints).
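On the "average the last 10" point: OpenNMT-py ships a small checkpoint-averaging script (tools/average_models.py in the versions I have looked at). A sketch of its use; the flag names and file paths here are assumptions, so check the script's --help:

```
# Average the last N saved checkpoints into a single model file;
# list all the checkpoint files you want to include explicitly.
python tools/average_models.py \
    -models models/wmt_ende_step_190000.pt models/wmt_ende_step_200000.pt \
    -output models/wmt_ende_avg.pt
```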
Thanks @vince62s.
Thanks @srush @vince62s.
Read the SentencePiece doc: https://github.com/google/sentencepiece
Thanks for the detailed instructions. I was able to train the transformer model to get ~4 validation perplexity. However, when running the translate command, it is erroring out for me. Any pointers on what could be going wrong would be helpful.
Command I ran: python translate.py -model model_acc_70.55_ppl_4.00_e50.pt -src ../../wmt14-ende/test.en -tgt ../../wmt14-ende/test.de -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu
Log: Loading model parameters.
Update: when I don't specify -tgt, it runs to completion fine.
@srush I think in the original implementation (tensorflow_models and tensor2tensor), they use -share_embedding as well as -share_decoder_embeddings?
@vince62s dataset: my commands:
python train.py -data ./data/wmt_ende/processed -save_model ./models/wmt_ende
python translate.py -model models/wmt_ende_step_30000.pt -src data/wmt_ende/test.en -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu -report_bleu -tgt data/wmt_ende/test.de
I am new to translation; how can I do this? Thanks.
You are so nice, thanks.
You are scoring pieces, not words.
Hi @mjc14, I want to ask about your new BLEU result after scoring words with your listed params. Thanks a lot!
@vince62s Can I just detokenize the translation result by
No, you need to use spm_decode.
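In other words, decode the pieces back into plain words before scoring. A sketch of that step, assuming the SentencePiece model is named wmtende.model and using the Moses multi-bleu.perl script; the exact scorer and file names are assumptions, the important part is that both hypothesis and reference are scored at word level, not piece level:

```
# Turn SentencePiece pieces back into detokenized words.
spm_decode --model=wmtende.model --input_format=piece \
    < pred.de.sp > pred.de.detok

# Score against a reference that is in the same (word-level) form.
perl multi-bleu.perl ref.de < pred.de.detok
```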
You are so nice, many thanks.
Hi @vince62s, I tried spm_decode, but it did not solve my problem. I trained the Transformer model using the following params, but my BLEU is only around 21. The only difference is that I set train_steps to 200000, while you set epochs to 50. Maybe 1 epoch = 20000 steps, so 200000 steps only equals 10 epochs? I am not sure about this.
By the way, did you shuffle when you preprocessed the training dataset?
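For what it's worth, one common way to shuffle a parallel corpus while keeping the source and target lines aligned (file names here are placeholders):

```
# Pair up the two sides, shuffle the pairs, then split them again.
paste -d '\t' train.en train.de | shuf > train.shuf.tsv
cut -f1 train.shuf.tsv > train.shuf.en
cut -f2 train.shuf.tsv > train.shuf.de
```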
Hi @taoleicn, what does
Hi all, up to now I can only reach BLEU 25.00. Could someone help?
Hi @vince62s, I tried your uploaded pre-trained model (https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz) and got BLEU = 28.0.
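For others who want to try the released model, a sketch of downloading it and translating the test set with the beam settings used earlier in this thread. The model file name inside the tarball is a placeholder, and the source file is assumed to be already SentencePiece-encoded, so check the archive contents after extracting:

```
wget https://s3.amazonaws.com/opennmt-models/transformer-ende-wmt-pyOnmt.tar.gz
tar -xzf transformer-ende-wmt-pyOnmt.tar.gz   # inspect the extracted .pt and SentencePiece model files

python translate.py -model <extracted_model>.pt \
    -src test.en.sp -output pred.de.sp \
    -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 \
    -length_penalty wu -coverage_penalty wu
```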
What test set are you scoring?
@vince62s I use
At 70000 steps it should be above 27 already. |
@vince62s I shuffled the training dataset before running preprocess.py. Besides, I set seq_length to 150.
But in #1093, using a seq_length of 150 instead of 100 leads to a much better result, which confuses me.
Maybe there are some bugs in the translate process. I observed that most translated sentences end with
Are you on master?
Yes, just not up to date.
I am currently running the following system: My translate command line is:
@vince62s many thanks for verifying the code.
NewsCommentary v11 or v13 are not so different; the impact should be minimal.
Hi @vince62s, I am not familiar with the multi-GPU scheduler. But why "6 GPUs with accum_count 2, batch_size 4096 tokens => 49152 tokens per true batch"? I thought the true batch should be 4096 * 2 tokens, determined by accum_count, and that 6 GPUs only speed up the computation without affecting the true batch size. Is my understanding right? Thanks.
No, in sync training we send 4096 tokens to each GPU, calculate the gradients, and gather everything before updating the parameters.
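So the effective number of tokens per optimizer update is the product of the number of GPUs, the accumulation count, and the per-GPU token batch size:

```
# 6 GPUs x accum_count 2 x 4096 tokens per GPU batch
echo $(( 6 * 2 * 4096 ))   # 49152 tokens per true batch (per parameter update)
```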
Hi @vince62s, how many hours does it take to train on 6 GPUs for 30,000 steps with the params you mentioned? Thanks. Does the process of distributing and gathering cost a lot of time?
About 2 1/2 hours per 10K steps |
So at roughly 0.9 s per step, 50 steps take only about 45 seconds. That's fast.
https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz. |
I think so. |
Then we can do preprocessing to get the ready-to-use dataset.
Is that correct?
I did not set the param "-shard_size 200000000".
@Dhanasekar-S I answered you by email. Once you have tokenized data, you can preprocess it to prepare the .pt pickle files.
Hi, I want to reproduce the results on the WMT'14 EN-DE dataset from the "Attention is All You Need" paper. I have read the OpenNMT FAQ, and I would like to know the exact details of your experiments:
Thank you very much! @srush