
The low score problem in transformer wmt32k #455

Closed
liesun1994 opened this issue Dec 4, 2017 · 13 comments

Comments

@liesun1994

Hello,
I am using an early version (1.2.9) of tensor2tensor and I am trying to reproduce the WMT results. With two GPUs and the configuration below, newstest2014 only reaches 23.50 BLEU. The configuration is as follows:

# define the problem and model 
PROBLEM=translate_ende_wmt32k
MODEL=transformer
HPARAMS=transformer_base_single_gpu

DATA_DIR=/data1/kwang/t2t/t2t_data/wmt32k_long
TMP_DIR=/data1/kwang/t2t_datagen/wmt32k
TRAIN_DIR=/data1/kwang/t2t/t2t_train/wmt32k_long/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM


WORKER_GPU=2
WORKER_GPU_MEMORY_FRACTION=0.95

# Train
# *  If you run out of memory, add --hparams='batch_size=1024'.
export CUDA_VISIBLE_DEVICES=0,1 
t2t-trainer \
  --train_steps=400000 \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --worker_gpu=$WORKER_GPU \
  --worker_gpu_memory_fraction=$WORKER_GPU_MEMORY_FRACTION

I am using transformer_base_single_gpu with train_steps=400000, and the model is trained on the tokenized data set. All other parameters are left at their defaults. Does the number of GPUs have a large influence on the WMT results? Or is there something wrong with my configuration? Thanks.

@liesun1994
Author

I am testing the Transformer on other datasets too. On the LDC Chinese-to-English dataset, an RNN with attention got 34.5 BLEU (using open-source code), while the Transformer (tensor2tensor) got 41.57, which improves the LDC CH-EN results.

@martinpopel
Contributor

Does the number of GPUs have a large influence on the WMT results?

It seems that yes, see #444.
It also depends on the batch_size. The current default value for transformer_base_single_gpu is 2048, but the Attention Is All You Need paper reports 25000. This means you need to train for at least 12 times more steps than in the paper to get comparable results. And if you have just two GPUs instead of eight, you need to train 4 times longer. So, considering both, you need to train 48 times longer: in the paper they used 100k steps for the base models, so you should use 4.8M steps.
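For illustration, the two knobs discussed above map onto the trainer flags like this; batch_size=4096 and train_steps=1000000 are only assumed example values to show the syntax, not a recommendation (use whatever batch fits into your GPU memory and scale the steps accordingly):

# Assumed example only: override the per-GPU batch size and raise the step
# count to compensate for the smaller effective batch (2 GPUs instead of 8).
t2t-trainer \
  --train_steps=1000000 \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --hparams='batch_size=4096' \
  --output_dir=$TRAIN_DIR \
  --worker_gpu=$WORKER_GPU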

Or is there something wrong with my configuration?

worker_gpu_memory_fraction is 0.95 by default.
And don't forget to do checkpoint averaging for the best results.

@liesun1994
Author

@martinpopel Thanks for your quick reply. If the devices allow it, I will try it and report the new result.

@liesun1994
Author

@martinpopel I tried it; the BPE result on newstest2014 reached 26.07 BLEU (3 GPUs and batch_size=3072), and training has not finished yet. By the way, have you extracted the Transformer as a separate module? If so, would it be possible to send that separate module to me?

@martinpopel
Contributor

@liesun1994: I don't understand what you mean by "extract the Transformer as a separate module". For training, I am using T2T without any changes.

but the Attention Is All You Need paper reports 25000.
This means you need to train for at least 12 times more steps than in the paper to get comparable results.

As I think about it now, I must correct my previous claim.
In the paper, they use eight NVIDIA P100 GPUs, and I think each has 16 GB of memory.
It is highly unlikely that you could fit a 25k-subword batch with the transformer_big model into 16 GB.
It is more likely that the batch size per GPU was about 3072, and thus the effective batch size across all eight GPUs is about 3072 × 8 = 24576, which is what the paper reports as 25000.
@lukaszkaiser: Can you please confirm this?
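As a quick check of that arithmetic in shell (the 3072 per-GPU batch is the assumption stated above, not a number confirmed by the paper):

# Effective batch size = per-GPU batch size (in subword tokens) * number of GPUs.
PER_GPU_BATCH=3072   # assumed per-GPU batch, see the comment above
NUM_GPUS=8
echo $((PER_GPU_BATCH * NUM_GPUS))   # 24576, roughly the 25000 reported in the paper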

@liesun1994
Author

Sorry for my poor English 😆. T2T contains a lot of tasks, and the Transformer is the module we want.

@liesun1994
Author

It is just that the code is difficult to understand and modify.

@njoe9

njoe9 commented Dec 14, 2017

Hi, @martinpopel
I have a question: how do I do checkpoint averaging to get the best results?
Thanks.

@martinpopel
Contributor

@njoe9: I save checkpoints every hour and average the last 8 or 16 (or even 32) checkpoints. In the early training phases it is better to average fewer checkpoints (probably because the older checkpoints produce notably worse BLEU than the most recent checkpoint).
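As for the mechanics, here is a minimal sketch, assuming the avg_checkpoints.py utility bundled with T2T (the script path and flag names may differ between versions, so check your installation):

# Average the last 16 checkpoints saved in $TRAIN_DIR into one checkpoint.
# Paths and flag names are assumptions based on the T2T utility script.
python tensor2tensor/utils/avg_checkpoints.py \
  --prefix=$TRAIN_DIR/ \
  --num_last_checkpoints=16 \
  --output_path=$TRAIN_DIR/averaged/model.ckpt

Decoding and evaluation are then run from the averaged checkpoint instead of the latest one.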

BTW: I think the discussion is diverging. The original question has been answered (the number of GPUs and batch size are important when comparing results after a given number of steps), so @liesun1994 can close this issue, to keep the list of open issues tidy.


@njoe9

njoe9 commented Dec 25, 2017

Thanks. @martinpopel @liesun1994

@xuekun90

@liesun1994, as you mentioned, the Transformer model got 41.57 BLEU (tensor2tensor) on the LDC Chinese-to-English dataset; wow, I only got 21.43... Could you please share the configuration you used to train on the LDC dataset? Thanks.

@echan00

echan00 commented Nov 3, 2018

@liesun1994 Same here, I am quite interested in learning what you did with the LDC Chinese-to-English dataset. Where can I download a copy as well?
