
The low score problem in transformer wmt32k #455

Closed
liesun1994 opened this issue Dec 4, 2017 · 13 comments

Comments

@liesun1994

Hello,
I am using an early version (1.2.9) of tensor2tensor and I am trying to reproduce the WMT results. With two GPUs and the configuration below, newstest2014 only reaches 23.50 BLEU. The configuration is as follows:

# define the problem and model 
PROBLEM=translate_ende_wmt32k
MODEL=transformer
HPARAMS=transformer_base_single_gpu

DATA_DIR=/data1/kwang/t2t/t2t_data/wmt32k_long
TMP_DIR=/data1/kwang/t2t_datagen/wmt32k
TRAIN_DIR=/data1/kwang/t2t/t2t_train/wmt32k_long/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM


WORKER_GPU=2
WORKER_GPU_MEMORY_FRACTION=0.95

# Train
# *  If you run out of memory, add --hparams='batch_size=1024'.
export CUDA_VISIBLE_DEVICES=0,1 
t2t-trainer \
  --train_steps=400000 \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --worker_gpu=$WORKER_GPU \
  --worker_gpu_memory_fraction=$WORKER_GPU_MEMORY_FRACTION

I am using transformer_base_single_gpu with train_steps=400000, and the model is trained on the tokenized data set. All other parameters are left at their defaults. Does the number of GPUs have a large influence on the WMT results? Or is there something wrong with my configuration? Thanks.

@liesun1994
Author

I am testing the Transformer on other datasets too. On the LDC Chinese-to-English dataset, an RNN with attention got 34.5 BLEU (using open-source code), while the Transformer (tensor2tensor) got 41.57, which improves the LDC CH-EN results.

@martinpopel
Contributor

Does the number of GPUs have a large influence on the WMT results?

It seems that yes, see #444.
It also depends on the batch_size. The current default value for transformer_base_single_gpu is 2048, but the Attention Is All You Need paper reports 25000. This means you need to train for at least 12 times more steps than in the paper to get comparable results. And if you have just two GPUs instead of eight, you need to train 4 times longer. So, considering both, you need to train 48 times longer: in the paper they used 100k steps for the base models, so you should use 4.8M steps.
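For illustration, the two knobs discussed above map onto the trainer flags like this; batch_size=4096 and train_steps=1000000 are only assumed example values to show the syntax, not a recommendation (use whatever batch fits into your GPU memory and scale the steps accordingly):

# Assumed example only: override the per-GPU batch size and raise the step
# count to compensate for the smaller effective batch (2 GPUs instead of 8).
t2t-trainer \
  --train_steps=1000000 \
  --data_dir=$DATA_DIR \
  --problems=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --hparams='batch_size=4096' \
  --output_dir=$TRAIN_DIR \
  --worker_gpu=$WORKER_GPU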

Or is there something wrong with my configuration?

worker_gpu_memory_fraction is 0.95 by default.
And don't forget to do checkpoint averaging for the best results.

@liesun1994
Author

@martinpopel Thanks for your quick reply. If the devices allow it, I will try it and report the new result.

@liesun1994
Author

@martinpopel I tried it; the BPE result on newstest2014 reached 26.07 BLEU (3 GPUs and batch_size=3072), and training has not finished yet. By the way, have you extracted the Transformer as a separate module? If so, would it be possible to send that separate module to me?

@martinpopel
Contributor

@liesun1994: I don't understand what you mean by "extract the Transformer as a separate module". For training, I am using T2T without any changes.

but the Attention Is All You Need paper reports 25000.
This means you need to train for at least 12 times more steps than in the paper to get comparable results.

As I think about it now, I must correct my previous claim.
In the paper, they use eight NVIDIA P100 GPUs, and I think each has 16 GB of memory.
It is highly unlikely that you could fit a 25k-subword batch with the transformer_big model into 16 GB.
It is more likely that the batch size per GPU was about 3072, and thus the effective batch size across all eight GPUs is about 3072 × 8 = 24576, which is what the paper reports as 25000.
@lukaszkaiser: Can you please confirm this?
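As a quick check of that arithmetic in shell (the 3072 per-GPU batch is the assumption stated above, not a number confirmed by the paper):

# Effective batch size = per-GPU batch size (in subword tokens) * number of GPUs.
PER_GPU_BATCH=3072   # assumed per-GPU batch, see the comment above
NUM_GPUS=8
echo $((PER_GPU_BATCH * NUM_GPUS))   # 24576, roughly the 25000 reported in the paper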

@liesun1994
Author

Sorry for my poor English 😆. T2T contains a lot of tasks, and the Transformer is the module we want.

@liesun1994
Author

It is just that the code is difficult to understand and modify.

@njoe9

njoe9 commented Dec 14, 2017

Hi, @martinpopel
I have a question: how do I do checkpoint averaging to get the best results?
Thanks.

@martinpopel
Contributor

@njoe9: I save checkpoints every hour and average the last 8 or 16 (or even 32) checkpoints. In the early training phases it is better to average fewer checkpoints (probably because the older checkpoints produce notably worse BLEU than the most recent checkpoint).
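As for the mechanics, here is a minimal sketch, assuming the avg_checkpoints.py utility bundled with T2T (the script path and flag names may differ between versions, so check your installation):

# Average the last 16 checkpoints saved in $TRAIN_DIR into one checkpoint.
# Paths and flag names are assumptions based on the T2T utility script.
python tensor2tensor/utils/avg_checkpoints.py \
  --prefix=$TRAIN_DIR/ \
  --num_last_checkpoints=16 \
  --output_path=$TRAIN_DIR/averaged/model.ckpt

Decoding and evaluation are then run from the averaged checkpoint instead of the latest one.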

BTW: I think the discussion is diverging. The original question has been answered (the number of GPUs and batch size are important when comparing results after a given number of steps), so @liesun1994 can close this issue, to keep the list of open issues tidy.


@njoe9

njoe9 commented Dec 25, 2017

Thanks. @martinpopel @liesun1994

@xuekun90

@liesun1994, as you mentioned, the Transformer model got 41.57 BLEU (tensor2tensor) on the LDC Chinese-to-English dataset; wow, I only got 21.43... Could you please share the configuration you used to train on the LDC dataset? Thanks.

@echan00

echan00 commented Nov 3, 2018

@liesun1994 Same here, I am quite interested in learning what you did with the LDC Chinese-to-English dataset. Where can I download a copy as well?
