How to reproduce the result on WMT14 DE-EN? #2003

Closed
Yuran-Zhao opened this issue Jan 28, 2021 · 7 comments
Comments

@Yuran-Zhao

I've been trying to reproduce the result on WMT14 DE-EN. According to the paper "Attention Is All You Need", the Transformer model should achieve 27.3 BLEU on newstest2014, but I only got 17.2 with sacrebleu.

I took a look at #637 and #1862. However, some commands and parameters were removed in OpenNMT-py 2.0, which left me a bit confused.

The commands I used are as follows:

1. Prepare the data

I used the script here https://github.com/OpenNMT/OpenNMT-py/blob/master/examples/scripts/prepare_wmt_data.sh

#!/bin/bash

##################################################################################
# The default script downloads the commoncrawl, europarl and newstest2014 and
# newstest2017 datasets. Files that are not English or German are removed in
# this script for tidyness.You may switch datasets out depending on task.
# (Note that commoncrawl europarl-v7 are the same for all tasks).
# http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
# http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz
#
# WMT14 http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz
# WMT15 http://www.statmt.org/wmt15/training-parallel-nc-v10.tgz
# WMT16 http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
# WMT17 http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz
# Note : there are very little difference, but each year added a few sentences
# new WMT17 http://data.statmt.org/wmt17/translation-task/rapid2016.tgz
#
# For WMT16 Rico Sennrich released some News back translation
# http://data.statmt.org/rsennrich/wmt16_backtranslations/en-de/
#
# Tests sets: http://data.statmt.org/wmt17/translation-task/test.tgz
##################################################################################

# provide script usage instructions
if [ $# -eq 0 ]
then
    echo "usage: $0 <data_dir>"
    exit 1
fi

# set relevant paths
SP_PATH=/usr/local/bin
DATA_PATH=$1
TEST_PATH=$DATA_PATH/test

CUR_DIR=$(pwd)

# set vocabulary size and source and target languages
vocab_size=32000
sl=de
tl=en

# Download the default datasets into the $DATA_PATH; mkdir if it doesn't exist
mkdir -p $DATA_PATH
cd $DATA_PATH

echo "Downloading and extracting Commoncrawl data (919 MB) for training..."
wget --trust-server-names http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
tar zxvf training-parallel-commoncrawl.tgz
ls | grep -v 'commoncrawl.de-en.[de,en]' | xargs rm

echo "Downloading and extracting Europarl data (658 MB) for training..."
wget --trust-server-names http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz
tar zxvf training-parallel-europarl-v7.tgz
cd training && ls | grep -v 'europarl-v7.de-en.[de,en]' | xargs rm
cd .. && mv training/europarl* . && rm -r training training-parallel-europarl-v7.tgz

echo "Downloading and extracting News Commentary data (76 MB) for training..."
wget --trust-server-names http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
tar zxvf training-parallel-nc-v11.tgz
cd training-parallel-nc-v11 && ls | grep -v news-commentary-v11.de-en.[de,en] | xargs rm
cd .. && mv training-parallel-nc-v11/* . && rm -r training-parallel-nc-v11 training-parallel-nc-v11.tgz

# Validation and test data are put into the $DATA_PATH/test folder
echo "Downloading and extracting newstest2014 data (4 MB) for validation..."
wget --trust-server-names http://www.statmt.org/wmt14/test-filtered.tgz
echo "Downloading and extracting newstest2017 data (5 MB) for testing..."
wget --trust-server-names http://data.statmt.org/wmt17/translation-task/test.tgz
tar zxvf test-filtered.tgz && tar zxvf test.tgz
cd test && ls | grep -v '.*deen\|.*ende' | xargs rm
cd .. && rm test-filtered.tgz test.tgz && cd ..

# set training, validation, and test corpuses
corpus[1]=commoncrawl.de-en
corpus[2]=europarl-v7.de-en
corpus[3]=news-commentary-v11.de-en
#corpus[3]=news-commentary-v12.de-en
#corpus[4]=news.bt.en-de
#corpus[5]=rapid2016.de-en

validset=newstest2014-deen
testset=newstest2017-ende

cd $CUR_DIR

# retrieve file preparation script from the Moses repository
wget -nc \
 		https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/ems/support/input-from-sgm.perl \
 		-O $TEST_PATH/input-from-sgm.perl

##################################################################################
# Starting from here, original files are supposed to be in $DATA_PATH
# a data folder will be created in scripts/wmt
##################################################################################

export PATH=$SP_PATH:$PATH

# Data preparation using SentencePiece
# First we concat all the datasets to train the SP model
if true; then
 echo "$0: Training sentencepiece model"
 rm -f $DATA_PATH/train.txt
 for ((i=1; i<= ${#corpus[@]}; i++))
 do
  for f in $DATA_PATH/${corpus[$i]}.$sl $DATA_PATH/${corpus[$i]}.$tl
   do
    cat $f >> $DATA_PATH/train.txt
   done
 done
 spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/wmt$sl$tl \
           --vocab_size=$vocab_size --character_coverage=1
 rm $DATA_PATH/train.txt
fi

# Second we use the trained model to tokenize all the files
# This is not necessary, as it can be done on the fly in OpenNMT-py 2.0
# if false; then
#  echo "$0: Tokenizing with sentencepiece model"
#  rm -f $DATA_PATH/train.txt
#  for ((i=1; i<= ${#corpus[@]}; i++))
#  do
#   for f in $DATA_PATH/${corpus[$i]}.$sl $DATA_PATH/${corpus[$i]}.$tl
#    do
#     file=$(basename $f)
#     spm_encode --model=$DATA_PATH/wmt$sl$tl.model < $f > $DATA_PATH/$file.sp
#    done
#  done
# fi

# We concat the training sets into two (src/tgt) tokenized files
# if false; then
#  cat $DATA_PATH/*.$sl.sp > $DATA_PATH/train.$sl
#  cat $DATA_PATH/*.$tl.sp > $DATA_PATH/train.$tl
# fi

#  We use the same tokenization method for a valid set (and test set)
# if true; then
#  perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$validset-src.$sl.sgm \
#     | spm_encode --model=$DATA_PATH/wmt$sl$tl.model > $DATA_PATH/valid.$sl.sp
#  perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$validset-ref.$tl.sgm \
#     | spm_encode --model=$DATA_PATH/wmt$sl$tl.model > $DATA_PATH/valid.$tl.sp
#  perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$testset-src.$sl.sgm \
#     | spm_encode --model=$DATA_PATH/wmt$sl$tl.model > $DATA_PATH/test.$sl.sp
#  perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$testset-ref.$tl.sgm \
#     | spm_encode --model=$DATA_PATH/wmt$sl$tl.model > $DATA_PATH/test.$tl.sp
# fi

# Parse the valid and test sets
if true; then
 perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$validset-src.$sl.sgm \
    > $DATA_PATH/valid.$sl
 perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$validset-ref.$tl.sgm \
    > $DATA_PATH/valid.$tl
 perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$testset-ref.$sl.sgm \
    > $DATA_PATH/test.$sl
 perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$testset-src.$tl.sgm \
    > $DATA_PATH/test.$tl
fi

And the command is ./prepare_wmt_data.sh ../../onmt_data/wmt14-de-en.

2. Build the vocabulary

onmt_build_vocab -config wmt14-de-en.yml -n_sample -1

3. Train the model

python train.py --config ./examples/scripts/wmt14-de-en.yml

The contents in the wmt14-de-en.yml are:

save_data: /OpenNMT-py/onmt_data/wmt14-de-en/run/example
# Where the vocab(s) will be written
src_vocab: /OpenNMT-py/onmt_data/wmt14-de-en/run/example.vocab.src
tgt_vocab: /OpenNMT-py/onmt_data/wmt14-de-en/run/example.vocab.src

# Corpus opts:
data:
    commoncrawl:
        path_src: /OpenNMT-py/onmt_data/wmt14-de-en/commoncrawl.de-en.de
        path_tgt: /OpenNMT-py/onmt_data/wmt14-de-en/commoncrawl.de-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: /OpenNMT-py/onmt_data/wmt14-de-en/europarl-v7.de-en.de
        path_tgt: /OpenNMT-py/onmt_data/wmt14-de-en/europarl-v7.de-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: /OpenNMT-py/onmt_data/wmt14-de-en/news-commentary-v11.de-en.de
        path_tgt: /OpenNMT-py/onmt_data/wmt14-de-en/news-commentary-v11.de-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: /OpenNMT-py/onmt_data/wmt14-de-en/valid.de
        path_tgt: /OpenNMT-py/onmt_data/wmt14-de-en/valid.en
        transforms: [sentencepiece]

        
### Transform related opts:
#### Subword
src_subword_model: /OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model
tgt_subword_model: /OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0

#### Filter
src_seq_length: 100
tgt_seq_length: 100

# silently ignore empty lines in the data
skip_empty_level: silent

# # Vocab opts
# ### vocab:
src_vocab: /OpenNMT-py/onmt_data/wmt14-de-en/run/example.vocab.src
tgt_vocab: /OpenNMT-py/onmt_data/wmt14-de-en/run/example.vocab.src
src_vocab_size: 32000
tgt_vocab_size: 32000
vocab_size_multiple: 8
# src_words_min_frequency: 1
# tgt_words_min_frequency: 1
share_vocab: True

# # Model training parameters

# General opts
save_model: ./onmt_data/wmt14-de-en/run/model
keep_checkpoint: 50
save_checkpoint_steps: 10000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 200000
valid_steps: 10000

# Batching
queue_size: 10000
bucket_size: 32768
# pool_factor: 8192
world_size: 2
gpu_ranks: [0, 1]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2.0
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
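As a sanity check on the batching options above, the effective number of tokens per optimizer step is world_size × batch_size × accum_count. A quick sketch with the values copied from this config:

```shell
# Effective tokens per optimizer step implied by the config above:
# 2 GPUs x 4096 tokens per GPU x gradient accumulation of 3
world_size=2
batch_size=4096
accum_count=3
echo $(( world_size * batch_size * accum_count ))   # prints 24576
```

That is roughly the ~25k tokens per batch used for the base Transformer in "Attention Is All You Need", so the batching setup itself looks consistent with the paper.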

4. Translate and evaluate

spm_encode --model=/OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model \
		< /OpenNMT-py/onmt_data/wmt14-de-en/test.en \
		> /OpenNMT-py/onmt_data/wmt14-de-en/test.en.sp
spm_encode --model=/OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model \
		< /OpenNMT-py/onmt_data/wmt14-de-en/test.de \
		> /OpenNMT-py/onmt_data/wmt14-de-en/test.de.sp

for checkpoint in /OpenNMT-py/onmt_data/wmt14-de-en/run/model_step*.pt; do
		echo "# Translating with checkpoint $checkpoint"
		base=$(basename $checkpoint)
		python ../../translate.py \
				-gpu 0 \
				-batch_size 16384 -batch_type tokens \
				-beam_size 5 \
				-model $checkpoint \
				-src /OpenNMT-py/onmt_data/wmt14-de-en/test.de.sp \
				-tgt /OpenNMT-py/onmt_data/wmt14-de-en/test.en.sp \
				-output /OpenNMT-py/onmt_data/wmt14-de-en/test.en.hyp_${base%.*}.sp
done

for checkpoint in /OpenNMT-py/onmt_data/wmt14-de-en/run/model_step*.pt; do
		base=$(basename $checkpoint)
		spm_decode \
				-model=/OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model \
				-input_format=piece \
				< /OpenNMT-py/onmt_data/wmt14-de-en/test.en.hyp_${base%.*}.sp \
				> /OpenNMT-py/onmt_data/wmt14-de-en/test.en.hyp_${base%.*}
done
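For anyone unsure what spm_decode is doing in the loop above: it joins the SentencePiece pieces and turns the ▁ word-boundary markers back into spaces. A rough illustration (a sketch only; the real spm_decode also handles edge cases the model defines):

```shell
# Illustration of SentencePiece piece decoding:
# 1) drop the spaces between pieces, 2) turn ▁ markers into spaces,
# 3) strip the leading space from the first marker.
echo '▁How ▁to ▁re produce ▁the ▁result' \
  | sed 's/ //g; s/▁/ /g; s/^ //'
# prints: How to reproduce the result
```

If the hypotheses are scored before this decoding step (i.e. still in piece form), sacrebleu scores will be far off, which is one common cause of unexpectedly low BLEU.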

for checkpoint in /OpenNMT-py/onmt_data/wmt14-de-en/run/model_step*.pt; do
		echo "$checkpoint"
		base=$(basename $checkpoint)
		sacrebleu /OpenNMT-py/onmt_data/wmt14-de-en/test.en < /OpenNMT-py/onmt_data/wmt14-de-en/test.en.hyp_${base%.*}
done

Is there something wrong?

@patrick-nanys

Have there been any advancements on this? I'm having the same problem.

@Yuran-Zhao (Author)

> Has there been any advancements on this? I'm having the same problem.

I successfully reproduced the result with OpenNMT-py 1.x. If you are interested, you can find it in my repository, along with a tutorial for reproducing it.

@chijianlei

I've run into the same problem. Have you solved it in the 2.0 version?

@Yuran-Zhao (Author)

> I have met the same problem, have you solved it in 2.0 version?

No... actually, I want to give up on the 2.0 version :D.

@francoishernandez (Member)

francoishernandez commented Apr 16, 2021

FYI, I just ran the 2.0 example from scratch with a fresh install (pip install OpenNMT-py==2.0.1), without touching anything. With the checkpoint at 75k steps, I get 25.7 BLEU on valid (newstest2014) and 27.0 on test (newstest2017).
Not sure what's going on with your 17.2; maybe some tokenization issue or mismatch?

@chijianlei

I have successfully run the 2.0 example now. One issue is that I needed to add the -share_vocab parameter, which the example does not include.
I will report the result as soon as I complete the training.

@francoishernandez (Member)

Yes, you're right; I forgot to mention that the vocab options may not be fully up to date in the example.
It would be great if you could open a PR to update the example with the working adaptation.
