
TokenTextEncoder does not respect reserved tokens #365

Open
rsennrich opened this issue Oct 16, 2017 · 15 comments

@rsennrich

rsennrich commented Oct 16, 2017

I get poor results for translate_ende_wmt_bpe32k (this may be related to issues #317 and #309), and I believe this is due to the reserved tokens not being respected by TokenTextEncoder in combination with the provided vocabulary file.

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .
INFO:tensorflow:Inference results OUTPUT: In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In
In In In In In

{'outputs': array([68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68], dtype=int32), 'problem_choice': 0, 'inputs': array([[ 68],
[ 35],
[ 2196],
[ 0],
[ 2],
[ 651],
[ 55],
[18587],
[15840],
[ 2],
[ 1874],
[ 1763],
[ 260],
[ 1],
[ 1],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0]], dtype=int32)}

The first 10 lines of vocab.bpe.32000 look like this:

,
.
the
in
of
and
die
der
to
und

Note that "," and "." get assigned the indices 0 and 1, conflicting with the reserved tokens.

I started a new training run with a modified vocab:

pad
eos
,
.
the
in
of
and
die
der

and this gives better results (model hasn't converged yet):

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .
INFO:tensorflow:Inference results OUTPUT: In diesem Sinne wird die Maßnahmen in der Lage sein , das amerikanische System zu bekämpfen .

{'outputs': array([ 54, 11958, 11, 10408, 164, 231, 940, 92, 2802,
9, 12051, 7944, 3, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0], dtype=int32), 'problem_choice': 0, 'inputs': array([[24820],
[ 22],
[ 3467],
[ 8753],
[ 111],
[ 322],
[ 48],
[ 4],
[ 229],
[ 10],
[ 4995],
[14095],
[ 7298],
[ 3],
[ 1],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0]], dtype=int32)}

I'd suggest that TokenTextEncoder actually reserve the first two integer IDs for the reserved tokens and start counting from 2, but I realize this would break compatibility with already trained models. Alternatively, you could fix the vocabularies of the pre-defined problems - I don't know how many are affected.
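
Schematically, the suggested id assignment when loading a vocabulary file would look like this (a mock sketch of the indexing only, not the actual TokenTextEncoder code):

# Mock sketch of the suggested id assignment (not the actual
# TokenTextEncoder implementation): reserve ids 0 and 1 for
# <pad>/<EOS> and start numbering the file's tokens at 2.
RESERVED_TOKENS = ["<pad>", "<EOS>"]

def build_vocab(vocab_lines):
    token_to_id = {tok: i for i, tok in enumerate(RESERVED_TOKENS)}
    for offset, tok in enumerate(vocab_lines):
        token_to_id[tok] = len(RESERVED_TOKENS) + offset
    return token_to_id

# With the shipped vocab.bpe.32000, "," would then map to 2 instead of
# colliding with <pad> at id 0.
print(build_vocab([",", ".", "the", "in"]))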

@vince62s
Contributor

vince62s commented Oct 17, 2017

Are you using your regular BPE?
With the out-of-the-box subwords script, the vocab.endefr.32768 file is as follows (EDIT: a few entries were not displaying properly on GitHub and are omitted here):

'<pad>'
'<EOS>'
', '
'.
'
'the_'
'de_'
'in_'
'of_'

@mehmedes

They seem to be referring to regular BPE.
My BPE vocab also starts with "," and "." like Rico's, but I didn't experience any issues with not including pad and eos. I trained my BPE model back on T2T 1.0.11.

@rsennrich
Author

I'm using the problem translate_ende_wmt_bpe32k, which out-of-the-box downloads and uses vocab.bpe.32000 (in v1.2.4).

@mehmedes

mehmedes commented Oct 17, 2017

@rico: Just wondering how you managed to get a translation result in your second example using a BPE model. The input has only been tokenized, not preprocessed with BPE. When using a BPE model, I think T2T requires you to tokenize and apply BPE before submitting text for translation, and to postprocess the inference result by reverting BPE and detokenizing.
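
For reference, reverting BPE on decoded output usually amounts to removing the "@@ " continuation markers before detokenizing; a minimal sketch (the example string is hypothetical):

# Sketch: undo BPE segmentation on a decoded line by removing the
# "@@ " continuation markers; Moses detokenization would follow.
def revert_bpe(line):
    return line.replace("@@ ", "").replace("@@", "")

print(revert_bpe("In th@@ is sen@@ se , the measures will partially under@@ mine ..."))
# -> "In this sense , the measures will partially undermine ..."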

@lukaszkaiser
Contributor

Does this problem still exist in 1.2.5? We tried to correct the reserved tokens in the vocabs, but maybe there's still something missing?

@rsennrich
Author

@mehmedes: the preprocessed test sets are in /tmp/t2t_datagen (or wherever you downloaded wmt16_en_de.tar.gz), for instance newstest2013.tok.bpe.32000.en. From a quick look at the data, the files were (probably) preprocessed like this:

/path/to/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en | \
/path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en | \
/path/to/subword-nmt/apply_bpe.py -c bpe.32000

@lukaszkaiser: there's been no change to the vocabs or the handling of reserved tokens between 1.2.4 and 1.2.5; yes, the problem still persists.

@mehmedes

I was asking because your input showed a clean sentence:

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .

I would have expected the input to be more like

INFO:tensorflow:Inference results INPUT: In th@@ sen@@ etc. 

@rsennrich
Author

Coincidentally, all words in this sentence happen to be frequent in the training data, so BPE leaves them unsegmented.

@rsepassi
Contributor

The behavior of TokenTextEncoder with regard to reserved tokens depends on whether it is constructed from a file or from a list. If it is constructed from a file, the file is expected to already include the reserved tokens (i.e. the first line is <pad> and the second line is <EOS>), if there are any. If it is constructed from a list, the reserved tokens are added automatically. The reason is that if you initialize from a list and then call store_to_file, the resulting file will include the reserved tokens.
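
A rough illustration of that distinction (a mock of the id assignment only, not the real class):

# Mock sketch of the two construction paths described above
# (not the actual TokenTextEncoder implementation).
RESERVED_TOKENS = ["<pad>", "<EOS>"]

def vocab_from_list(tokens):
    # Reserved tokens are prepended automatically.
    return {tok: i for i, tok in enumerate(RESERVED_TOKENS + list(tokens))}

def vocab_from_file(lines):
    # The file is taken as-is, so it must already start with the
    # reserved tokens; otherwise ordinary tokens land on ids 0 and 1.
    return {tok: i for i, tok in enumerate(lines)}

print(vocab_from_list([",", ".", "the"]))   # ',' -> 2
print(vocab_from_file([",", ".", "the"]))   # ',' -> 0, colliding with <pad>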

@lkluo

lkluo commented Jan 17, 2018

I am using the latest version with TokenTextEncoder and external subword units, and have not run into the reported problem.
BTW, has anyone tested which is better for NMT: the internal wordpiece encoder or external subword units?

@colmantse

What do you mean by wordpiece and subword? Do you mean BPE and t2t's default?

@lkluo

lkluo commented Jan 17, 2018

@colmantse Yes, that's exactly what I meant. It seems the wordpiece encoder gives better control over the vocabulary size.

@colmantse

BPE or t2t's default? I find t2t's default better on en-zh.

@lkluo

lkluo commented Jan 17, 2018 via email

@colmantse

colmantse commented Jan 17, 2018 via email
