
TokenTextEncoder does not respect reserved tokens #365

Open
rsennrich opened this issue Oct 16, 2017 · 15 comments

@rsennrich

rsennrich commented Oct 16, 2017

I get poor results for translate_ende_wmt_bpe32k (this may be related to issues #317 and #309), and I believe this is due to the reserved tokens not being respected by TokenTextEncoder in combination with the provided vocabulary file.

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .
INFO:tensorflow:Inference results OUTPUT: In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In In
In In In In In

{'outputs': array([68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68,
68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68], dtype=int32), 'problem_choice': 0, 'inputs': array([[ 68],
[ 35],
[ 2196],
[ 0],
[ 2],
[ 651],
[ 55],
[18587],
[15840],
[ 2],
[ 1874],
[ 1763],
[ 260],
[ 1],
[ 1],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0]], dtype=int32)}

The first 10 lines of vocab.bpe.32000 look like this:

,
.
the
in
of
and
die
der
to
und

Note that "," and "." get assigned the indices 0 and 1, conflicting with the reserved tokens.

I started a new training run with a modified vocab:

pad
eos
,
.
the
in
of
and
die
der

and this gives better results (model hasn't converged yet):

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .
INFO:tensorflow:Inference results OUTPUT: In diesem Sinne wird die Maßnahmen in der Lage sein , das amerikanische System zu bekämpfen .

{'outputs': array([ 54, 11958, 11, 10408, 164, 231, 940, 92, 2802,
9, 12051, 7944, 3, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0], dtype=int32), 'problem_choice': 0, 'inputs': array([[24820],
[ 22],
[ 3467],
[ 8753],
[ 111],
[ 322],
[ 48],
[ 4],
[ 229],
[ 10],
[ 4995],
[14095],
[ 7298],
[ 3],
[ 1],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0],
[ 0]], dtype=int32)}

I'd suggest that TokenTextEncoder actually reserve the first two integer IDs for the reserved tokens and start counting from 2, but I realize this would break compatibility with already trained models. Alternatively, you could fix the vocabularies of the pre-defined problems - I don't know how many are affected.
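
Schematically, the suggested id assignment when loading a vocabulary file would look like this (a mock sketch of the indexing only, not the actual TokenTextEncoder code):

# Mock sketch of the suggested id assignment (not the actual
# TokenTextEncoder implementation): reserve ids 0 and 1 for
# <pad>/<EOS> and start numbering the file's tokens at 2.
RESERVED_TOKENS = ["<pad>", "<EOS>"]

def build_vocab(vocab_lines):
    token_to_id = {tok: i for i, tok in enumerate(RESERVED_TOKENS)}
    for offset, tok in enumerate(vocab_lines):
        token_to_id[tok] = len(RESERVED_TOKENS) + offset
    return token_to_id

# With the shipped vocab.bpe.32000, "," would then map to 2 instead of
# colliding with <pad> at id 0.
print(build_vocab([",", ".", "the", "in"]))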

@vince62s
Contributor

vince62s commented Oct 17, 2017

Are you using your regular BPE?
With the out-of-the-box subwords script, the vocab.endefr.32768 file is as follows (EDIT: a few entries were not displaying properly on GitHub and are omitted here):

'<pad>'
'<EOS>'
', '
'.
'
'the_'
'de_'
'in_'
'of_'

@mehmedes

They seem to be referring to regular BPE.
My BPE vocab also starts with "," and "." like Rico's, but I didn't experience any issues with not including pad and eos. I trained my BPE model back on T2T 1.0.11.

@rsennrich
Author

I'm using the problem translate_ende_wmt_bpe32k, which out-of-the-box downloads and uses vocab.bpe.32000 (in v1.2.4).

@mehmedes

mehmedes commented Oct 17, 2017

@rico: Just wondering how you managed to get a translation result in your second example using a BPE model. The input has only been tokenized, not preprocessed with BPE. When using a BPE model, I think T2T requires you to tokenize and apply BPE before submitting text for translation, and to postprocess the inference result by reverting BPE and detokenizing.
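
For reference, reverting BPE on decoded output usually amounts to removing the "@@ " continuation markers before detokenizing; a minimal sketch (the example string is hypothetical):

# Sketch: undo BPE segmentation on a decoded line by removing the
# "@@ " continuation markers; Moses detokenization would follow.
def revert_bpe(line):
    return line.replace("@@ ", "").replace("@@", "")

print(revert_bpe("In th@@ is sen@@ se , the measures will partially under@@ mine ..."))
# -> "In this sense , the measures will partially undermine ..."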

@lukaszkaiser
Contributor

Does this problem still exist in 1.2.5? We tried to correct the reserved tokens in the vocabs, but maybe there's still something missing?

@rsennrich
Author

@mehmedes: the preprocessed test sets are in /tmp/t2t_datagen (or wherever you downloaded wmt16_en_de.tar.gz), for instance newstest2013.tok.bpe.32000.en. From a quick look at the data, the files were (probably) preprocessed like this:

/path/to/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en | \
/path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en | \
/path/to/subword-nmt/apply_bpe.py -c bpe.32000

@lukaszkaiser: there's been no change to the vocabs or the handling of reserved tokens between 1.2.4 and 1.2.5; yes, the problem still persists.

@mehmedes

I was asking because your input showed a clean sentence:

INFO:tensorflow:Inference results INPUT: In this sense , the measures will partially undermine the American democratic system .

I would have expected the input to be more like

INFO:tensorflow:Inference results INPUT: In th@@ sen@@ etc. 

@rsennrich
Author

Coincidentally, all words in this sentence happen to be frequent in the training data, so BPE leaves them unsegmented.

@rsepassi
Contributor

The behavior of TokenTextEncoder with regard to reserved tokens depends on whether it is constructed from a file or from a list. If it is constructed from a file, the file is expected to already include the reserved tokens (i.e. the first line is <pad> and the second line is <EOS>), if there are any. If it is constructed from a list, the reserved tokens are added automatically. The reason is that if you initialize from a list and then call store_to_file, the resulting file will include the reserved tokens.
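
A rough illustration of that distinction (a mock of the id assignment only, not the real class):

# Mock sketch of the two construction paths described above
# (not the actual TokenTextEncoder implementation).
RESERVED_TOKENS = ["<pad>", "<EOS>"]

def vocab_from_list(tokens):
    # Reserved tokens are prepended automatically.
    return {tok: i for i, tok in enumerate(RESERVED_TOKENS + list(tokens))}

def vocab_from_file(lines):
    # The file is taken as-is, so it must already start with the
    # reserved tokens; otherwise ordinary tokens land on ids 0 and 1.
    return {tok: i for i, tok in enumerate(lines)}

print(vocab_from_list([",", ".", "the"]))   # ',' -> 2
print(vocab_from_file([",", ".", "the"]))   # ',' -> 0, colliding with <pad>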

@lkluo

lkluo commented Jan 17, 2018

I am using the latest version with TokenTextEncoder and external subword units, and have not run into the reported problem.
BTW, has anyone tested which is better for NMT: the internal wordpiece encoder or external subword units?

@colmantse

What do you mean by wordpiece and subword? Do you mean BPE and t2t's default?

@lkluo

lkluo commented Jan 17, 2018

@colmantse Yes, that's exactly what I meant. It seems the wordpiece encoder gives better control over the vocabulary size.

@colmantse

BPE or t2t's default? I find t2t's default better on en-zh.

@lkluo

lkluo commented Jan 17, 2018 via email

@colmantse

colmantse commented Jan 17, 2018 via email
