TokenTextEncoder does not respect reserved tokens #365
Comments
Are you using your regular BPE? '<pad>' |
They seem to refer to regular BPE. |
I'm using the problem translate_ende_wmt_bpe32k, which out-of-the-box downloads and uses vocab.bpe.32000 (in v1.2.4). |
@rico: Just wondering how you managed to get a translation result in your second example using a BPE model. The input has only been tokenized, not preprocessed with BPE. When using the BPE model, I think T2T requires you to tokenize as well as apply BPE before submitting text for translation, and to postprocess the inference result by reverting BPE and detokenizing. |
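For the postprocessing half of that round trip, a minimal sketch assuming the standard "@@ " separator used by subword-nmt and the sacremoses detokenizer (the output string is hypothetical):

```python
import re
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="de")

raw_output = "Ein Beispiel@@ satz ."              # hypothetical model output
unbpe = re.sub(r"(@@ )|(@@ ?$)", "", raw_output)  # undo the BPE joins
print(md.detokenize(unbpe.split()))               # -> "Ein Beispielsatz."
```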
Does this problem still exist in 1.2.5? We tried to correct the reserved tokens in the vocabs, but maybe there's still something missing? |
@mehmedes: the preprocessed test sets are in /tmp/t2t_datagen (or wherever you downloaded wmt16_en_de.tar.gz), for instance as newstest2013.tok.bpe.32000.en. From a quick look at the data, the files were (probably) preprocessed like this:
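(The original command listing is not shown above; the following is a hedged reconstruction, assuming the Moses tokenizer from sacremoses and the subword-nmt BPE codes file, here called bpe.32000, from the same archive.)

```python
from sacremoses import MosesTokenizer
from subword_nmt.apply_bpe import BPE

# Assumed inputs: the raw newstest2013.en and the bpe.32000 merge codes file.
mt = MosesTokenizer(lang="en")
with open("bpe.32000") as codes:
    bpe = BPE(codes)

with open("newstest2013.en") as src, open("newstest2013.tok.bpe.32000.en", "w") as out:
    for line in src:
        tokenized = mt.tokenize(line.strip(), return_str=True)  # Moses tokenization
        out.write(bpe.process_line(tokenized) + "\n")           # apply BPE segmentation
```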
@lukaszkaiser: there's been no change to the vocabs or the handling of reserved tokens between 1.2.4 and 1.2.5; yes, the problem still persists. |
I was asking because your input was a clean sentence.
I would have expected the input to look more like a BPE-segmented one.
|
Coincidentally, all words in this sentence happen to be frequent in the training data, so BPE leaves them unsegmented.
The behavior of |
I am using the latest version without the reported problem, since I adopted TokenTextEncoder with externally trained subword units. |
What do you mean by wordpiece and subword? Do you mean BPE and T2T's default? |
@colmantse Yes, that's exactly what I meant. It seems wordpiece has better control over vocabulary size. |
bpe or t2t's default? I find t2t's default better on en-zh. |
Good to know. Have you processed the Chinese sentences with word segmentation?
|
I didn't preprocess them. Judging by performance, it works just fine.
|
I get poor results for translate_ende_wmt_bpe32k (this may be related to issues #317, #309), and I believe this is due to reserved tokens not being respected by TokenTextEncoder, combined with the provided vocabulary file.
The first 10 lines of vocab.bpe.32000 look like this:
Note that "," and "." get assigned the indices 0 and 1, conflicting with the reserved tokens (<pad> is ID 0 and <EOS> is ID 1 in T2T).
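As a minimal illustration of the conflict (not the actual tensor2tensor implementation): if token IDs are taken directly from line numbers in the vocabulary file while IDs 0 and 1 are meant to be reserved, the first two vocabulary entries collide with the reserved tokens.

```python
RESERVED_TOKENS = ["<pad>", "<EOS>"]   # T2T reserves IDs 0 and 1 for these

# Buggy behaviour: token IDs taken straight from the vocab file's line numbers.
vocab_lines = [",", "."]               # the first two lines of vocab.bpe.32000
token_to_id = {tok: i for i, tok in enumerate(vocab_lines)}

assert token_to_id[","] == 0           # same ID as <pad>
assert token_to_id["."] == 1           # same ID as <EOS>
```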
I started a new training run with a modified vocab:
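(The modified vocabulary itself is not shown above; one plausible modification, consistent with the suggestion below, is to prepend the two reserved tokens so the real entries start at index 2. File names here are illustrative.)

```python
# Prepend the reserved tokens to the BPE vocabulary (illustrative file names).
with open("vocab.bpe.32000") as f:
    tokens = f.read().splitlines()

with open("vocab.bpe.32000.fixed", "w") as f:
    f.write("\n".join(["<pad>", "<EOS>"] + tokens) + "\n")
```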
and this gives better results (model hasn't converged yet):
I'd suggest that TokenTextEncoder actually reserve the first two integers for the reserved tokens and start counting from 2, but I realize this would break compatibility with already-trained models. Alternatively, you could fix the vocabularies of the pre-defined problems; I don't know how many are affected.
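A sketch of what that suggestion would look like (illustrative only, not the actual TokenTextEncoder code):

```python
RESERVED_TOKENS = ["<pad>", "<EOS>"]

def build_token_to_id(vocab_lines):
    # Reserve IDs 0 and 1 for <pad> and <EOS>, then count file entries from 2.
    mapping = {tok: i for i, tok in enumerate(RESERVED_TOKENS)}
    for i, tok in enumerate(vocab_lines):
        mapping[tok] = i + len(RESERVED_TOKENS)
    return mapping

assert build_token_to_id([",", "."])[","] == 2   # no longer collides with <pad>
```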