
Is there a way to know the exact BPE components each word is decomposed into? #408

Closed
anglil opened this issue Nov 9, 2017 · 5 comments

Comments

@anglil

anglil commented Nov 9, 2017

For example, if the word "learning" is decomposed into "le", "arn", "ing", is it possible to know so?

@colmantse

Do you mean the subword implementation or BPE? I am not aware of a BPE implementation in t2t. Normally, with BPE you would have a BPE model that segments words into wordpieces, so you can check by asking the BPE model to break the word down; it could also just stay as 'learning'.

@martinpopel
Contributor

There is a subword implementation in T2T, which is very similar to BPE. There is also a wrapper script text_encoder_build_subword.py, which can be used to train the subword vocabulary separately (either from a corpus or from a word vocabulary).
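
If you want to build the vocabulary from Python rather than via the script, something along these lines should work (a sketch only: corpus.txt, the 2**15 target size and the min/max count bounds are placeholders, and the whitespace tokenization below is a simplification of T2T's internal tokenizer):

from collections import Counter

from tensor2tensor.data_generators import text_encoder

# Count whitespace-separated tokens in a plain-text corpus
# ("corpus.txt" is just a placeholder path).
token_counts = Counter()
with open("corpus.txt") as f:
    for line in f:
        token_counts.update(line.strip().split())

# Build a subword vocabulary of roughly 32k subtokens; the last two
# arguments bound the binary search over the minimum token count.
vocab = text_encoder.SubwordTextEncoder.build_to_target_size(
    2**15, token_counts, 1, 1000)
vocab.store_to_file("my.vocab.32768")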

When training e.g. translate_ende_wmt32k, the subword vocabulary is stored in ~/t2t_data/vocab.ende.32768. It is a plain text file, and on line 4496 I can see 'learning_', which means that this word (followed by a space) is represented as a single subword in this case. However, if "learning" is followed not by a space but by a comma, there is no such entry in the vocabulary, so I guess it will have to be divided, e.g. as lear + ning + ,_ because the vocabulary contains the following relevant entries in this order: ,_ ing ni ng ar le rn ear lea in ea nin arn earn ning lear.

However, I am not aware of a simple way to check this, i.e. to apply a trained subword vocabulary to a given string.

Related questions: how can I find out how many subwords are in my training data, and how can I convert a number of training steps to a number of epochs (passes over the training data)?
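
A rough way to estimate both by hand (a sketch only, assuming the training text is still available as plain text; train.src, batch_size and train_steps below are placeholder values, and it assumes batch_size is measured in subwords per batch, as it is for the Transformer translation problems):

import os
from tensor2tensor.data_generators import text_encoder

vocab = text_encoder.SubwordTextEncoder(
    os.path.expanduser("~/t2t_data/vocab.ende.32768"))

# Count subwords on one side of the training corpus
# ("train.src" is a hypothetical plain-text file).
total_subwords = 0
with open("train.src") as f:
    for line in f:
        total_subwords += len(vocab.encode(line.strip()))

# With batch_size measured in subwords per batch, one epoch corresponds
# to roughly total_subwords / batch_size training steps.
batch_size = 4096       # assumed hparam value
train_steps = 250000    # assumed number of training steps
print("epochs ~= %.2f" % (float(train_steps) * batch_size / total_subwords))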

@twairball
Contributor

twairball commented Nov 13, 2017

You can test this by encoding the string with your vocab file:

from tensor2tensor.data_generators import text_encoder

vocab = text_encoder.SubwordTextEncoder(vocab_filepath)
encodings = vocab.encode("learning")  # list of subtoken ids
subwords = vocab._subtoken_ids_to_tokens(encodings)
print(subwords)
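
(Here vocab_filepath should point at the trained vocabulary file, e.g. the ~/t2t_data/vocab.ende.32768 mentioned above.)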

@martinpopel
Contributor

@twairball Thanks for the hint (I was sure it was possible, I just had not tried it yet).
Your code does not work as intended, because _subtoken_ids_to_tokens concatenates all the subwords without any delimiter.
We need to use e.g.

from tensor2tensor.data_generators import text_encoder
vocab = text_encoder.SubwordTextEncoder(vocab_filepath)
print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learnings")])

which prints [['lear'], ['ning'], ['s']].
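
If your version of text_encoder has SubwordTextEncoder.decode_list (an assumption, check your copy), the same split can be obtained more directly, yielding the raw subtoken strings including the trailing underscore marker:

print(vocab.decode_list(vocab.encode("learnings")))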

So I think @anglil can close this issue. I have opened another issue #415 for the number of epochs.

BTW: my guess about "learning" followed by a comma was wrong. The underscore in the vocabulary does not mean a space:

>>> print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learning,")])
[['learning'], [',']]
>>> print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learning ,")])
[['learning'], [' ,']]

@rsepassi
Contributor

Closing as these answers seem sufficient. Thank you @twairball, @martinpopel, and @colmantse!
