
Is there a way to know the exact BPE components each word is decomposed into? #408

Closed
anglil opened this issue Nov 9, 2017 · 5 comments

Comments

@anglil

anglil commented Nov 9, 2017

For example, if the word "learning" is decomposed into "le", "arn", "ing", is it possible to know so?

@colmantse

Do you mean the subword implementation or BPE? I am not aware of a BPE implementation in t2t. Normally, with BPE you would have a BPE model that segments words into wordpieces, so you can check by asking the BPE model to break the word down; it could also just stay as 'learning'.

@martinpopel
Contributor

There is a subword implementation in T2T, which is very similar to BPE. There is also a wrapper script text_encoder_build_subword.py, which can be used to train the subword vocabulary separately (either from a corpus or from a word vocabulary).
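
If you want to build the vocabulary from Python rather than via the script, something along these lines should work (a sketch only: corpus.txt, the 2**15 target size and the min/max count bounds are placeholders, and the whitespace tokenization below is a simplification of T2T's internal tokenizer):

from collections import Counter

from tensor2tensor.data_generators import text_encoder

# Count whitespace-separated tokens in a plain-text corpus
# ("corpus.txt" is just a placeholder path).
token_counts = Counter()
with open("corpus.txt") as f:
    for line in f:
        token_counts.update(line.strip().split())

# Build a subword vocabulary of roughly 32k subtokens; the last two
# arguments bound the binary search over the minimum token count.
vocab = text_encoder.SubwordTextEncoder.build_to_target_size(
    2**15, token_counts, 1, 1000)
vocab.store_to_file("my.vocab.32768")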

When training e.g. translate_ende_wmt32k, the subword vocabulary is stored in ~/t2t_data/vocab.ende.32768. It is a plain text file, and on line 4496 I can see 'learning_', which means that this word (followed by a space) is represented as a single subword in this case. However, if "learning" is followed not by a space but by a comma, there is no such entry in the vocabulary, so I guess it will have to be divided, e.g. as lear + ning + ,_ because the vocabulary contains the following relevant entries in this order: ,_ ing ni ng ar le rn ear lea in ea nin arn earn ning lear.

However, I am not aware of a simple way to check this, i.e. to apply a trained subword vocabulary to a given string.

Related questions: how can I find out how many subwords are in my training data, and how can I convert a number of training steps to a number of epochs (passes over the training data)?
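
A rough way to estimate both by hand (a sketch only, assuming the training text is still available as plain text; train.src, batch_size and train_steps below are placeholder values, and it assumes batch_size is measured in subwords per batch, as it is for the Transformer translation problems):

import os
from tensor2tensor.data_generators import text_encoder

vocab = text_encoder.SubwordTextEncoder(
    os.path.expanduser("~/t2t_data/vocab.ende.32768"))

# Count subwords on one side of the training corpus
# ("train.src" is a hypothetical plain-text file).
total_subwords = 0
with open("train.src") as f:
    for line in f:
        total_subwords += len(vocab.encode(line.strip()))

# With batch_size measured in subwords per batch, one epoch corresponds
# to roughly total_subwords / batch_size training steps.
batch_size = 4096       # assumed hparam value
train_steps = 250000    # assumed number of training steps
print("epochs ~= %.2f" % (float(train_steps) * batch_size / total_subwords))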

@twairball
Contributor

twairball commented Nov 13, 2017

You can test this by encoding the string with your vocab file:

from tensor2tensor.data_generators import text_encoder

vocab = text_encoder.SubwordTextEncoder(vocab_filepath)
encodings = vocab.encode("learning")  # list of subtoken ids
subwords = vocab._subtoken_ids_to_tokens(encodings)
print(subwords)
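
(Here vocab_filepath should point at the trained vocabulary file, e.g. the ~/t2t_data/vocab.ende.32768 mentioned above.)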

@martinpopel
Contributor

@twairball Thanks for the hint (I was sure it was possible, I just had not tried it yet).
Your code does not work as intended, because _subtoken_ids_to_tokens concatenates all the subwords without any delimiter.
We need to use e.g.

from tensor2tensor.data_generators import text_encoder
vocab = text_encoder.SubwordTextEncoder(vocab_filepath)
print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learnings")])

which prints [['lear'], ['ning'], ['s']].
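
If your version of text_encoder has SubwordTextEncoder.decode_list (an assumption, check your copy), the same split can be obtained more directly, yielding the raw subtoken strings including the trailing underscore marker:

print(vocab.decode_list(vocab.encode("learnings")))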

So I think @anglil can close this issue. I have opened another issue #415 for the number of epochs.

BTW: my guess about "learning" followed by a comma was wrong. The underscore in the vocabulary does not mean a space:

>>> print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learning,")])
[['learning'], [',']]
>>> print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learning ,")])
[['learning'], [' ,']]

@rsepassi
Contributor

Closing as these answers seem sufficient. Thank you @twairball, @martinpopel, and @colmantse!
