Is there a way to know the exact BPE components each word is decomposed into? #408
Comments
Do you mean the subword implementation or BPE? I am not aware of a BPE implementation in T2T. Normally, with BPE you have a BPE model that segments words into wordpieces, so you can check by asking the model to break a word down; "learning" would be one such example.
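To illustrate the idea being described (this is a toy sketch, not the T2T or BPE implementation, and the vocabulary below is made up), applying a trained wordpiece-style vocabulary is typically a greedy longest-match segmentation:

```python
def segment(word, vocab):
    """Greedy longest-match segmentation of `word` into pieces from `vocab`."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, backing off one char at a time.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matches: fall back to a single (unknown) character.
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical subword vocabulary, for illustration only.
vocab = {"learn", "learning", "ing", "s", ","}
print(segment("learnings", vocab))   # ['learning', 's']
print(segment("learning,", vocab))   # ['learning', ',']
```

Real BPE applies learned merge operations rather than a substring lookup, but the effect — deterministically decomposing a word into known pieces — is the same.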
There is a subword implementation in T2T, which is very similar to BPE. There is also a wrapper script, text_encoder_build_subword.py, which can be used to train the subword vocabulary separately (either from a corpus or from a word vocabulary). However, I am not aware of a simple way to check this, i.e. to apply a trained subword vocab to a given string. Related questions: how can I find out how many subwords are in my training data in T2T? And how can I convert a number of training steps into a number of epochs (passes over the training data)?
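On those two related questions: once text can be segmented into subwords, counting them over the corpus is just a sum, and epochs can be estimated from steps if the batch size is measured in subwords (which, if I recall correctly, is T2T's default convention — treat that as an assumption). A rough sketch, using a whitespace "encoder" as a stand-in for a real subword encoder:

```python
def count_subwords(lines, encode):
    """Total subword tokens across an iterable of lines, given any
    `encode` function mapping a string to a list of tokens."""
    return sum(len(encode(line)) for line in lines)

def steps_to_epochs(train_steps, batch_size_in_subwords, total_subwords):
    """Estimate passes over the data from the number of training steps."""
    return train_steps * batch_size_in_subwords / total_subwords

# Toy corpus; str.split stands in for SubwordTextEncoder.encode.
corpus = ["machine learning is fun", "subword units help"]
total = count_subwords(corpus, str.split)   # 4 + 3 = 7 tokens
print(total)                                                # 7
print(steps_to_epochs(train_steps=100,
                      batch_size_in_subwords=14,
                      total_subwords=total))                # 200.0
```

With a trained SubwordTextEncoder you would pass its `encode` method instead of `str.split`.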
You can test this by encoding the string with your vocab file.
@twairball Thanks for the hint (I was sure it is possible, but I just had not tried it yet).

```python
from tensor2tensor.data_generators import text_encoder

vocab = text_encoder.SubwordTextEncoder(vocab_filepath)
print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learnings")])
```

which prints the subword decomposition. So I think @anglil can close this issue. I have opened another issue, #415, for the number of epochs.

BTW: my earlier guess about "learnings" followed by a comma was wrong. The underscore does not mean a space:

```python
>>> print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learning,")])
[['learning'], [',']]
>>> print([vocab._subtoken_ids_to_tokens([x]) for x in vocab.encode("learning ,")])
[['learning'], [' ,']]
```
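Since the pattern above yields one singleton list per subtoken id, a small helper (hypothetical name, using the same private `_subtoken_ids_to_tokens` API) can flatten the result into a plain list of subword strings. A stand-in object is used here only so the sketch runs without tensor2tensor installed:

```python
def decompose(vocab, text):
    """Return the subword strings `text` encodes to, assuming a
    SubwordTextEncoder-like object with encode() and _subtoken_ids_to_tokens()."""
    return [tok for i in vocab.encode(text)
            for tok in vocab._subtoken_ids_to_tokens([i])]

# Minimal fake encoder mimicking the interface, for demonstration only.
class FakeVocab:
    table = {1: "learning", 2: ","}
    def encode(self, text):
        return [1, 2] if text == "learning," else []
    def _subtoken_ids_to_tokens(self, ids):
        return [self.table[i] for i in ids]

print(decompose(FakeVocab(), "learning,"))   # ['learning', ',']
```

Note that `_subtoken_ids_to_tokens` is a private method, so it may change between T2T versions.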
Closing as these answers seem sufficient. Thank you @twairball, @martinpopel, and @colmantse!
The original question: for example, if the word "learning" is decomposed into "le", "arn", "ing", is it possible to find that out?