[bug] BPE roberta-large-mnli saved with .save_pretrained() incorrectly sets byte_fallback to false (should be true) #1234
Comments
As I come across more models which have this problem, I'll update the following list:
Most if not all RoBERTa-based models and GPT-2-based tokenizer models should be considered byte-level. The BPE byte_fallback option was implemented to copy the behavior of sentencepiece, where there is no byte-level pre-tokenization stage and the tokenizer "falls back" to unicode bytes when it encounters characters it cannot otherwise tokenize. (Disclaimer: this is an explanation of my understanding of the situation after closely observing the development of both sentencepiece and huggingface tokenizers. I am currently not a developer of either.)
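To make the distinction concrete, here is a minimal sketch, assuming the tokenizers Python package and network access to the Hub (attribute names are from my reading of the library and may vary by version):

```python
from tokenizers import Tokenizer

# GPT-2 / RoBERTa style tokenizers use a ByteLevel pre-tokenizer: every input
# byte is mapped to a printable unicode symbol *before* BPE runs, so no
# character can ever be unknown and the BPE model's byte_fallback is not needed.
gpt2 = Tokenizer.from_pretrained("gpt2")
print(gpt2.pre_tokenizer)            # expected: a ByteLevel pre-tokenizer
print(gpt2.encode("日本語").tokens)   # non-ASCII text still tokenizes, via byte-level symbols

# sentencepiece-style BPE tokenizers have no byte-level pre-tokenization stage;
# there, byte_fallback = true is what rescues characters missing from the vocab.
```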
Very good explanation.
Since you didn't provide explanations as to why you think it should be false, I'll simply say that this is correct, without further explanation for now (sorry, my time is limited).
Thanks for the response 👍 I must have been mistaken about the purpose of byte_fallback. If possible, could you please explain where in the BPE code it does this mapping? I'm much more familiar with Python, so I will reference the slow tokenization code:
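My understanding is that the relevant piece in the slow (Python) GPT-2/RoBERTa tokenizers is a byte-to-unicode table; a rough sketch of it, paraphrased from memory of transformers' bytes_to_unicode helper (the actual source may differ in its details):

```python
def bytes_to_unicode():
    """Map every byte value 0-255 to a printable unicode character.

    Printable Latin-1 bytes map to themselves; the remaining bytes are shifted
    to codepoints starting at U+0100 so that none of them ends up as whitespace
    or a control character.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

# Because this table covers all 256 byte values, byte-level BPE never meets an
# unknown character, which is why byte_fallback is unnecessary for GPT-2/RoBERTa.
```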
I would greatly appreciate it if you could point out any part of the tokenizer.json file where I can make this distinction. Here are the relevant parts of their configs for reference: GPT2:
Nllb:
There are differences (e.g., …).
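One way to look for that distinction is to inspect tokenizer.json directly; a sketch, assuming the usual layout with "model", "pre_tokenizer" and "decoder" sections (the local paths are hypothetical):

```python
import json

def describe(path):
    """Print the fields that usually distinguish byte-level BPE (GPT-2/RoBERTa)
    from sentencepiece-style BPE (e.g. NLLB)."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    model = cfg.get("model", {})
    print("model type    :", model.get("type"))
    print("byte_fallback :", model.get("byte_fallback"))
    print("pre_tokenizer :", (cfg.get("pre_tokenizer") or {}).get("type"))  # "ByteLevel" for GPT-2/RoBERTa
    print("decoder       :", (cfg.get("decoder") or {}).get("type"))

describe("gpt2/tokenizer.json")  # hypothetical local copies of the two configs
describe("nllb/tokenizer.json")
```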
The nllb tokenizer is BPE, but sentencepiece-based.
Some options, such as the continuing-word suffix, were used by the word-level tokenizer. You need to chain decoders to properly decode it, but usually that is not what you are looking for. One way to approach this is to look at a model that you'd like to emulate and follow their lead.
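For the decoder-chaining part, a rough sketch (assuming the Sequence, ByteFallback and Fuse decoder classes present in recent tokenizers versions; this mirrors what sentencepiece-style configs tend to declare, and is not something a byte-level model needs):

```python
from tokenizers import decoders

# Chain decoders so that byte tokens such as <0xE6> are merged back into real
# characters on decode; this only matters when the model uses byte_fallback.
chained = decoders.Sequence([
    decoders.ByteFallback(),  # turn <0xNN> byte tokens back into bytes/characters
    decoders.Fuse(),          # fuse the resulting pieces into a single string
])
# tokenizer.decoder = chained
```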
As it is, it is often ambiguous, unfortunately. It is difficult to address from this codebase alone, since the transformers codebase also deals with it through its AutoTokenizer class. Even if it is fixed in this codebase, there are legacy tokenizers.
Ohhh! Thanks for pointing that out @chris-ha458! This solved my issue :)
For more information: huggingface/tokenizers#1234 (comment)
Environment:
Reproduction:
Will output a tokenizer.json file containing:
However, as seen in the slow tokenizer here, it should be true.
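A minimal sketch of this kind of check (assuming transformers' AutoTokenizer loads the fast tokenizer, so that tokenizer.json is written, and a writable local directory):

```python
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
tok.save_pretrained("./roberta-large-mnli-saved")

with open("./roberta-large-mnli-saved/tokenizer.json", encoding="utf-8") as f:
    cfg = json.load(f)

# The report is that this prints False, while the issue author expects True.
print(cfg["model"]["byte_fallback"])
```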
byte_fallback was introduced by #1183 (@Narsil), so it probably has something to do with how the default value is set (which is false). Quite a few models use BPE with byte_fallback set to true by default. I can grab some more examples if needed.