
[bug] BPE roberta-large-mnli saved with .save_pretrained() incorrectly sets byte_fallback to false (should be true) #1234

Closed
xenova opened this issue Apr 28, 2023 · 8 comments

Comments

xenova commented Apr 28, 2023

Environment:

  • tokenizers: 0.13.3
  • transformers: 4.28.1
  • OS: Breaks on both Linux and Windows

Reproduction:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-large-mnli')
tokenizer.save_pretrained('test')

Will output a tokenizer.json file containing:

  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "byte_fallback": false, // <---- Should be true
  }

However, as seen in the slow tokenizer here, it should be true.

byte_fallback was introduced by #1183 (@Narsil), so it probably has something to do with how the default value is set (which is false).

Quite a few models use BPE with byte_fallback set to true by default. I can grab some more examples if needed.
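For reference, the flag can be checked programmatically in the saved output (a minimal sketch, assuming the 'test' directory from the snippet above):

import json

# Inspect the model section of the tokenizer.json written by save_pretrained()
with open("test/tokenizer.json", encoding="utf-8") as f:
    config = json.load(f)

print(config["model"]["type"])           # "BPE"
print(config["model"]["byte_fallback"])  # False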

xenova (Author) commented Apr 28, 2023

As I come across more models which have this problem, I'll update the following list:

chris-ha458 (Contributor) commented Apr 30, 2023

Most, if not all, RoBERTa-based and GPT-2-based tokenizers should be considered BPE tokenizers that use the ByteLevel() pre-tokenizer.
This is different from (Unicode) character-level BPE tokenizers, which use byte_fallback within the BPE algorithm itself to handle <unk> cases.
Theoretically and practically, the ByteLevel pre-tokenizer prevents such cases by covering all possible byte values, so there is no <unk> for the BPE algorithm to fall back from.

The BPE byte_fallback option was implemented to copy the behavior of sentencepiece, where there is no byte-level pre-tokenization stage and the tokenizer "falls back" to byte tokens when it would otherwise emit <unk>. LLaMA used sentencepiece with BPE, which required Hugging Face to implement byte_fallback for BPE; this had not been necessary before (nor for Unigram). But since it has been implemented, it is also being exposed for Unigram too.
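To make the distinction concrete, here is a rough sketch with the tokenizers Python bindings (my own illustration; it assumes tokenizers >= 0.13.3, where the byte_fallback argument is exposed on the BPE model):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel

# GPT-2 / RoBERTa style: the ByteLevel() pre-tokenizer first maps every input
# byte to a printable unicode character, so a 256-symbol base alphabet always
# covers the input and no <unk> can be produced -> byte_fallback stays false.
byte_level_bpe = Tokenizer(BPE(unk_token=None, byte_fallback=False))
byte_level_bpe.pre_tokenizer = ByteLevel(add_prefix_space=False)

# sentencepiece-style char-level BPE: no byte-level pre-tokenization, so an
# unseen character would normally become <unk>; byte_fallback=True instead
# splits it into <0xXX> byte tokens (which must be present in the vocab).
char_level_bpe = Tokenizer(BPE(unk_token="<unk>", byte_fallback=True))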

(Disclaimer: this is my understanding of the situation after closely observing the development of both sentencepiece and huggingface tokenizers. I am currently not a developer of either.)

Some references
#1218
#1217

Narsil (Collaborator) commented May 1, 2023

Very good explanation.

byte_fallback is indeed correctly set to false on all of these.
@xenova, whenever you think something like that is incorrect, please explain why you think the values are wrong.

Since you didn't provide an explanation as to why you think it should be true, I'll simply say that this is correct without further explanation for now (sorry, my time is limited).

Narsil closed this as completed May 1, 2023
xenova (Author) commented May 1, 2023

Thanks for the response 👍

I must have been mistaken about the purpose of byte_fallback, as I thought it also controlled whether to perform the byte-to-unicode mapping trick that keeps control characters out of the BPE vocabulary.

If possible, could you please explain where in the BPE code it does this mapping? I'm much more familiar with Python, so I will reference the slow tokenization code:


I would also greatly appreciate it if you could point out any part of the tokenizer.json file where I can make this distinction.

Here are the relevant parts of their configs for reference:

GPT2:

    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "byte_fallback": false,

Nllb:

    "dropout": null,
    "unk_token": "<unk>",
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": true,
    "byte_fallback": false,

There are differences (e.g., continuing_subword_prefix), but I'm not too sure which to use.

chris-ha458 (Contributor) commented:

The NLLB tokenizer is BPE, but sentencepiece-based.
It is useful to understand which framework is being used and what kind of pre-processing and post-processing is done.
In this codebase in particular, the GPT-like byte mapping is done in the ByteLevel() pre-tokenizer:
https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/pre_tokenizers/byte_level.rs
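For reference, this is roughly what that mapping looks like in Python; it mirrors bytes_to_unicode from the slow GPT-2 tokenizer in transformers:

def bytes_to_unicode():
    """Map every byte (0-255) to a printable unicode character, so the BPE
    vocabulary never has to contain raw control or whitespace bytes."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes outside the printable ranges get shifted to unused code points.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

Because every possible byte already corresponds to a token after this mapping, there is nothing for byte_fallback to do in these tokenizers.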

chris-ha458 (Contributor) commented:

Some options, such as the continuing subword prefix, were used by the word-level tokenizer. You need to chain decoders to properly decode it, but usually it is not what you are looking for.

One way to approach this is to look at a model that you'd like to emulate and follow their lead.
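For example, something like this (a sketch; it assumes access to the Hub and the tokenizers Python bindings):

from tokenizers import Tokenizer

# Load the tokenizer.json of a model you want to emulate and inspect its pipeline.
reference = Tokenizer.from_pretrained("gpt2")
print(type(reference.model).__name__)          # e.g. "BPE"
print(type(reference.pre_tokenizer).__name__)  # e.g. "ByteLevel"
print(type(reference.decoder).__name__)        # e.g. "ByteLevel"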

chris-ha458 (Contributor) commented:

> I would greatly appreciate if you could point out any part in the tokenizer.json file where I can make this distinction.

As it is, it is unfortunately often ambiguous. It is difficult to address from this codebase alone, since the transformers codebase also deals with it through its AutoTokenizer class. Even if it is fixed in this codebase, there are legacy tokenizers.
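One rough heuristic (my own, not an official rule) is to look at both the pre_tokenizer and the model sections of tokenizer.json:

import json

def classify_bpe(tokenizer_json_path):
    """Rough heuristic: byte-level BPE (GPT-2/RoBERTa style) vs sentencepiece-style BPE."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        cfg = json.load(f)

    pre = cfg.get("pre_tokenizer") or {}
    # A pre-tokenizer may be a single entry or a Sequence wrapping several entries.
    entries = pre.get("pretokenizers", [pre]) if pre else []
    uses_byte_level = any(p.get("type") == "ByteLevel" for p in entries)
    byte_fallback = cfg.get("model", {}).get("byte_fallback", False)

    if uses_byte_level:
        return "byte-level BPE (GPT-2/RoBERTa style, no <unk> possible)"
    if byte_fallback:
        return "char-level BPE with byte fallback (sentencepiece/LLaMA style)"
    return "char-level BPE without byte fallback (e.g. NLLB)"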

xenova (Author) commented May 1, 2023

Ohhh! Thanks for pointing that out @chris-ha458!

This solved my issue :)

xenova added a commit to huggingface/transformers.js that referenced this issue May 1, 2023