[bug] BPE roberta-large-mnli saved with .save_pretrained() incorrectly sets byte_fallback to false (should be true) #1234
Comments
As I come across more models which have this problem, I'll update the following list:
Most if not all RoBERTa-based models and GPT-2-based tokenizer models should be considered byte-level. The BPE byte_fallback option was implemented to copy the behavior of sentencepiece, where there is no byte-level pre-tokenization stage and the tokenizer "falls back" to unicode bytes when it encounters characters it cannot otherwise tokenize. (Disclaimer: this is an explanation of my understanding of the situation after closely observing the development of both sentencepiece and huggingface tokenizers. I am currently not a developer of either.)
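To make the distinction concrete, here is a minimal sketch, assuming the tokenizers Python package and network access to the Hub (attribute names are from my reading of the library and may vary by version):

```python
from tokenizers import Tokenizer

# GPT-2 / RoBERTa style tokenizers use a ByteLevel pre-tokenizer: every input
# byte is mapped to a printable unicode symbol *before* BPE runs, so no
# character can ever be unknown and the BPE model's byte_fallback is not needed.
gpt2 = Tokenizer.from_pretrained("gpt2")
print(gpt2.pre_tokenizer)            # expected: a ByteLevel pre-tokenizer
print(gpt2.encode("日本語").tokens)   # non-ASCII text still tokenizes, via byte-level symbols

# sentencepiece-style BPE tokenizers have no byte-level pre-tokenization stage;
# there, byte_fallback = true is what rescues characters missing from the vocab.
```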
Very good explanation.
Since you didn't provide explanations as to why you think it should be false, I'll simply say that this is correct, without further explanation for now (sorry, my time is limited).
Thanks for the response 👍 I must have been mistaken about the purpose of byte_fallback. If possible, could you please explain where in the BPE code it does this mapping? I'm much more familiar with Python, so I will reference the slow tokenization code:
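My understanding is that the relevant piece in the slow (Python) GPT-2/RoBERTa tokenizers is a byte-to-unicode table; a rough sketch of it, paraphrased from memory of transformers' bytes_to_unicode helper (the actual source may differ in its details):

```python
def bytes_to_unicode():
    """Map every byte value 0-255 to a printable unicode character.

    Printable Latin-1 bytes map to themselves; the remaining bytes are shifted
    to codepoints starting at U+0100 so that none of them ends up as whitespace
    or a control character.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

# Because this table covers all 256 byte values, byte-level BPE never meets an
# unknown character, which is why byte_fallback is unnecessary for GPT-2/RoBERTa.
```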
I would greatly appreciate it if you could point out any part of the tokenizer.json file where I can make this distinction. Here are the relevant parts of their configs for reference: GPT2:
Nllb:
There are differences (e.g., …).
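One way to look for that distinction is to inspect tokenizer.json directly; a sketch, assuming the usual layout with "model", "pre_tokenizer" and "decoder" sections (the local paths are hypothetical):

```python
import json

def describe(path):
    """Print the fields that usually distinguish byte-level BPE (GPT-2/RoBERTa)
    from sentencepiece-style BPE (e.g. NLLB)."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    model = cfg.get("model", {})
    print("model type    :", model.get("type"))
    print("byte_fallback :", model.get("byte_fallback"))
    print("pre_tokenizer :", (cfg.get("pre_tokenizer") or {}).get("type"))  # "ByteLevel" for GPT-2/RoBERTa
    print("decoder       :", (cfg.get("decoder") or {}).get("type"))

describe("gpt2/tokenizer.json")  # hypothetical local copies of the two configs
describe("nllb/tokenizer.json")
```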
The nllb tokenizer is BPE, but sentencepiece-based.
Some options, such as the continuing-word suffix, were used by the word-level tokenizer. You need to chain decoders to properly decode it, but usually that is not what you are looking for. One way to approach this is to look at a model that you'd like to emulate and follow their lead.
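For the decoder-chaining part, a rough sketch (assuming the Sequence, ByteFallback and Fuse decoder classes present in recent tokenizers versions; this mirrors what sentencepiece-style configs tend to declare, and is not something a byte-level model needs):

```python
from tokenizers import decoders

# Chain decoders so that byte tokens such as <0xE6> are merged back into real
# characters on decode; this only matters when the model uses byte_fallback.
chained = decoders.Sequence([
    decoders.ByteFallback(),  # turn <0xNN> byte tokens back into bytes/characters
    decoders.Fuse(),          # fuse the resulting pieces into a single string
])
# tokenizer.decoder = chained
```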
As it is, it is often ambiguous, unfortunately. It is difficult to address from this codebase alone, since the transformers codebase also deals with it through its AutoTokenizer class. Even if it is fixed in this codebase, there are legacy tokenizers.
Ohhh! Thanks for pointing that out @chris-ha458! This solved my issue :)
For more information: huggingface/tokenizers#1234 (comment)
Environment:
Reproduction:
Will output a tokenizer.json file containing:
However, as seen in the slow tokenizer here, it should be true.
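A minimal sketch of this kind of check (assuming transformers' AutoTokenizer loads the fast tokenizer, so that tokenizer.json is written, and a writable local directory):

```python
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
tok.save_pretrained("./roberta-large-mnli-saved")

with open("./roberta-large-mnli-saved/tokenizer.json", encoding="utf-8") as f:
    cfg = json.load(f)

# The report is that this prints False, while the issue author expects True.
print(cfg["model"]["byte_fallback"])
```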
byte_fallback was introduced by #1183 (@Narsil), so it probably has something to do with how the default value is set (which is false). Quite a few models use BPE with byte_fallback set to true by default. I can grab some more examples if needed.