Do tokenizers have to be slow? #35

Closed
LoicGrobol opened this issue Nov 2, 2023 · 3 comments

Comments


LoicGrobol commented Nov 2, 2023

Currently tokenizers are forcibly loaded as slow, but that's not always possible: for instance, tokenizers saved from RobertaTokenizerFast don't provide a slow version (since that would require generating a sentencepiece model from the tokenizers one, which is currently not supported). As a result, trying to use one of these models (e.g. lgrobol/xlm-r+CreoleEval_all) currently fails.

As a temporary fix I have tried removing use_fast=False from the dataset collection code, and everything seems to work fine. Am I missing something? Is there still a reason for forcing the use of a slow tokenizer, or can this be removed?
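For reference, a minimal reproduction of the failure (a sketch; the exact exception raised depends on the transformers version):

```python
from transformers import AutoTokenizer

model = "lgrobol/xlm-r+CreoleEval_all"

# The fast tokenizer loads fine: the checkpoint ships a tokenizers-format
# tokenizer.json.
tokenizer = AutoTokenizer.from_pretrained(model)

# Forcing the slow tokenizer fails, since no sentencepiece model can be
# built from the tokenizers file alone.
tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
```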

LoicGrobol changed the title from "Do tokenizers have to be slow" to "Do tokenizers have to be slow?" Nov 2, 2023
robvanderg (Contributor) commented

Hmm, I'd rather have a check in place and use the slow one when it is available. I tried tokenizing the whole of UD with mBERT fast and slow a couple of years back, and the fast one led to many more character differences. One of the strangest examples being:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
tokenizer.decode(tokenizer.encode("do not"))
# "[CLS] don't [SEP]"
```

robvanderg (Contributor) commented

OK, I've surrounded it in a try-except block for now; that seems like a robust solution.
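Roughly, the fallback would look like this (a minimal sketch assuming a broad try-except around the loading call; the exact code and exception types in the fix may differ):

```python
from transformers import AutoTokenizer

def load_tokenizer(name: str):
    # Prefer the slow tokenizer, which matched reference tokenizations more
    # closely in the mBERT comparison above.
    try:
        return AutoTokenizer.from_pretrained(name, use_fast=False)
    except Exception:
        # Some checkpoints (e.g. ones saved from RobertaTokenizerFast) ship
        # only a tokenizers-format file, so no slow tokenizer can be built;
        # fall back to the fast one.
        return AutoTokenizer.from_pretrained(name)
```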

LoicGrobol (Author) commented

Wow, this is so bizarre, I had no idea. Anyway, thank you for the quick fix!
