Do tokenizers have to be slow? #35

Closed
LoicGrobol opened this issue Nov 2, 2023 · 3 comments

Comments


LoicGrobol commented Nov 2, 2023

Currently tokenizers are forcibly loaded as slow, but that's not always possible: for instance, tokenizers saved from RobertaTokenizerFast don't provide a slow version (since that would require generating a sentencepiece model from the tokenizers one, which is currently not supported). As a result, trying to use one of these models (e.g. lgrobol/xlm-r+CreoleEval_all) currently fails.

As a temporary fix I have tried removing use_fast=False from the dataset collection code, and everything seems to work fine. Am I missing something? Is there still a reason for forcing the use of a slow tokenizer, or can this be removed?
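For reference, a minimal reproduction of the failure (a sketch; the exact exception raised depends on the transformers version):

```python
from transformers import AutoTokenizer

model = "lgrobol/xlm-r+CreoleEval_all"

# The fast tokenizer loads fine: the checkpoint ships a tokenizers-format
# tokenizer.json.
tokenizer = AutoTokenizer.from_pretrained(model)

# Forcing the slow tokenizer fails, since no sentencepiece model can be
# built from the tokenizers file alone.
tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
```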

LoicGrobol changed the title from "Do tokenizers have to be slow" to "Do tokenizers have to be slow?" Nov 2, 2023
robvanderg (Contributor) commented

Hmm, I'd rather have a check in place and use the slow one when it is available. I tried tokenizing the whole of UD with mBERT fast and slow a couple of years back, and the fast one led to many more character differences. One of the strangest examples being:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
tokenizer.decode(tokenizer.encode("do not"))
# "[CLS] don't [SEP]"
```

robvanderg (Contributor) commented

OK, I've surrounded it in a try-except block for now; that seems like a robust solution.
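Roughly, the fallback would look like this (a minimal sketch assuming a broad try-except around the loading call; the exact code and exception types in the fix may differ):

```python
from transformers import AutoTokenizer

def load_tokenizer(name: str):
    # Prefer the slow tokenizer, which matched reference tokenizations more
    # closely in the mBERT comparison above.
    try:
        return AutoTokenizer.from_pretrained(name, use_fast=False)
    except Exception:
        # Some checkpoints (e.g. ones saved from RobertaTokenizerFast) ship
        # only a tokenizers-format file, so no slow tokenizer can be built;
        # fall back to the fast one.
        return AutoTokenizer.from_pretrained(name)
```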

LoicGrobol (Author) commented

Wow, this is so bizarre, I had no idea. Anyway, thank you for the quick fix!
