Currently tokenizers are forcibly loaded as slow, but that's not always possible: for instance, tokenizers saved from `RobertaTokenizerFast` don't provide a slow version (since that would require generating a sentencepiece model from the `tokenizers` one, something that's currently not supported). As a result, trying to use one of these models (e.g. `lgrobol/xlm-r+CreoleEval_all`) currently fails.
As a temporary fix I have tried removing `use_fast=False` from the dataset collection code and everything seems to work fine. Am I missing something? Is there still a reason for forcing the use of a slow tokenizer, or can this be removed?
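For reference, here is a minimal reproduction of the failure described above, assuming the tokenizer is loaded through `transformers`' `AutoTokenizer` (which exception is raised exactly depends on the `transformers` version):

```python
from transformers import AutoTokenizer

# Loading the fast tokenizer works: the checkpoint ships a tokenizers-format file.
tok_fast = AutoTokenizer.from_pretrained("lgrobol/xlm-r+CreoleEval_all")

# Forcing a slow tokenizer fails for this checkpoint: there is no sentencepiece
# model to build it from, and converting back from the fast tokenizer is not
# supported, so this call raises an exception.
tok_slow = AutoTokenizer.from_pretrained(
    "lgrobol/xlm-r+CreoleEval_all", use_fast=False
)
```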
LoicGrobol changed the title **Do tokenizers have to be slow** → **Do tokenizers have to be slow?** on Nov 2, 2023
Hmm, I'd rather have a check in place and use the slow one when it is available. I tried tokenizing the whole of UD with mBERT fast and slow a couple of years back, and the fast one led to many more character differences, some of them quite strange.
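A minimal sketch of what such a check could look like, assuming the dataset collection code loads tokenizers through `AutoTokenizer.from_pretrained` (the helper name and the exact exceptions caught here are assumptions, not the actual code):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerBase


def load_tokenizer(model_name: str) -> PreTrainedTokenizerBase:
    """Prefer the slow tokenizer, falling back to the fast one when no slow
    version can be built (e.g. fast-only checkpoints like
    lgrobol/xlm-r+CreoleEval_all)."""
    try:
        return AutoTokenizer.from_pretrained(model_name, use_fast=False)
    except Exception:
        # No slow tokenizer can be instantiated for this checkpoint:
        # fall back to the fast implementation.
        return AutoTokenizer.from_pretrained(model_name, use_fast=True)
```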