BERT wordpiece tokenizer differs from official HF implementation #5496
Comments
@cebtenzzre This looks like a duplicate of this issue: #3502. There is a PR here: #4868
So, are BERT-based models supported now?
We likely need to move all the tokenization-related code from llama.cpp to a separate file. Otherwise, llama.cpp will become too messy.
Possibly related, but keep in mind that BERT uses an entirely separate tokenizer implementation (WordPiece, "WPM") from all other models (SentencePiece, "SPM", or GPT-2-style "BPE").
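For context, WordPiece splits each whitespace-delimited word by greedy longest-match-first lookup against the vocabulary, marking non-initial pieces with a `##` prefix. Here is a minimal Python sketch of that standard algorithm (the vocabulary below is a toy example, not the real BERT vocab):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first: at each position, take the longest
    # substring found in the vocab; pieces after the first get "##".
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # if any span is unmatched, the whole word becomes UNK
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Note how this differs from BPE, which builds tokens bottom-up by applying learned merge rules rather than matching longest vocabulary entries top-down.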
Is the SPM preprocessor also replacing accented characters? Seems like we should be able to reuse bits from that. Btw, in case it's useful for folks, I made a little Python function that prints out a color-coded token diff between our results and those from Huggingface: https://gist.github.com/iamlemec/52eaa4961762efb9c064b871a67f6cc6. The biggest source of mismatches I'm finding there is dash variants like the em dash. But basically it's still a case of replacing certain complex characters with their base forms.
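This is not the contents of that gist, but a minimal sketch of the same idea: align two token sequences with difflib and use ANSI colors to highlight where they diverge (the token lists in the usage example are made up for illustration):

```python
import difflib

def color_token_diff(ours, theirs):
    # Align the two token lists and print them inline:
    # red = tokens only in `ours`, green = tokens only in `theirs` (HF).
    RED, GREEN, RESET = "\x1b[31m", "\x1b[32m", "\x1b[0m"
    out = []
    sm = difflib.SequenceMatcher(a=ours, b=theirs)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.extend(ours[i1:i2])
        else:
            out.extend(RED + t + RESET for t in ours[i1:i2])
            out.extend(GREEN + t + RESET for t in theirs[j1:j2])
    print(" ".join(out))

# hypothetical example: one tokenizer drops an accented piece the other keeps
color_token_diff(["kant", "##shi"], ["kant", "##o", "##shi"])
```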
A comment regarding this issue from @apage43:
That's very helpful @cebtenzzre! Opening a PR with this in a minute.
@cebtenzzre Can you take a look at the new changes to see the improvement? #5740
Our wordpiece tokenizer has issues with Unicode. One of the problems is incomplete NFD normalization, causing many characters with accents to be dropped entirely when tokenized. Examples include `Kantō` -> `Kant` and `lǜshi` -> `lshi`.
Here is the diff for nomic-embed-text-v1 on wikitext.test.raw:

[collapsed diff omitted]
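For reference, HF's BERT tokenizer strips accents (when accent stripping is enabled, as it is by default for uncased models) by fully normalizing to NFD and then dropping combining marks, so accented characters fall back to their base forms rather than disappearing. A small Python sketch of that behavior, using the examples above:

```python
import unicodedata

def strip_accents_nfd(text):
    # Full NFD decomposition, then drop combining marks (category "Mn").
    # Multi-accent characters like "ǜ" decompose to "u" plus two marks,
    # so the base letter survives instead of the whole character vanishing.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents_nfd("Kantō"))  # Kanto (not "Kant")
print(strip_accents_nfd("lǜshi"))  # lushi (not "lshi")
```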