llama.cpp BPE tokenization of wiki.test does not match the HF tokenization #3502
Comments
Intermediate results of debugging:

I implemented an independent port of the GPT-2 tokenizer (will share the code if someone is interested) and it shows the same behavior as the […]
You can create the slow (or fast) GPT2 tokenizer in tests/test-tokenizer-0-falcon.py like so:
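The code block from the original comment is not preserved in this copy. A minimal sketch of what it could look like, assuming the HF transformers API and local vocab.json/merges.txt files (paths and expected outputs are assumptions based on the discussion below):

```python
# Hedged sketch: build both the slow and the fast GPT-2-style tokenizer
# from a local vocab.json/merges.txt pair (file paths are placeholders).
from transformers import GPT2Tokenizer, GPT2TokenizerFast

slow = GPT2Tokenizer(vocab_file="vocab.json", merges_file="merges.txt")
fast = GPT2TokenizerFast(vocab_file="vocab.json", merges_file="merges.txt")

# The thread reports a discrepancy on long digit runs, e.g. "2000":
print(slow.tokenize("2000"))  # reported as ["20", "00"] with the slow tokenizer
print(fast.tokenize("2000"))  # reported as ["200", "0"] with the fast tokenizer
```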
You will have to create the files vocab.json and merges.txt yourself. The file vocab.json should contain only the vocab map from Falcon's tokenizer.json (e.g. see https://huggingface.co/gpt2/blob/main/vocab.json). The file merges.txt should contain only the contents of the merges array, one array element per line (i.e. space-separated token pairs, e.g. see https://huggingface.co/gpt2/blob/main/merges.txt).

You will notice that the slow tokenizer tokenizes "2000" differently ("20" "00") than the fast one ("200" "0"). So yes, I think we are running into an HF implementation bug, but the cpp code tokenizes like the (presumably now less popular) slow tokenizer. The <|endoftext|> in front is trivial: it is just the artificially injected BOS token (which I believe is a Llama thing and should not be inserted for Falcon).
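For reference, a hedged sketch (not part of the original comment) of how those two files could be extracted from Falcon's tokenizer.json, assuming the usual HF layout with a "model" object containing "vocab" and "merges", and merges stored as space-separated strings:

```python
# Hedged sketch: pull vocab.json and merges.txt out of a tokenizer.json file.
# Assumes the layout {"model": {"vocab": {...}, "merges": [...]}}.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(tok["model"]["vocab"], f, ensure_ascii=False)

with open("merges.txt", "w", encoding="utf-8") as f:
    for merge in tok["model"]["merges"]:
        # each entry is a space-separated token pair, e.g. "Ġ t"
        f.write(merge + "\n")
```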
So maybe it is best to switch to the slow tokenizer in […]
I could imagine this to be a hairy problem, because I'd assume a couple of models have been trained with the fast tokenizers?
Yes, I suppose everyone uses the fast ones because they are the default, so having a tokenizer in llama.cpp which behaves differently is not good. One point I am still unclear about is whether the fast tokenizer, which for some reason (also) wants tokenizer.json rather than just the vocab.json/merges.txt files as input, relies on some extra information from tokenizer.json that makes it behave differently in our test case. So there is still some chance it might not be a bug in the HF implementation after all, but rather our lack of understanding of it. I'm hoping to learn more from HF's response to huggingface/tokenizers#1363.
The discrepancy here is because Falcon's tokenizer.json specifies a different pre_tokenizer. Most BPE-using models use the config that mimics GPT-2, i.e. "pretokenizing" is done with the standard regex:

```json
"pre_tokenizer": {
  "type": "ByteLevel",
  "add_prefix_space": false,
  "trim_offsets": true,
  "use_regex": true
}
```

Replacing Falcon's pre_tokenizer with this gives behavior consistent with the "slow" tokenizer, but that is not really what we want. Falcon instead specifies this:

```json
"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Punctuation",
      "behavior": "Contiguous"
    },
    {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": true
    },
    {
      "type": "Digits",
      "individual_digits": false
    },
    {
      "type": "Split",
      "pattern": {
        "Regex": "[0-9][0-9][0-9]"
      },
      "behavior": "Isolated",
      "invert": false
    }
  ]
}
```

That is, it first applies punctuation splitting, then the standard ByteLevel regex, then "Digits" (forcing spans of digits to be separated from non-digits), and finally an "isolated"-mode custom split on a regex matching 3 consecutive digits, so that no digit token longer than 3 characters makes it past the pretokenizer. It's the last step that seems to cause the discrepancy here, but the problem is that to be fully consistent with […]
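To make that pipeline concrete, here is a hedged sketch (not from the thread) that rebuilds the same pre_tokenizer sequence with the Python `tokenizers` library and runs it directly on a string; it assumes a reasonably recent `tokenizers` version in which `ByteLevel` accepts `use_regex`:

```python
# Hedged sketch: reproduce Falcon's pre_tokenizer chain with the HF `tokenizers` library
# and inspect how text is split before BPE is applied.
from tokenizers import Regex, pre_tokenizers

falcon_pre = pre_tokenizers.Sequence([
    pre_tokenizers.Punctuation(behavior="contiguous"),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=True),
    pre_tokenizers.Digits(individual_digits=False),
    pre_tokenizers.Split(Regex("[0-9][0-9][0-9]"), behavior="isolated", invert=False),
])

# Each element is (piece, (start, end)); long digit runs are chopped into chunks of at most 3 digits.
print(falcon_pre.pre_tokenize_str("In the year 2000 there were 12345 cases."))
```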
Could you take a look at my code? I followed the procedure you outlined and even checked the source code, but I'm still getting inconsistent results. Is there a way to directly test the pre-tokenizer without comparing the final output of the tokenizer? This might help us pinpoint the exact issue: #5613

I'm really confident that my GPT-2 style pre-tokenizer works correctly. I carefully followed the regex pattern and tested it extensively, using more than 10GB of data that included both synthetic and real examples.

Edit: Ah, I understand now! The […]
This issue was closed because it has been inactive for 14 days since being marked as stale.
I did the following test to tokenize wiki.test.raw using our tokenizer and the Python tokenizer. The expectation is that the outputs will match:
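The exact commands are not preserved in this copy of the issue. As a rough illustration, the Python (HF) side of such a comparison could look like the following hedged sketch, whose model id and file paths are assumptions:

```python
# Hedged sketch: tokenize wiki.test.raw with the HF Falcon tokenizer and dump the token ids,
# so they can be diffed against the ids produced by llama.cpp's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # assumed model id

with open("wiki.test.raw", encoding="utf-8") as f:
    text = f.read()

ids = tokenizer.encode(text)

with open("wiki.test.raw.tok", "w", encoding="utf-8") as f:
    f.write("\n".join(str(i) for i in ids))
```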
The results are pretty close, but not exactly the same. Any ideas why the test does not pass?
I thought that #3252 would resolve this
cc @goerch