llama.cpp BPE tokenization of wiki.test does not match the HF tokenization #3502
Comments
Intermediate results of debugging:

I implemented an independent port of the GPT-2 tokenizer (will share the code if someone is interested) and it shows the same behavior as the […]
You can create the slow (or fast) GPT2 tokenizer in tests/test-tokenizer-0-falcon.py like so:
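The code block from the original comment is not preserved in this copy. A minimal sketch of what it could look like, assuming the HF transformers API and local vocab.json/merges.txt files (paths and expected outputs are assumptions based on the discussion below):

```python
# Hedged sketch: build both the slow and the fast GPT-2-style tokenizer
# from a local vocab.json/merges.txt pair (file paths are placeholders).
from transformers import GPT2Tokenizer, GPT2TokenizerFast

slow = GPT2Tokenizer(vocab_file="vocab.json", merges_file="merges.txt")
fast = GPT2TokenizerFast(vocab_file="vocab.json", merges_file="merges.txt")

# The thread reports a discrepancy on long digit runs, e.g. "2000":
print(slow.tokenize("2000"))  # reported as ["20", "00"] with the slow tokenizer
print(fast.tokenize("2000"))  # reported as ["200", "0"] with the fast tokenizer
```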
You will have to create the files vocab.json and merges.txt yourself. The file vocab.json should contain only the vocab map from Falcon's tokenizer.json (e.g. see https://huggingface.co/gpt2/blob/main/vocab.json). The file merges.txt should contain only the contents of the merges array, one array element per line (i.e. space-separated token pairs, e.g. see https://huggingface.co/gpt2/blob/main/merges.txt).

You will notice that the slow tokenizer tokenizes "2000" differently ("20" "00") than the fast one ("200" "0"). So yes, I think we are running into an HF implementation bug, but the cpp code tokenizes like the (presumably now less popular) slow tokenizer. The <|endoftext|> in front is trivial: it is just the artificially injected BOS token (which I believe is a Llama thing and should not be inserted for Falcon).
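For reference, a hedged sketch (not part of the original comment) of how those two files could be extracted from Falcon's tokenizer.json, assuming the usual HF layout with a "model" object containing "vocab" and "merges", and merges stored as space-separated strings:

```python
# Hedged sketch: pull vocab.json and merges.txt out of a tokenizer.json file.
# Assumes the layout {"model": {"vocab": {...}, "merges": [...]}}.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(tok["model"]["vocab"], f, ensure_ascii=False)

with open("merges.txt", "w", encoding="utf-8") as f:
    for merge in tok["model"]["merges"]:
        # each entry is a space-separated token pair, e.g. "Ġ t"
        f.write(merge + "\n")
```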
So maybe it is best to switch to the slow tokenizer in […]
I could imagine this to be a hairy problem, because I'd assume a couple of models have been trained with the fast tokenizers?
Yes, I suppose everyone uses the fast ones because they are the default, so having a tokenizer in llama.cpp which behaves differently is not good. One point I am still unclear about is whether the fast tokenizer, which for some reason (also) wants tokenizer.json rather than just the vocab.json/merges.txt files as input, relies on some extra information from tokenizer.json that makes it behave differently in our test case. So there is still some chance it might not be a bug in the HF implementation after all, but rather our lack of understanding of it. I'm hoping to learn more from HF's response to huggingface/tokenizers#1363.
The discrepancy here is because Falcon's tokenizer.json specifies a different pre_tokenizer. Most BPE-using models use the config that mimics GPT-2, i.e. "pretokenizing" is done with the standard regex:

```json
"pre_tokenizer": {
  "type": "ByteLevel",
  "add_prefix_space": false,
  "trim_offsets": true,
  "use_regex": true
}
```

Replacing Falcon's pre_tokenizer with this gives behavior consistent with the "slow" tokenizer, but that is not really what we want. Falcon instead specifies this:

```json
"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Punctuation",
      "behavior": "Contiguous"
    },
    {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": true
    },
    {
      "type": "Digits",
      "individual_digits": false
    },
    {
      "type": "Split",
      "pattern": {
        "Regex": "[0-9][0-9][0-9]"
      },
      "behavior": "Isolated",
      "invert": false
    }
  ]
}
```

That is, it first applies punctuation splitting, then the standard ByteLevel regex, then "Digits" (forcing spans of digits to be separated from non-digits), and finally an "isolated"-mode custom split on a regex matching 3 consecutive digits, so that no digit token longer than 3 characters makes it past the pretokenizer. It's the last step that seems to cause the discrepancy here, but the problem is that to be fully consistent with […]
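To make that pipeline concrete, here is a hedged sketch (not from the thread) that rebuilds the same pre_tokenizer sequence with the Python `tokenizers` library and runs it directly on a string; it assumes a reasonably recent `tokenizers` version in which `ByteLevel` accepts `use_regex`:

```python
# Hedged sketch: reproduce Falcon's pre_tokenizer chain with the HF `tokenizers` library
# and inspect how text is split before BPE is applied.
from tokenizers import Regex, pre_tokenizers

falcon_pre = pre_tokenizers.Sequence([
    pre_tokenizers.Punctuation(behavior="contiguous"),
    pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=True),
    pre_tokenizers.Digits(individual_digits=False),
    pre_tokenizers.Split(Regex("[0-9][0-9][0-9]"), behavior="isolated", invert=False),
])

# Each element is (piece, (start, end)); long digit runs are chopped into chunks of at most 3 digits.
print(falcon_pre.pre_tokenize_str("In the year 2000 there were 12345 cases."))
```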
Could you take a look at my code? I followed the procedure you outlined and even checked the source code, but I'm still getting inconsistent results. Is there a way to directly test the pre-tokenizer without comparing the final output of the tokenizer? This might help us pinpoint the exact issue: #5613

I'm really confident that my GPT-2 style pre-tokenizer works correctly. I carefully followed the regex pattern and tested it extensively, using more than 10GB of data that included both synthetic and real examples.

Edit: Ah, I understand now! The […]
This issue was closed because it has been inactive for 14 days since being marked as stale.
I did the following test to tokenize wiki.test.raw using our tokenizer and the Python tokenizer. The expectation is that the outputs will match:
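The exact commands are not preserved in this copy of the issue. As a rough illustration, the Python (HF) side of such a comparison could look like the following hedged sketch, whose model id and file paths are assumptions:

```python
# Hedged sketch: tokenize wiki.test.raw with the HF Falcon tokenizer and dump the token ids,
# so they can be diffed against the ids produced by llama.cpp's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # assumed model id

with open("wiki.test.raw", encoding="utf-8") as f:
    text = f.read()

ids = tokenizer.encode(text)

with open("wiki.test.raw.tok", "w", encoding="utf-8") as f:
    f.write("\n".join(str(i) for i in ids))
```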
The results are pretty close, but not exactly the same. Any ideas why the test does not pass?
I thought that #3252 would resolve this
cc @goerch