convert.py : handle special tokens #2820

Closed
ggerganov opened this issue Aug 26, 2023 · 44 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@ggerganov
Owner

Here we need to start handling special tokens in convert.py:

llama.cpp/convert.py

Lines 790 to 800 in e4324cb

def add_meta_vocab(self, vocab: Vocab) -> None:
    tokens = []
    scores = []
    toktypes = []
    # NOTE: `all_tokens` returns the base vocabulary and added tokens
    # TODO: add special tokens?
    for text, score, toktype in vocab.all_tokens():
        tokens.append(text)
        scores.append(score)
        toktypes.append(toktype)

An example is shown in convert-llama-7b-pth-to-gguf.py:

if Path(dir_model + "/tokenizer.json").is_file():
    # Look for special tokens in tokenizer.json if it exists
    with open(dir_model + "/tokenizer.json", "r", encoding="utf-8") as f:
        tokenizer = json.load(f)

    if "added_tokens" in tokenizer and Path(dir_model + "/tokenizer_config.json").is_file():
        with open(dir_model + "/tokenizer_config.json", "r", encoding="utf-8") as f:
            tokenizer_config = json.load(f)

        if "bos_token" in tokenizer_config and tokenizer_config["bos_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["bos_token"]["content"]:
                    gguf_writer.add_bos_token_id(key["id"])

        if "eos_token" in tokenizer_config and tokenizer_config["eos_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["eos_token"]["content"]:
                    gguf_writer.add_eos_token_id(key["id"])

        if "unk_token" in tokenizer_config and tokenizer_config["unk_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["unk_token"]["content"]:
                    gguf_writer.add_unk_token_id(key["id"])

        if "sep_token" in tokenizer_config and tokenizer_config["sep_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["sep_token"]["content"]:
                    gguf_writer.add_sep_token_id(key["id"])

        if "pad_token" in tokenizer_config and tokenizer_config["pad_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["pad_token"]["content"]:
                    gguf_writer.add_pad_token_id(key["id"])
else:
    # If no tokenizer.json: Look for special tokens in config.json
    if "bos_token_id" in hparams and hparams["bos_token_id"] != None:
        gguf_writer.add_bos_token_id(hparams["bos_token_id"])
    if "eos_token_id" in hparams and hparams["eos_token_id"] != None:
        gguf_writer.add_eos_token_id(hparams["eos_token_id"])
    if "unk_token_id" in hparams and hparams["unk_token_id"] != None:
        gguf_writer.add_unk_token_id(hparams["unk_token_id"])
    if "sep_token_id" in hparams and hparams["sep_token_id"] != None:
        gguf_writer.add_sep_token_id(hparams["sep_token_id"])
    if "pad_token_id" in hparams and hparams["pad_token_id"] != None:
        gguf_writer.add_pad_token_id(hparams["pad_token_id"])
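
For reference, the "added_tokens" entries that the first branch above matches against look roughly like this in a Hugging Face tokenizer.json (shape recalled from memory, so treat the exact fields as an assumption):

"added_tokens": [
	{ "id": 1, "content": "<s>", "lstrip": false, "rstrip": false, "normalized": true, "single_word": false, "special": true },
	{ "id": 2, "content": "</s>", "lstrip": false, "rstrip": false, "normalized": true, "single_word": false, "special": true }
]

Each entry's "content" is what gets compared against the "content" of bos_token/eos_token/etc. from tokenizer_config.json, and "id" is what ends up in the GGUF metadata.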

ggerganov added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Aug 26, 2023
@KerfuffleV2
Collaborator

I can look at this after #2753 (hopefully) gets merged. I was planning on doing some cleanup work and fixing the type annotations; this seems like the kind of thing that would be reasonable to throw into that kind of pull as well.

@ggerganov
Owner Author

I think we need to use a model that utilizes special tokens to test this with. I see people mentioning "OpenChat V2 x OpenOrca" when they need to handle special tokens - maybe we can try to make those work.

@KerfuffleV2
Collaborator

I think we need to use a model that utilizes special tokens to test this with.

Using a model with special tokens to test handling special tokens is an idea just crazy enough to work!

@klosax
Contributor

klosax commented Aug 26, 2023

For BPE to work with llama models (Aquila?), convert.py should also add the merges, as is done in the falcon conversion script.
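
For context, the falcon conversion script reads the merge rules out of tokenizer.json and writes them through the GGUF writer; a rough sketch of that step (function and key names recalled from memory, so verify against the actual script):

import json

with open(dir_model + "/tokenizer.json", "r", encoding="utf-8") as f:
    tokenizer_json = json.load(f)

# BPE merge rules are stored under model.merges as strings like "h e"
merges = tokenizer_json["model"]["merges"]
gguf_writer.add_token_merges(merges)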

@KerfuffleV2
Collaborator

In progress over here: #2842

@ggerganov
Owner Author

The next step is using the special tokens in llama.cpp - any ideas what needs to be done?

My guess is we need to just update the id_to_token and token_to_id maps:

llama.cpp/llama.cpp

Lines 947 to 950 in dc07dc4

std::unordered_map<token, id> token_to_id;
std::vector<token_data> id_to_token;

@KerfuffleV2
Collaborator

I'm not sure where discussion about this should be.

For BPE to work with llama models (Aquila?)

I've been doing some testing with https://huggingface.co/BAAI/Aquila-7B and https://huggingface.co/kfkas/Llama-2-ko-7b-Chat trying to get the BPE stuff to work.

First, it seems like all these BPE models just die in llama.cpp without #2889. I'm a little surprised that pull has gotten no attention so far.

It also seems like the stuff in convert.py is still pretty far off even with merges being handled now. I started trying to fix some of it in #2938.

@rajveer43

Is this available to work on?

@ggerganov
Owner Author

We now have to use these special tokens in llama.cpp.

Can somebody confirm that the following is correct:

  • we load the following special tokens (e.g. open llama):
{
	"bos_token": {
		"content": "<s>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	},
	"eos_token": {
		"content": "</s>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	},
	"unk_token": {
		"content": "<unk>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	}
}
  • we now tokenize the following string <s>hello world</s>
  • the result is that <s> and </s> are no longer tokenized as plain strings; instead they map to the special tokens BOS and EOS, so we get, for example, the tokens [1, 22172, 3186, 2]
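
One way to sanity-check that expectation is against the Hugging Face tokenizer for the same model (the checkpoint name below is just an example of an open llama model, not something this issue prescribes):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")  # example checkpoint

# add_special_tokens=False so BOS is not prepended a second time;
# the <s> and </s> written in the text should still map to ids 1 and 2
ids = tok.encode("<s>hello world</s>", add_special_tokens=False)
print(ids)  # expected, per the comment above: [1, 22172, 3186, 2]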

@KerfuffleV2
Collaborator

I don't know what behavior is considered correct, but it seems like in that particular case it means you can't talk about HTML strikethrough tags anymore. I.e. a prompt like "Dear LLaMA model, please make a list according to such and such rules. Surround elements that meet a certain criterion with strikethrough like <s>item</s>." You'll pretty much immediately get nonsense if <s> and </s> are tokenized to BOS/EOS — same thing whenever the special tokens can conflict with something else that could plausibly be in a prompt.

It's less of an issue when the special tokens are like <|endoftext|> or whatever, since that's less likely to be something a user would write.

@klosax
Contributor

klosax commented Sep 3, 2023

Normally the special token strings should not be recognized as special tokens in the user prompt. Better to have a CLI parameter for users who need to use them. Instead of using the model vocab these tokens should be user-configurable. Something like --bos-token "<|my-bos-token|>" should work.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 3, 2023

What if it was something like --set-token "1=<|my-bos-token|>"? Then there would be a general facility to override any token id, sort of like setting the logit overrides. (Maybe --set-token isn't great, could be --override-token, --assign-token, whatever.)

@klosax
Contributor

klosax commented Sep 3, 2023

Then there would be a general facility to override any token id, sort of like setting the logit overrides.

The token ids for the special tokens may differ from model to model like the default mapping strings. So any external use of the special tokens should not depend on knowing the token ids.

@klosax
Contributor

klosax commented Sep 3, 2023

In addition, a CLI parameter for enabling or disabling printing the special tokens in the output from the model would be good.

@KerfuffleV2
Collaborator

So any external use of the special tokens should not depend on knowing the token ids.

Decent point. What I was talking about could still work with a small adaptation so that you could use names like bos, unk, etc. in addition to ids, e.g. --override-token "bos=<|my-bos-token|>".
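
A minimal sketch of how such an override argument could be parsed (the flag and its NAME=CONTENT format are only the proposal above, nothing that exists in the code yet):

def parse_token_override(arg: str) -> tuple[str, str]:
    # accepts "bos=<|my-bos-token|>" as well as a raw id like "1=<|my-bos-token|>"
    name, sep, content = arg.partition("=")
    if not sep or not name or not content:
        raise ValueError(f"expected NAME=CONTENT, got {arg!r}")
    return name, content

print(parse_token_override("bos=<|my-bos-token|>"))  # ('bos', '<|my-bos-token|>')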

@klosax
Contributor

klosax commented Sep 3, 2023

What I was talking about could still work with a small adaptation that you could use stuff like bos, unk, etc in addition to ids.

Yes, that could also work.

Here is a snippet of the tinystories dataset. To tokenize this dataset correctly independently of the model, a parameter for setting the EOS token is needed.

@l3utterfly
Contributor

Hi, is handling special tokens working in the latest master branch? I tested with https://huggingface.co/openchat/openchat_v3.2_super

It doesn't seem to work. I added a print and exit in convert.py, logging SpecialVocab; the special tokens don't seem to be picked up yet.

@ggerganov
Owner Author

ggerganov commented Sep 13, 2023

I need to understand how special tokens work. If they are not parsed during prompt processing, then I don't understand what their purpose is at all.

@l3utterfly
Contributor

l3utterfly commented Sep 15, 2023

From my understanding:

Special tokens are used in finetunes to provide better structure in LLM's output.

  1. They are custom defined for each finetune (for example, the OpenChat finetune uses the <|end_of_turn|> token after each person in a conversation), so they are guaranteed not to be present in the base model.
  2. Training on data formatted to use these tokens will generally provide better results, because the model will know to activate weights that are related to the finetune when it sees the special tokens as part of the input. This coerces the model into output structures more closely related to the training format.
  3. It helps end-user applications in parsing the output of the LLM. For example, when any BOS or EOT (end of turn) token is hit, the end-user application can apply logic such as stopping the output and waiting for more input. Kind of the same way "reverse prompt" works in llama.cpp, but more generalised (see the sketch at the end of this comment). For example, "pass" tokens can be used to "pass" the conversation to other agents in multi-agent conversations.
  4. Users can also use the special tokens as part of their prompt. A tokeniser that supports the special tokens will automatically parse them correctly. For example, a prompt could be: User: Hello<|end_of_turn|>Assistant:

Special tokens are defined by the organisers of each dataset during their respective finetunes, so their uses vary. So I think it's a good feature to support arbitrary special tokens in the llama.cpp convert script by reading "added_tokens.json" and adding them to the GGUF. Users of those finetunes will know how to use the special tokens at their end as long as those tokens are outputted by the LLM.

A drawback of special tokens is that yes, when defined thoughtlessly, they will conflict with the output, as in the case of </s>, which means the model cannot talk about HTML strikethroughs. This tradeoff is usually handled by the finetuners themselves. In the OpenChat example again, the <|end_of_turn|> token is chosen so that the probability of it coming up in conversation is so astronomically low that the finetuners consider it acceptable.
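
To make point 3 above concrete, here is a small pseudo-Python sketch of an application-side loop that stops on an end-of-turn token (the generate_token callback and the ids are placeholders, not llama.cpp API):

def run_turn(generate_token, eot_id: int, max_tokens: int = 512) -> list[int]:
    # collect tokens until the model emits the end-of-turn special token
    out: list[int] = []
    for _ in range(max_tokens):
        tok = generate_token()   # whatever sampling loop the application uses
        if tok == eot_id:        # e.g. the id of <|end_of_turn|> for OpenChat
            break                # hand the conversation back to the user or the next agent
        out.append(tok)
    return out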

@ggerganov
Owner Author

@l3utterfly Thank you - I think this description gives an answer to my question earlier.

So based on this, I think the only part that is currently missing is to add the special token pieces (i.e. the text such as <s>, <|end_of_turn|>, etc.) to the KEY_TOKENIZER_LIST before writing the vocab with gguf.py, and it should work.

@KerfuffleV2 Are you interested in looking into this? Probably just SpecialVocab::add_to_gguf has to be updated

To test this, after updating gguf.py and converting a model that has the following special tokens:

{
	"bos_token": {
		"content": "<s>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	},
	"eos_token": {
		"content": "</s>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	},
	"unk_token": {
		"content": "<unk>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	}
}

main should tokenize the string Hello world</s> as [1, 22172, 3186, 2].
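
For the actual check, something along these lines should do (the model path is a placeholder; --verbose-prompt prints the token ids, as shown further down in this thread):

./main -m converted-model.gguf --verbose-prompt -p "Hello world</s>"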

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 15, 2023

Sure, I can look at it. Not sure I'm 100% clear on what needs to happen from the conversion side so I may need to ask some follow up questions. I'll see what I can figure out on my own first.

edit: I think it's definitely going to be a lot more complicated than just changing add_to_gguf, though. That function doesn't have access to the full vocab list; it just calls add_bos_token_id, etc. Also, SpecialVocab just handles a fixed list of special tokens like bos that have an add_BLAH_token_id function in GGUFWriter, but presumably we want to support arbitrary special tokens that may not fall into that set.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 15, 2023

@ggerganov Actually, I'm confused. We already write the text content for special tokens like BOS and llama.cpp seems to already know what the content is for the tokens. For example, when starting up:

llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'

So I think the issue is that the C++ tokenizer side is not using the token content of tokens like BOS when tokenizing, rather than this being something that could be fixed on the model conversion side. Or am I misunderstanding something?

edit: Not sure if it's significant for this, but BOS and EOS get added with token type control (3) and UNK gets added with token type unknown (2). Possibly that's why they're getting ignored when tokenizing.
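
For reference, those numbers come from the token type enum shared by gguf.py and llama.h; the values below are recalled from memory, so double-check against the source:

from enum import IntEnum

class TokenType(IntEnum):    # mirrors llama_token_type in llama.h
    UNDEFINED    = 0
    NORMAL       = 1
    UNKNOWN      = 2         # what UNK is written as above
    CONTROL      = 3         # what BOS/EOS are written as above
    USER_DEFINED = 4
    UNUSED       = 5
    BYTE         = 6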

@ggerganov
Owner Author

We already write the text content for special tokens like BOS and llama.cpp seems to already know what the content is for the tokens.

Ah, I guess the Python classes somehow already took care of that. Then I think we are done - special tokens should already work. Can you check how Hello world</s> tokenizes with main --verbose-prompt?

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 15, 2023

It tokenizes like:

     1 -> ''
 16644 -> ' Hello'
   924 -> ' world'
  1089 -> '</'
 31829 -> 's'
 31901 -> '>'

I've been messing around with the C++ side and I don't really understand what's going on. I thought maybe it was because the </s> token had a score of 0.0 but setting it to 10000000.0 doesn't do anything. Setting its token type to 1 (normal) doesn't do anything when tokenizing either (setting BOS to normal makes it render as <s> though).

Dumping the text in llama_tokenizer_spm::tokenize looks like:

-- ▁Hello▁world</s>
-- Hello▁world</s>
-- ello▁world</s>
-- llo▁world</s>
-- lo▁world</s>
-- o▁world</s>
-- ▁world</s>
-- world</s>
-- orld</s>
-- rld</s>
-- ld</s>
-- d</s>
-- </s>
-- /s>
-- s>
-- >

I also added some debug prints to just before resegment gets called and in resegment:

0: 8 -- ▁Hello▁world</s>
** 16644 -- [▁Hello]
6: 8 -- ▁world</s>
** 924 -- [▁world]
12: 2 -- </s>
** 1089 -- [</]
14: 1 -- s>
** 31829 -- [s]
15: 1 -- >
** 31901 -- [>]
Patch
--- a/llama.cpp
+++ b/llama.cpp
@@ -3578,6 +3585,7 @@ struct llm_tokenizer_spm {
             llm_symbol sym;
             size_t len = utf8_len(text[offs]);
             sym.text = text.c_str() + offs;
+            printf("\n-- %s\n", text.c_str() + offs);
             sym.n = std::min(len, text.size() - offs);
             offs += sym.n;
             sym.prev = index - 1;
@@ -3624,6 +3632,7 @@ struct llm_tokenizer_spm {
 
         for (int i = 0; i != -1; i = symbols[i].next) {
             auto & symbol = symbols[i];
+            printf("%d: %zu -- %s\n", i, symbol.n, symbol.text);
             resegment(symbol, output);
         }
     }
@@ -3635,9 +3644,11 @@ private:
 
         // Do we need to support is_unused?
         if (token != vocab.token_to_id.end()) {
+            printf("** %d -- [%s]\n", token->second, text.c_str());
             output.push_back((*token).second);
             return;
         }
+        printf("!! [%s]\n", text.c_str());
 
         const auto p = rev_merge.find(text);

I also tried adding some debug output to try_add_bigram:

BIG: Found 0,1: 349 -- ▁H
BIG: Found 1,2: 4301 -- He
BIG: Found 2,3: 307 -- el
BIG: Found 3,4: 608 -- ll
BIG: Found 4,5: 4685 -- lo
BIG: Not found 5,6: o▁
BIG: Found 6,7: 271 -- ▁w
BIG: Found 7,8: 679 -- wo
BIG: Found 8,9: 272 -- or
BIG: Found 9,10: 13468 -- rl
BIG: Found 10,11: 395 -- ld
BIG: Not found 11,12: d<
BIG: Found 12,13: 1089 -- </
BIG: Not found 13,14: /s
BIG: Not found 14,15: s>
left = '▁world</s>' size = 4
BIG: Not found 5,6: o▁w
BIG: Not found 6,8: ▁wo
left = 'orld</s>' size = 2
BIG: Found 6,8: 456 -- ▁wor
BIG: Not found 8,10: orl
left = 'ello▁world</s>' size = 2
BIG: Found 1,2: 13588 -- Hel
BIG: Found 2,4: 452 -- ell
left = '▁Hello▁world</s>' size = 4
BIG: Bail: -1, 0
BIG: Found 0,2: 4161 -- ▁Hel
left = 'ld</s>' size = 2
BIG: Found 8,10: 12863 -- orld
BIG: Not found 10,12: ld<
left = 'ello▁world</s>' size = 3
BIG: Found 0,2: 10555 -- ▁Hell
BIG: Found 2,5: 7090 -- ello
left = '▁world</s>' size = 6
BIG: Not found 5,6: o▁wor
BIG: Found 6,10: 924 -- ▁world
left = '▁world</s>' size = 8
BIG: Not found 5,6: o▁world
BIG: Not found 6,12: ▁world<
left = '</s>' size = 2
BIG: Not found 6,12: ▁world</
BIG: Not found 12,14: </s
left = 'ello▁world</s>' size = 4
BIG: Found 0,2: 16644 -- ▁Hello
BIG: Not found 2,6: ello▁world
left = '▁Hello▁world</s>' size = 8
BIG: Bail: -1, 0
BIG: Not found 0,6: ▁Hello▁world

It doesn't look like it tried a combination with </s>. I don't really understand how that works so maybe that's expected.

@l3utterfly
Contributor

l3utterfly commented Sep 15, 2023

I took a look at the tokenising code in c++: llm_tokenizer_spm::tokenize. I'm not that familiar with the code, but from my limited understanding, it seems to be doing this:

  1. splitting the text into utf8 chars
  2. for each character, attempt to create a bi-gram out of it by combining two adjacent chars and looking for it in the vocab
  3. recursively combining bi-grams to look for longer matches in the vocab
  4. re-segment seems to be recursively going back up the tree finding matches (?). To be honest, I'm a little unclear on the purpose of this at the moment

Debugging with the prompt: Hello</s> (BOS string is automatically added by main), I can see it's splitting the </s> because the logic identifies and merges the </ token and s tokens first.

From my understanding, this bi-gram focused tokenisation may skip over long tokens (tokens spanning multiple characters) because they may not merge correctly, perhaps because the number of shorter tokens identified within the long token just so happens not to be divisible by two? (it seems </s> gets split into 3 tokens)

My thought is to use a greedy search on the tokens (n-grams), attempting to match tokens starting from the longest possible length. Regarding the token vocab, we can use a retrieval tree for prefix matching to speed up the search.

struct TrieNode {
    bool is_end = false;
    std::unordered_map<char, std::unique_ptr<TrieNode>> children;
    llama_vocab::id token_id;
};

class Trie {
public:
    TrieNode* root = new TrieNode();

    void insert(const std::string &word, llama_vocab::id id) {
        TrieNode* node = root;
        for (char c : word) {
            if (node->children.find(c) == node->children.end()) {
                node->children[c] = std::make_unique<TrieNode>();
            }
            node = node->children[c].get();
        }
        node->is_end = true;
        node->token_id = id;
    }

    std::pair<bool, llama_vocab::id> search(const std::string &word) {
        TrieNode* node = root;
        for (char c : word) {
            if (node->children.find(c) == node->children.end()) {
                return {false, -1};
            }
            node = node->children[c].get();
        }
        if (node->is_end) {
            return {true, node->token_id};
        }
        return {false, -1};
    }
};

The tokenize function would then be:

void tokenize(const std::string & text, std::vector<llama_vocab::id> & output) {
        Trie vocabTrie;

        // Populate trie with vocabulary
        for (const auto &pair : vocab.token_to_id) {
            const llama_vocab::token &token = pair.first;
            const llama_vocab::id &id = pair.second;
            vocabTrie.insert(token, id);
        }

        size_t pos = 0;
        while (pos < text.size()) {
            size_t max_len = 0;
            llama_vocab::id max_token_id;
            
            // Check all possible sub-strings starting from pos, favoring the longest possible tokens
            for (size_t len = text.size() - pos; len >= 1; --len) {
                std::pair<bool, llama_vocab::id> search_result = vocabTrie.search(text.substr(pos, len));
                if (search_result.first) {
                    max_len = len;
                    max_token_id = search_result.second;
                    break;
                }
            }
            
            if (max_len > 0) {
                output.push_back(max_token_id);
                pos += max_len;
            } else {
                // TODO: add logic to handle the case where no token is found, 
                // such as adding individual characters to the output or advancing by 
                // the length of the next UTF-8 character.
                pos += utf8_len(text[pos]); // advances by the length of the next UTF-8 character
            }
        }
    }

I tested with the prompt Hello</s>, and it seems to tokenize correctly into:

main: prompt: 'Hello</s>'
main: number of tokens in prompt = 3
     1 -> ''
 15043 -> ' Hello'
     2 -> ''

Using a model that supports the EOS token, it correctly passes the conversation to the "Assistant" after </s> is reached.

A few things to note in my implementation of the tokeniser:

  1. This is a proof of concept I wrote up in a few hours after trying to understand the current tokeniser, the greedy search could be very inefficient here as it checks all possible substrings of the prompt
  2. I am not sure about the implications of this new tokeniser returning the longest possible token matches. Also, it seems @KerfuffleV2 got a different token from me for " Hello", but that could just be because we are testing with different models.
  3. My tokeniser doesn't handle utf8 chars at all at the moment
  4. My tokeniser ignores all invalid tokens at the moment
  5. It's constructing a new retrieval tree every time tokenise is called. We should probably make the retrieval tree the default way to store the vocab if this method is to go forward

I did a few short tests with my models; the coherence of the LLM output seems normal to me.

I am wondering if this is the right direction to head in? @ggerganov

@ggerganov
Owner Author

Looks like the right direction - although I'm not 100% sure, as I don't have a deep understanding of how the tokenizer works.
It is important to make sure that test-tokenizer-0-llama and test-tokenizer-1-llama still work after this change:

./bin/test-tokenizer-0-llama ../models/ggml-vocab-llama.gguf
./bin/test-tokenizer-1-llama ../models/ggml-vocab-llama.gguf

Tagging @goerch in case they might have some insight.

@ggerganov
Owner Author

I just remembered that some time ago #1931 was proposed, but the PR remained unmerged as it came during a big refactoring effort. It looks like @Igoorx proposed changes to the tokenizer to handle special tokens. Might be worth looking into that and resurrecting the PR.

@KerfuffleV2
Collaborator

Looks like that creates a special-token-to-id map and special-cases checking it: https://github.com/ggerganov/llama.cpp/pull/1931/files#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR2098-R2136

I guess it would be possible to take that approach without fully rewriting the tokenizer. (Also, wasn't the tokenizer initially the greedy type a long time ago and then got changed or am I remembering incorrectly?)

@goerch
Collaborator

goerch commented Sep 16, 2023

I also added some debug prints to just before resegment gets called and in resegment

I took resegment from here.

@ggerganov , @klosax : are we talking about sentencepiece or gpt2-like tokenization here (which I only tested once with unconvincing results)? Do we have a reference model for gpt2-like tokenization like Aquila or Baichuan already under test?

If we are talking about special tokens for sentencepiece, do you mean user-defined tokens, or is this a different extension mechanism? And indeed, in this case we should try to revive #1931.

@goerch
Collaborator

goerch commented Sep 19, 2023

(which I only tested once with unconvincing results)

I looked into some of the open issues at #3252. @KerfuffleV2 : which models are we testing here?

@goerch
Collaborator

goerch commented Sep 20, 2023

I'm staring at the following code in BpeVocab:

    def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None) -> None:
        self.bpe_tokenizer = json.loads(open(str(fname_tokenizer), encoding="utf-8").read())
        added_tokens: dict[str, int]
        if fname_added_tokens is not None:
            # FIXME: Verify that added tokens here _cannot_ overlap with the main vocab.
            added_tokens = json.load(open(fname_added_tokens, encoding="utf-8"))
        else:
            # Fall back to trying to find the added tokens in tokenizer.json
            tokenizer_json_file = fname_tokenizer.parent / 'tokenizer.json'
            if not tokenizer_json_file.is_file():
                added_tokens = {}
            else:
                tokenizer_json = json.load(open(tokenizer_json_file, encoding="utf-8"))
                added_tokens = dict(
                    (item['content'], item['id'])
                    for item in tokenizer_json.get('added_tokens', [])
                    # Added tokens here can be duplicates of the main vocabulary.
                    if item['content'] not in self.bpe_tokenizer )

Are there known cases where fname_tokenizer <> fname_tokenizer.parent / 'tokenizer.json' (that would seem illogical to me)? Otherwise we are reading the same file twice for no reason.

@KerfuffleV2
Collaborator

fname_tokenizer would be vocab.json for BPE, and added_tokens.json possibly for fname_added_tokens. fname_tokenizer.parent is essentially dirname: it strips off the last element of the path. So /blah/blah/vocab.json's "parent" is /blah/blah/ and fname_tokenizer.parent / 'tokenizer.json' is just the tokenizer.json in the same directory as vocab.json.

Does the way it works make sense? Who knows! I think I was the one that added some fallback logic there, but I mostly left it the way it was originally when I was messing with that part.
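
A quick pathlib illustration of the paths being discussed:

from pathlib import Path

fname_tokenizer = Path("/blah/blah/vocab.json")
print(fname_tokenizer.parent)                     # /blah/blah
print(fname_tokenizer.parent / "tokenizer.json")  # /blah/blah/tokenizer.json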

@goerch
Collaborator

goerch commented Sep 20, 2023

Does the way it works make sense? Who knows!

I certainly don't. But here is my current understanding (partly based on #3252 (comment)):

We try to support two classes of tokenizers:

  • SPM (sentencepiece)
    • SPM splits input into pieces and tokenizes, somewhere in this process we have Unicode normalization
    • SPM differentiates token types (most important ones being UNKNOWN, CONTROL, BYTE, NORMAL)
    • SPM supports pad_token, unk_token, bos_token and eos_token
  • BPE (GPT-2 like)
    • BPE splits input by a magic regexp (not supported in C++), byte encodes the pieces with some more magic and then tokenizes
    • BPE does not directly support token types, but considers some Unicode character types in the magic regexp
    • BPE does not directly support pad_token, unk_token, bos_token and eos_token, but has something like <|endoftext|> for most of them

Both tokenizers use some kind of byte pair encoding to tokenize.

Regarding the source of complete models we have

  • Original LLaMa models
    • Tokenizer file is tokenizer.model, can be loaded by sentencepiece
    • Token types, pad_token, unk_token, bos_token and eos_token are determined by SPM
  • Huggingface models
    • Huggingface adds some cognitive burden with APIs
    • We could have at least a SPM or BPE tokenizer, determined by tokenizer_config.json (if existent?)
    • tokenizer_config.json contains information about pad_token, unk_token, bos_token and eos_token.
    • Our tokenizer file currently seems to be vocab.json, although for Aquila and Falcon I see a more complete tokenizer.json
    • We have added tokens in tokenizer.json, which could or could not be part of the vocabulary and look a lot like CONTROL tokens to me
    • Added tokens can additionally(?) be described in added_tokens.json
    • We optionally have special_tokens_map.json which contains a mix of information about CONTROL tokens and pad_token, unk_token, bos_token and eos_token
    • I don't have the slightest idea about Huggingface API revisions.

We invented something like linefeed_token additionally.

On the implementation side it seems we have tokenizer handling split across a couple of conversion scripts, gguf.py and the corresponding llama.cpp code.

Here are my most urgent questions:

  • Is there any good source of documentation for HF tokenizer (or model) files or API revisions?
  • What am I missing in the description of the requirements?
  • Any way to simplify the requirements (my first idea would be to require the existence of tokenizer_config.json and tokenizer.json for HF models and disregard added_tokens.json and special_tokens_map.json if possible)?
  • Where should we consolidate tokenizer handling on our conversion side?

@nlpcat

nlpcat commented Sep 24, 2023

It still has problems supporting special tokens in StarCoder, like <fim_prefix>, with BPE.

@jploski
Contributor

jploski commented Oct 5, 2023

The lack of handling for special tokens in llm_tokenizer_spm also affects Mistral Orca.

In SentencePiece's original implementation there is something called PrefixMatcher, which is initialized with user_defined_symbols (as the special tokens are called there). This PrefixMatcher is then used to split the input into "character sequence" in the BPE tokenizer. I suppose it skips right over the atomic/unsplittable special tokens before the main BPE algorithm begins.

The llama.cpp implementation (which is apparently a port of/inspired by the bpe_model.cc from SentencePiece linked above) instead "splits the input into utf8 chars", but without the matcher part, i.e. disregarding the atomic special tokens.
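
A Python sketch of the matcher idea, i.e. splitting the raw text on the special-token strings before handing the remaining fragments to the ordinary SPM/BPE pass (this only illustrates the concept; the real change would live in the C++ tokenizer):

def split_on_special(text: str, specials: dict[str, int]) -> list:
    # returns a mix of plain-text fragments (str) and special-token ids (int), longest match first
    out = []
    buf = ""
    i = 0
    ordered = sorted(specials.items(), key=lambda kv: -len(kv[0]))
    while i < len(text):
        match = next(((piece, tid) for piece, tid in ordered if text.startswith(piece, i)), None)
        if match is None:
            buf += text[i]
            i += 1
        else:
            if buf:
                out.append(buf)       # fragment to be tokenized normally
                buf = ""
            out.append(match[1])      # atomic special token, never split
            i += len(match[0])
    if buf:
        out.append(buf)
    return out

# split_on_special("Hello world</s>", {"<s>": 1, "</s>": 2}) -> ["Hello world", 2]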

@staviq
Contributor

staviq commented Oct 5, 2023

In SentencePiece's original implementation there is something called PrefixMatcher

Thank you, I was thinking the same thing yesterday, and couldn't find any confirmation.

I already found a way to extract unsplittable tokens directly in the tokenizer without any model/convert.py changes; I'm gonna play with this some more. I have a general idea of how to solve this with minor changes in the tokenizer function.

I also found a separate approach for tokenizing in O(log N), while solving this problem in the process, by building a tree structure of token/"subtokens" and matching downwards instead of upwards (matching full long tokens first). I have to try this to see how consistent it would be with the current tokenizer.

@jploski
Contributor

jploski commented Oct 5, 2023

I also found a separate approach for tokenizing in O(log N), while solving this problem in the process, by building a tree structure of token/"subtokens" and matching downwards instead of upwards (matching full long tokens first).

I assume you are familiar with the trie data structure? I think this is what PrefixMatcher uses, although it may be overkill for finding all occurrences of a couple of substrings in a short body of text. Apart from that, regular expressions come to mind. (I don't know how important it is for the implementation to stay similar to SentencePiece's for comparability.)

@staviq
Contributor

staviq commented Oct 5, 2023

trie

Oh, it has a name :) That's the exact thing I had in mind; I've been using it since uni for text sorting and searching, I just didn't know what it's called in English :)

@l3utterfly
Contributor

l3utterfly commented Oct 5, 2023

@staviq

Here's a proof of concept I wrote a while ago that uses a trie: #2820 (comment)

Hope it helps. It tokenises special tokens correctly, but I haven't had the time to add support for UTF8 chars and edge cases yet.

@goerch
Collaborator

goerch commented Oct 6, 2023

In SentencePiece's original implementation there is something called PrefixMatcher, which is initialized with user_defined_symbols (as the special tokens are called there).

What do you think about reviving #1931 as suggested in #2820 (comment)?

@teleprint-me
Contributor

teleprint-me commented Oct 15, 2023

I'm working on an experimental solution to this problem because I keep running into it, and I'm not the only one; there are plenty of other issues related to this.

I'm confident there's a way to do this without creating dependencies.

We technically do not need to rely on huggingface and I can actually see reliance on it becoming an issue of its own.

I'm in the middle of creating some utilities to dump the necessary data to mapped data structures for reuse; think of it like a programmatic hexdump, but for models.

I already created one for safetensors. My next goal is to handle it for torch models. Then for huggingface models.

If my intuition is correct, then we shouldn't really need huggingface at all which would actually be a really good thing.

It would also be flexible enough to build on top of and extend as needed.

It would create a gateway towards unifying and streamlining all model conversions as well, which is my end goal.

This comment is a copy-paste from PR #3633.

@ds5t5 @ggerganov @Green-Sky

I'd like to know if this is a path worth pursuing. Let me know.

@cebtenzzre
Collaborator

Is this fixed by #3538?

@ggerganov
Owner Author

I hope so. Looking for more reports on whether this works as expected.
I've posted an example, based on my understanding of how ChatML is supposed to work: #3475 (comment)

@ggerganov
Owner Author

Optimistically marking this as resolved. Likely we have to take an extra look at the proposal in #3664 in order to cover all cases. And we probably need #3585 merged to be able to convert models without errors.
