
Qwen-72B-Chat conversion script does not treat <|im_start|> and <|im_end|> correctly. #4331

Closed

Noeda opened this issue Dec 4, 2023 · 13 comments

Comments

@Noeda
Contributor

Noeda commented Dec 4, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Description

(This is specifically for the latest 72B models. I have never tried the smaller ones).

I'm using this model: https://huggingface.co/Qwen/Qwen-72B-Chat

Commit: 33e171d1e9fc4903f9314b490d77fb8d58331b63

I think the current convert-hf-to-gguf.py does not produce a .gguf file that treats the <|im_start|> and <|im_end|> tokens correctly.

The prompt I used is "<|im_start|>system" for the examples below.

Following the steps in #4281 to produce some .gguf files (I personally used the Q6_K on a Mac Studio), I tried the tokenize tool:

    27 -> '<'
    91 -> '|'
   318 -> 'im'
  4906 -> '_start'
    91 -> '|'
    29 -> '>'
  8948 -> 'system'

Compare this to a Yi model with the exact same prompt:

     6 -> '<|im_start|>'
 10707 -> 'system'

Looking at the Qwen tokenizer code (https://huggingface.co/Qwen/Qwen-72B/blob/main/tokenization_qwen.py#L37), I think these are intended to be single tokens, but the current script does not handle them that way.
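
For reference, this can be checked against the upstream tokenizer. A minimal sketch, assuming the transformers package and Qwen's custom tokenization_qwen.py (loaded via trust_remote_code); the expected IDs are the ones discussed in this thread:

    # Sanity check with the reference HF tokenizer (not part of the conversion script).
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
    print(tok.encode("<|im_start|>system"))
    # Expected: [151644, 8948] -- '<|im_start|>' should come out as a single token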

Steps to Reproduce

  1. Download the Qwen models. (https://huggingface.co/Qwen/Qwen-72B-Chat)
  2. Use the convert-hf-to-gguf.py script to convert one into a .gguf file. (This is the exact command I found on my Mac Studio: python3 convert-hf-to-gguf.py --outfile /Volumes/T9/qwen_72b_chat_v3_f16.gguf --outtype f16 ~/text-generation-webui/models/Qwen_Qwen-72B-Chat)
  3. Run tokenize on them to see what tokens are interpreted.

To be honest, I'm not sure whether this is a bug for the llama.cpp repository or something the Qwen team might want to fix in their repo, but I'm submitting it here for awareness.

Also, the model seems to work fine despite this. But maybe it would work better if they were interpreted correctly? No idea.

@Noeda Noeda changed the title Qwen conversion script does not treat <|im_start|> and <|im_end|> correctly. Qwen-72B-Chat conversion script does not treat <|im_start|> and <|im_end|> correctly. Dec 4, 2023
@cmp-nct
Contributor

cmp-nct commented Dec 4, 2023

You should link the exact model you used, make sure it's what you have locally and not something else. Make sure you have ALL files from the HF dump in your local directory and that nothing is cut or missing.

What you describe sounds like an error in the tokenizer import: either the wrong tokenizer is used or the special tokens were not read in.
For the usual models (all I have tested in the past month) this has always worked.

As you noticed: if that tokenization is not working, then the finetune is effectively broken, so those really have to be single tokens.

@Noeda
Contributor Author

Noeda commented Dec 4, 2023

@cmp-nct I used this model: https://huggingface.co/Qwen/Qwen-72B-Chat (I edited to include that in the original post)

I used this command to convert it into .gguf (also in my original comment):

python3 convert-hf-to-gguf.py --outfile /Volumes/T9/qwen_72b_chat_v3_f16.gguf --outtype f16 ~/text-generation-webui/models/Qwen_Qwen-72B-Chat

I have verified that my local Qwen_Qwen-72B-Chat directory is exactly the model from the HuggingFace repo I linked (well, at least by file names; it's not obviously diverging, but I admit I didn't do a thorough check like comparing md5sums over all the files).

The model works fine. I just think it may not be working optimally because the tokens are not treated specially.

@cmp-nct
Contributor

cmp-nct commented Dec 4, 2023

Okay that looks bad at first glance.

    ENDOFTEXT = "<\|endoftext\|>"
  | IMSTART = "<\|im_start\|>"
  | IMEND = "<\|im_end\|>"
  | # as the default behavior is changed to allow special tokens in
  | # regular texts, the surface forms of special tokens need to be
  | # as different as possible to minimize the impact
  | EXTRAS = tuple((f"<\|extra_{i}\|>" for i in range(205)))
  | # changed to use actual index to avoid misconfiguration with vocabulary expansion
  | SPECIAL_START_ID = 151643
  | SPECIAL_TOKENS = tuple(
  | enumerate(
  | (
  | (
  | ENDOFTEXT,
  | IMSTART,
  | IMEND,
  | )

So I am guessing here; I do not know whether we have special handling code for Qwen implemented, but it would be required.
There are two important differences:

  1. It uses a custom tokenizer! It has similarities to GPT-2 tokenization, but the regex looks different (this would need to be checked). If the regex is different, the tokens would be split in a different way - most likely similarly for English text and differently for Chinese.

  2. It assigns the 3 special tokens manually to IDs 151643, 151644, and 151645 in that tokenizer Python file. So, as a quick fix, you can hack the JSON files and add the 3 tokens with their correct IDs. Usually you find them in two JSON files; you can edit them with any proper text editor (compare with the Yi files).

Once that is done you'd have correct tokens for the 3 special characters; however, there will be more issues.
I've not looked into Qwen, so take what I said with a grain of salt and under the assumption that we do not have Qwen handling code somewhere. Maybe there are open threads here that discuss it; maybe you'll find a PR handling it.

If you decide to continue and hack those 3 tokens into the file, make sure you run the Python demo implementation and add a print on the tokenizer output (that's what is fed into generation, one of the main lines).
This way you can compare the tokenization of a prompt and check whether the llama.cpp version produces similar results.
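
A minimal sketch of that comparison, assuming the HF transformers tokenizer picks up Qwen's custom tokenizer code; the print format just mimics llama.cpp's tokenize tool so the two outputs can be compared line by line:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
    prompt = "<|im_start|>system"
    for tok_id in tok.encode(prompt):
        # same "id -> 'text'" layout as llama.cpp's tokenize example
        print(f"{tok_id:6d} -> '{tok.decode([tok_id])}'")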

@Noeda
Contributor Author

Noeda commented Dec 5, 2023

Also, for the record, I am on commit 33e171d1e9fc4903f9314b490d77fb8d58331b63 (Sun Dec 3 01:10:43 2023 -0800). There are some later commits as of writing this comment, but they don't look relevant to this particular issue.

@cmp-nct
Contributor

cmp-nct commented Dec 5, 2023

https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee

Try that; it might convert their tokenizer format to HF, if that hasn't already been done.
I checked the pull requests a bit and it looks like Qwen is likely working; possibly a bias parameter is missing (though if that hasn't been added yet, I'm doubtful you'd get any good results from it).

So that script is worth a check. If that doesn't help, try hacking the 3 special tokens in with a text editor, then convert the model again and it might work.

@sorasoras

sorasoras commented Jan 17, 2024

I encounter this problem as well with Qwen-13B models; it appears randomly during the conversation (see the attached screenshot).

@xd2333

xd2333 commented Jan 18, 2024

In my case, [PAD151645] [PAD151644] [PAD151643] show up in the output (Qwen-14B-Chat, Q4_K). (See the attached screenshot.)

@sakura-umi
Contributor

sakura-umi commented Jan 18, 2024

I encounter this problem as well with Qwen-13B models; it appears randomly during the conversation.

In my case, [PAD151645] [PAD151644] [PAD151643] show up in the output (Qwen-14B-Chat, Q4_K).

Your situation may not be the same as the author's. In my case, b1892 reproduces the bug you got, but b1833 doesn't. And the special tokens are correctly treated in both, since the tokenize tool shows the expected behaviour below.

(Screenshot: left for b1833 and right for b1892.)

But the difference between b1833 and b1892 is that in b1892 the special token detokenizes to its text (here '<|im_end|>'), while in b1833 it does not (here it is '').

So I think this is why you got this bug.


Conclusion: b1859 fixes a detokenization bug for non-special added tokens, but the convert script does not treat these special tokens (like 151643, 151644, 151645) as control tokens.

Solution: just make the reverse_vocab dict contain the tokenizer's special tokens, and everything is OK.
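
A rough sketch of that idea, for illustration only (the function name and arguments below are hypothetical, not the actual convert script code): merge the tokenizer's special tokens into reverse_vocab and mark them as CONTROL tokens, so they no longer fall through to the "[PAD{id}]" placeholder.

    # Illustrative sketch: special_tokens would be something like
    # {'<|endoftext|>': 151643, '<|im_start|>': 151644, '<|im_end|>': 151645}.
    import gguf

    def build_token_list(base_vocab: dict, special_tokens: dict, vocab_size: int):
        reverse_vocab = {tok_id: text for text, tok_id in {**base_vocab, **special_tokens}.items()}
        tokens, toktypes = [], []
        for i in range(vocab_size):
            if i not in reverse_vocab:
                tokens.append(f"[PAD{i}]")
                toktypes.append(gguf.TokenType.USER_DEFINED)
            elif reverse_vocab[i] in special_tokens:
                tokens.append(reverse_vocab[i])
                toktypes.append(gguf.TokenType.CONTROL)
            else:
                tokens.append(reverse_vocab[i])
                toktypes.append(gguf.TokenType.NORMAL)
        return tokens, toktypes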

@compilade
Collaborator

compilade commented Jan 22, 2024

in my case, [PAD151645] [PAD151644] [PAD151643] shows in output(Qwen-14B-Chat q4k)

Regarding the [PAD{id}] tokens, I recently fixed this in #5052, but it requires re-converting existing Qwen models.

Solution: just make the reverse_vocab dict contains special tokens of the tokenizer, and everything is ok.

Funny, that's pretty much exactly what I did in the aforementioned pull request.

I can reproduce the original problem (described by @Noeda) when converting Qwen-1.8B-Chat with llama.cpp commit 152d9d0 (tag b1941):

$ # Convert the model (well, only the vocab, since that's the only thing used in the following test)
$ python3 llama.cpp/convert-hf-to-gguf.py --vocab-only --outfile qwen-1.8b-chat-b1941-vocab.gguf Qwen-1_8B-Chat/
...
$ # Test the tokenizer
$ ./llama.cpp/result/bin/tokenize qwen-1.8b-chat-b1941-vocab.gguf "<|im_start|>system"
...
llm_load_print_meta: BOS token        = 151643 '[PAD151643]'
llm_load_print_meta: EOS token        = 151643 '[PAD151643]'
llm_load_print_meta: UNK token        = 151643 '[PAD151643]'
...
    27 -> '<'
    91 -> '|'
   318 -> 'im'
  4906 -> '_start'
    91 -> '|'
    29 -> '>'
  8948 -> 'system'

When I instead convert it with the more recent d6bd4d4 (tag b1942, which is when #5052 was merged), the tokenization works correctly 🎉 :

$ # Convert the model's vocab
$ python3 llama.cpp/convert-hf-to-gguf.py --vocab-only --outfile qwen-1.8b-chat-b1942-vocab.gguf Qwen-1_8B-Chat/
...
$ # Test the tokenizer
$ ./llama.cpp/result/bin/tokenize qwen-1.8b-chat-b1942-vocab.gguf "<|im_start|>system"
...
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151643 '<|endoftext|>'
...
151644 -> ''
  8948 -> 'system'

So, <|im_start|> is now properly understood as token 151644. Also notice that the various [PAD{id}] tokens now have their correct names (in this case <|endoftext|>).

In both cases, I used the same tokenize binary (from an even more recent commit, 6f9939d), so the problem really is inside the previously-badly-converted model files.

I don't know to what extent this affects the existing GGUF-converted Qwen models on HuggingFace, but I think most of them will need to be reconverted with a recent llama.cpp version.

If anyone reading this is unsure whether their Qwen GGUF model(s) needs re-conversion, run the tokenize example on the model and check if <|im_start|> is properly read as a single token (151644) instead of multiple, and also check if the BOS, EOS, and UNK tokens are not called [PAD151643].
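
For example (paths are placeholders; the good and bad outputs are the ones shown earlier in this thread):

    $ ./tokenize ./my-qwen-model.gguf "<|im_start|>system"
    # needs re-conversion: '<', '|', 'im', '_start', ... split into many tokens, BOS/EOS/UNK shown as '[PAD151643]'
    # already fine:        151644 -> ''  followed by  8948 -> 'system', BOS/EOS/UNK shown as '<|endoftext|>'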

@Noeda
Contributor Author

Noeda commented Jan 23, 2024

Regarding the [PAD{id}] tokens, I recently fixed this in #5052, but it requires re-converting existing Qwen models.

I don't know to what extent this affects the existing GGUF-converted Qwen models on HuggingFace, but I think most of them will need to be reconverted with a recent llama.cpp version.

If anyone reading this is unsure whether their Qwen GGUF model(s) needs re-conversion, run the tokenize example on the model and check if <|im_start|> is properly read as a single token (151644) instead of multiple, and also check if the BOS, EOS, and UNK tokens are not called [PAD151643].

Hey, thanks for the fix @compilade. I originally opened this issue not so much because I had a big interest in getting it fixed for myself, but because I thought the community might want to be aware that Qwen models seem to have issues with llama.cpp, and to have a placeholder to discuss it (I guess that worked out 😄).

I still seem to have the original Qwen-72B .gguf files and HuggingFace files on my computer from the time I opened this issue. I'm going to try to rebake the .ggufs with the latest llama.cpp and test the tokenizer, to confirm whether it's all now working for me. I'll comment here soon to verify whether I can reproduce the fix with the original files I used when opening this issue. Stay tuned.

Small edit to clarify: I specifically mean Qwen-72B-Chat for all of that. I can't remember off the top of my head whether the base Qwen-72B also treats these tokens as special (but I would expect so).

@Noeda
Contributor Author

Noeda commented Jan 23, 2024

My llama.cpp commit for the tests below: 011e8ec577fd135cbc02993d3ea9840c516d6a1c

I can confirm that the fixes @compilade mentions are doing their job. I replicated the same steps as in my original post on this issue and created qwen_72b_chat_v2_Q6_K.gguf with the conversion script and then the quantize tool, starting from the HuggingFace Qwen-72B-Chat files.

llm_load_print_meta: general.name     = Qwen
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llama_model_load: vocab only - skipping tensors
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
151644 -> ''
  8948 -> 'system'

I also did a quick and dirty test to check that the model itself is not broken.

~/llama.cpp/build/bin/main --model ./qwen_72b_chat_v2_Q6_K.gguf --prompt "<|im_start|>user\nplease repeat back: apple, orange, juice<|im_end|>"
...

sampling:
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


user\nplease repeat back: apple, orange, juice
="apple, orange, juice"

 [end of text]

llama_print_timings:        load time =    2651.02 ms
llama_print_timings:      sample time =       6.57 ms /    17 runs   (    0.39 ms per token,  2589.49 tokens per second)
llama_print_timings: prompt eval time =     427.73 ms /    13 tokens (   32.90 ms per token,    30.39 tokens per second)
llama_print_timings:        eval time =    1678.20 ms /    16 runs   (  104.89 ms per token,     9.53 tokens per second)
llama_print_timings:       total time =    2119.22 ms /    29 tokens
ggml_metal_free: deallocating
Log end

(The command line tool for generating text doesn't seem to print out <|im_start|> or <|im_end|> but maybe that's a separate issue or intended).

Thank you for the fixes!

@compilade
Collaborator

compilade commented Jan 23, 2024

I originally opened this issue not so much because I had a big interest in getting it fixed for myself, but because I thought the community might want to be aware that Qwen models seem to have issues with llama.cpp, and to have a placeholder to discuss it (I guess that worked out 😄).

You did the right thing, apparently, since it was a real problem :)

user\nplease repeat back: apple, orange, juice

Just a quick note: you can use the -e flag so that a \n in the prompt (passed with -p or --prompt) is given as a newline to the model. Also, there's the --chatml flag which automatically starts an interactive chat using ChatML templates if you don't want to write them manually.
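
For example, with the same model file as in the test above (the second command assumes an interactive chat is wanted):

    # -e turns the "\n" escapes in the prompt into real newlines:
    ~/llama.cpp/build/bin/main --model ./qwen_72b_chat_v2_Q6_K.gguf -e --prompt "<|im_start|>user\nplease repeat back: apple, orange, juice<|im_end|>"

    # or let main wrap the conversation in ChatML tags itself:
    ~/llama.cpp/build/bin/main --model ./qwen_72b_chat_v2_Q6_K.gguf --chatml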

(The command line tool for generating text doesn't seem to print out <|im_start|> or <|im_end|> but maybe that's a separate issue or intended).

It seems intended. The CONTROL tokens do not detokenize to their text. See the comment near the relevant code. It doesn't really explain why, though.

@github-actions github-actions bot added the stale label Mar 24, 2024
Contributor

github-actions bot commented Apr 8, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
