Eval bug: <think> tag with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf #11325
Comments
It's the main feature of the R1 series, not a bug...
I am not sure what the right solution to this problem would be, especially for the llama-server application as opposed to the CLI. On one hand, you do want to be able to view the contents between the thinking tags for two reasons:
Having said that, DeepSeek does recommend removing the content between the thinking tags in multi-turn conversations, as it blows up the context needlessly and makes it harder for the model to focus on the actual answer it gave. If inference software leaves this content in, then clients will have to remove it to support proper multi-turn conversations with these models, which is a lot of duplicate work across all the clients potentially using them. On the other hand, removing it in the inference software prevents you from accessing this information at all. Perhaps functionality to filter it out could sit behind a llama-cli and llama-server switch?
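(For illustration, this is roughly the kind of stripping a client would otherwise have to do itself. A minimal sketch; the helper name and regex are mine, not an existing llama.cpp or client API.)

```python
import re

# Hypothetical client-side helper: strip <think>...</think> blocks from previous
# assistant turns before sending the next request, as DeepSeek recommends for
# multi-turn conversations. Names and regex are illustrative, not an existing API.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(messages):
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_BLOCK.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "<think>Greet the user back.</think>Hello! How can I help?"},
]
# Only the final answers remain in the history that gets sent back to the model.
print(strip_thinking(history))
```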
IMO you will lose performance by removing the thinking process from the conversation. If you keep the old messages as-is, llama.cpp will be able to reuse the KV cache for these tokens, and we only need to process the new tokens. And if you still want to do this:
Also, I should mention that support for jinja is on the horizon and can help achieve this by activating the jinja engine.
I am not sure what would perform best. On one hand, as you mention, it will invalidate the KV cache, forcing prompt re-processing. On the other hand, token generation also slows down with increasing context, so keeping the context shorter (by removing the thinking content) will make generation faster. This might even out in the end, depending on multiple variables.

Furthermore, the thinking content only needs to be removed from the most recent reply, since all previous replies would have already had their thinking content removed (and therefore have already been prompt-processed in their altered form). From my understanding of how the KV cache is implemented (I might be wrong here!), in a multi-turn conversation the cache would still be usable for replies 1 through N-1. Only the last reply N would have to be reprocessed (as well as your own prompt, of course), since that's the reply that gets edited by removing the thinking content. But that reply should be concise once the train of thought has been removed, and quick to process.

Having said that, the performance conversation becomes moot pretty quickly if retaining the thinking content of older replies hurts the model's accuracy in multi-turn conversations. Qwen models tend to completely derail with context shifts, and with how chatty these models are, hitting the 32k context limit is really not difficult at all in multi-turn conversations.

Ideally, functionality to remove the thinking content would be implemented in the client, as you mentioned, but I am not sure how realistic that is. For people using Open WebUI, a solution has thankfully already been implemented: https://www.reddit.com/r/LocalLLaMA/comments/1i6b65q/better_r1_experience_in_open_webui/ I would also love to give these models a try in Aider and Cline, but right now they aren't useful due to this limitation.

By the way, what is this jinja you mentioned? Can I read up on these plans somewhere? Curious to see how this fits in with these models.
I appreciate the feature; I just think the representation should be different.
What you're asking for are "new features" and not "bugs".
I haven't been able to test this myself, but apparently the API responses from R1 have "content" and "reasoning_content" fields in the body that is returned. Maybe llama-server could implement something similar so as to be compatible with clients that rely on this R1-specific format? Just thinking out loud here.
@Mushoz yes, it can be a good idea. You can make a PR or a dedicated issue if you like.
I found the documentation that mentions most of the things discussed here: https://api-docs.deepseek.com/guides/reasoning_model

How should this content and reasoning_content separation work in streaming mode, though? I have not used the DeepSeek API myself, so I am not sure how it makes this separation when the LLM's reply is streamed to the client instead of being returned in full at the end.

I haven't used this application myself yet, but using a proxy might also be a helpful stopgap solution for now: https://github.com/bold84/cot_proxy
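For the streaming question: according to the DeepSeek docs linked above, the reasoning arrives in a separate reasoning_content field on each streamed delta, before the regular content starts. A rough sketch with the openai-python client, assuming the field and model names from those docs (this describes the official R1 API, not current llama-server behaviour):

```python
from openai import OpenAI

# Sketch only: base_url, model name, and reasoning_content follow the
# DeepSeek docs linked above; llama-server does not expose this field here.
client = OpenAI(api_key="<your key>", base_url="https://api.deepseek.com")

stream = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

thoughts, answer = [], []
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Reasoning tokens stream in reasoning_content, the final reply in content.
    if getattr(delta, "reasoning_content", None):
        thoughts.append(delta.reasoning_content)
    elif delta.content:
        answer.append(delta.content)

print("thoughts:", "".join(thoughts))
print("answer:", "".join(answer))
```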
@ngxson Should adding `--jinja` take care of this? Looking at the template here: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/main/tokenizer_config.json I can see that the chat template strips the thinking content from previous assistant messages. But setting the `--jinja` flag still left the thinking tags and their content in the output.
I don't know, you can try and tell us. The jinja template is stored inside the gguf and should be picked up with `--jinja`.
I have already tried that, but as I mentioned, the thinking tags and their content were still there. Just wanted to make sure I wasn't misunderstanding anything.
@ngxson Do you believe it to be a bug that the tags are still there after adding the `--jinja` flag?
I'm not yet confident in how to handle this, as there are so few reasoning models out there! But you could imagine having a tokenizer representation for "start of reasoning" and "end of reasoning" tokens, much like llama.cpp already has a concept of "end of generation" tokens. When converting a model, it could encode the `<think>` and `</think>` tags as those tokens. With that, it could abstract the reasoning portion of the response and either include it as standard output or separate it out based on a configuration flag. I'm not sure the reasoning tokens are encoded as special today, presumably so application layers can detect and handle them differently (?), but a more structured abstraction would be ideal. (Aside: has anyone actually seen the leading `<think>` tag in the output?)
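To make that concrete, here is a purely hypothetical sketch of the proposed abstraction (none of these names exist in llama.cpp; the string markers stand in for start/end-of-reasoning tokens):

```python
from dataclasses import dataclass
from typing import Optional

# Purely hypothetical sketch of the abstraction described above; llama.cpp has no
# such API. The string markers stand in for "start/end of reasoning" tokens.
REASONING_START, REASONING_END = "<think>", "</think>"

@dataclass
class ChatOutput:
    content: str
    reasoning_content: Optional[str] = None

def split_reasoning(raw: str, separate: bool) -> ChatOutput:
    start = raw.find(REASONING_START)
    end = raw.find(REASONING_END)
    if not separate or start == -1 or end == -1:
        # Flag off, or no reasoning markers found: pass the text through as-is.
        return ChatOutput(content=raw)
    reasoning = raw[start + len(REASONING_START):end]
    answer = raw[end + len(REASONING_END):]
    return ChatOutput(content=answer.strip(), reasoning_content=reasoning.strip())

print(split_reasoning("<think>greet back</think>Hello!", separate=True))
```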
I think it is now solved with the latest version; it says "Thinking" and is collapsed.
It's not a feature if it cannot be turned off (I mean turning off the display of the think tag in the output if one chooses to do so). At least not a good 'feature'.
Quick note that with #9639 and its chat format handlers we could easily skip a model's thoughts in its custom handler (although I do love being able to inspect said thoughts). I wonder if it would be worth returning the thoughts as a different field in the response, maybe? (Assuming that doesn't break OAI python/js clients.) Only if the client asks for some introspection?
@ochafik Filtering out the thinking part from the API output would be a nice feature (so we won't have to strip it manually) - and that's how both the o1 and R1 models work when called via the API. I am not sure about o1, but in the case of R1, thoughts are accessible via the reasoning_content field - accessible with the standard openai-python client, see their example. The OpenAI API also has a separate parameter for max thinking tokens (max_completion_tokens) - so that's something to consider as well.
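The example being referred to boils down to something like this (a sketch based on the DeepSeek docs linked earlier; base_url, model name, and the reasoning_content field are as those docs describe the official R1 API, not something llama-server returns today):

```python
from openai import OpenAI

# Sketch based on the DeepSeek docs linked earlier; llama-server does not
# return reasoning_content at the time of this discussion.
client = OpenAI(api_key="<your key>", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Hello"}],
)

message = response.choices[0].message
print("thoughts:", getattr(message, "reasoning_content", None))  # chain of thought
print("answer:", message.content)                                # final reply only
```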
The feature already works well in the newest llama.cpp: I can see "Thinking" collapsed, and I can open it to see what is going on. I would not want the thinking to disappear, as it is useful to see what is inside.
The UI part could stay the same as it is - but on the API side it would be better if it worked like the official R1 and OpenAI APIs (with the thinking part of the output delivered separately).
Drafted some implementation (returning thoughts in...). If anyone has a DeepSeek API key and can report back on which field they return thoughts in, that would be helpful! (They don't seem to accept new accounts.) Also related, added...
FYI: ...
Thanks @aotsukiqx! FYI, #11607 should make it all work (w/ ...).
Name and Version
I see the `<think>` tag, which I should not see:
with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf
Operating systems
Linux
GGML backends
CUDA
Hardware
i5 + GTX 1050 Ti 4 GB
Models
DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf
Problem description & steps to reproduce
<think>Okay, the user just sent "Hello". I should respond in a friendly and welcoming manner. Maybe say hello back and ask how I can assist them today. Keep it simple and open-ended so they can express what they need help with.</think>Hello! How can I assist you today?
First Bad Commit
No response
Relevant log output