Eval bug: <think> tag with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf #11325

Open
gnusupport opened this issue Jan 21, 2025 · 25 comments
Comments

@gnusupport

Name and Version

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
version: 159 (80d0d6b)
built with Debian clang version 14.0.6 for x86_64-pc-linux-gnu

I see a tag that I should not see:

<think>
Okay, the user just sent "Hello". I should respond in a friendly and welcoming manner. Maybe say hello back and ask how I can assist them today. Keep it simple and open-ended so they can express what they need help with.
</think>

Hello! How can I assist you today?

with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf

Operating systems

Linux

GGML backends

CUDA

Hardware

i5 + GTX 1050 Ti 4 GB

Models

DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf

Problem description & steps to reproduce

<think>
Okay, the user just sent "Hello". I should respond in a friendly and welcoming manner. Maybe say hello back and ask how I can assist them today. Keep it simple and open-ended so they can express what they need help with.
</think>

Hello! How can I assist you today?

First Bad Commit

No response

Relevant log output

none
@ngxson
Collaborator

ngxson commented Jan 21, 2025

It's the main feature of the R1 series, not a bug...

@Mushoz

Mushoz commented Jan 21, 2025

I am not sure what the right solution to this problem would be, especially for the llama-server application as opposed to the CLI. On one hand, you do want to be able to view the contents between the thinking tags, for two reasons:

  1. It's interesting to see how a model got to its answer, and it wouldn't be the first time it had the correct solution at one point in its train of thought but then settled on the wrong one. So the content in between the tags is definitely interesting to have access to.
  2. If the thinking process is going off the rails, you are able to quit the generation midway when running it in streaming mode.

Having said that, Deepseek does recommend removing the content in between the thinking tags when dealing with multi-turn conversations as it blows up the context needlessly and makes it more difficult for the model to focus on the actual answer it ended up giving. If inference software leaves this content in, then clients will have to remove it to support proper multi-turn conversations with these models. That's a LOT of duplicate work for all the clients potentially using these models. On the other hand, removing them in the inference software prevents you from accessing this information at all.
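
For illustration, a minimal client-side sketch of that stripping step might look like the following (hypothetical helper, assuming an OpenAI-style chat-completions payload; not tied to any particular client):

import re

# Matches a <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thoughts(messages):
    """Return a copy of the chat history with <think>...</think> blocks
    removed from assistant turns, as DeepSeek recommends for multi-turn use."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"]).strip()}
        cleaned.append(msg)
    return cleaned

# Example: strip thoughts from earlier replies before sending the next turn.
history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant",
     "content": "<think>The user greeted me...</think>\nHello! How can I assist you today?"},
    {"role": "user", "content": "And how are you?"},
]
payload = {"model": "DeepSeek-R1-Distill-Qwen-1.5B", "messages": strip_thoughts(history)}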

Perhaps functionality to filter this out could be behind a llama-cli and llama-server switch?

@ngxson
Collaborator

ngxson commented Jan 21, 2025

IMO you will get worse performance by removing the thinking process from the conversation. If you keep the old message as-is, llama.cpp will be able to reuse the KV cache for these tokens, and only the new tokens need to be processed.

And if you still want to do it:

  • for the server, it should be up to the client to remove it
  • for llama-cli, it's almost impossible for now because it's designed to add messages, not to go back and modify old messages

@ngxson
Collaborator

ngxson commented Jan 21, 2025

Also I should mention that support for jinja is on the horizon, which can help achieve this by activating the jinja engine

@Mushoz

Mushoz commented Jan 21, 2025

I am not sure what would perform best. On one hand, as you mention, it will invalidate the KV cache, forcing prompt re-processing. On the other hand, token generation also slows down with increasing context, so keeping the context shorter (by removing the thinking content) will make generation faster. This might even out in the end, depending on multiple variables.

Furthermore, the thinking contents are only removed from the most recent reply from the model, since all previous replies would already have had their thinking content removed (and therefore have already been prompt-processed in their altered form). From my understanding of how the KV cache is implemented (I might be wrong here!), in a multi-turn conversation the cache would still be usable for all replies 1 through N-1. Only the last reply N will have to be reprocessed (as well as your own prompt, of course), since that's the reply that will be edited by removing the thinking content. But that reply should be concise, since the train of thought has been removed, and quick to process.

Having said that, the performance conversation can become moot pretty quickly if it causes the model to have poor accuracy because the thinking content is retained for older replies in multi-turn conversations. Qwen models tend to completely derail with context shifts, and with how chatty these models are, hitting the 32k context limit is really not difficult at all in multi-turn conversations. Ideally, the functionality to remove the thinking content is implemented in the client as you mentioned, but I am not sure how realistic that is going to be.

For people using Open WebUI, a solution has thankfully already been implemented: https://www.reddit.com/r/LocalLLaMA/comments/1i6b65q/better_r1_experience_in_open_webui/

I would also love to give these models a try in Aider and Cline, but right now they aren't useful due to this limitation.

By the way, what is this Jinja you mentioned? Can I read up on these plans somewhere? Curious to see how this fits in with these models.

@ngxson
Collaborator

ngxson commented Jan 21, 2025

By the way, what is this Jinja you mentioned? Can I read up on these plans somewhere? Curious to see how this fits in with these models.

#11016

@gnusupport
Author

It's the main feature of the R1 series, not a bug...

I appreciate the feature, I just think the representation should be different.

  • I don't think the user is supposed to see the tags
  • I think the "thinking" process could somehow be separated from the actual response
  • The copy function for the actual response should copy the response only
  • There could be a separate copy function for the "thinking" part
  • The "thinking" part could be collapsed and not necessarily visible

@ngxson
Collaborator

ngxson commented Jan 21, 2025

What you're asking for are "new features", not "bugs".

@Mushoz

Mushoz commented Jan 21, 2025

I haven't been able to test this myself, but apparently the API responses from R1 have "content" and "reasoning_content" in the body that is returned. Maybe llama-server could implement something similar, so as to be compatible with clients that rely on this R1-specific format? Just thinking out loud here.

@ngxson
Collaborator

ngxson commented Jan 21, 2025

@Mushoz yes, that could be a good idea; you can open a PR or a dedicated issue if you like

@Mushoz

Mushoz commented Jan 21, 2025

I found the documentation that mentions most of the things discussed here: https://api-docs.deepseek.com/guides/reasoning_model

How should this content and reasoning_content separation work in streaming mode, though? I have not used the DeepSeek API myself, so I am not sure how it makes this separation when the LLM's reply is streamed to the client instead of being returned in full at the end.
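
For what it's worth, the streaming example in those docs suggests the deltas expose reasoning_content alongside content, so a client can route the two as they arrive. A rough, untested sketch with the standard openai-python client (base URL and model name follow the docs; the key is a placeholder):

from openai import OpenAI

client = OpenAI(api_key="<DeepSeek API Key>", base_url="https://api.deepseek.com")

stream = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

reasoning, answer = [], []
for chunk in stream:
    delta = chunk.choices[0].delta
    # reasoning_content arrives first; content follows once the final answer starts.
    if getattr(delta, "reasoning_content", None):
        reasoning.append(delta.reasoning_content)
    elif delta.content:
        answer.append(delta.content)

print("thoughts:", "".join(reasoning))
print("answer:  ", "".join(answer))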

I haven't used this application myself yet, but using a proxy might also be a helpful stopgap solution for now: https://github.com/bold84/cot_proxy

@Mushoz

Mushoz commented Jan 22, 2025

@ngxson Should adding --jinja get rid of the content in the thinking tags? From llama-server's documentation, the default template should be the one taken from the model's metadata.

Looking at the template here: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/main/tokenizer_config.json

I can see: {% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}

But setting the --jinja switch is not getting rid of the tags in the replies I am getting through Open WebUI. Furthermore, using llama-server's built-in GUI I also still see the thinking tags. Am I misunderstanding how things should be working now?

@ngxson
Collaborator

ngxson commented Jan 22, 2025

Should adding --jinja get rid of the content in the thinking tags?

I don't know, you can try and tell us.

The jinja template is stored inside the GGUF and should be picked up with --jinja; we just need to see whether it's compatible with our minja engine or not.

@Mushoz

Mushoz commented Jan 22, 2025

I have already tried that, but as I mentioned, the thinking tags and their content were still there. Just wanted to make sure I wasn't misunderstanding anything.

@Mushoz

Mushoz commented Jan 22, 2025

@ngxson Do you believe it to be a bug that the tags are still there after adding the --jinja flag? Or does that simply need the follow-up PRs to be merged first? Just curious whether I should file a bug report about it or not.

@brittlewis12

I'm not yet confident in how to handle this, as there are so few reasoning models out there!

But you could imagine having a tokenizer representation for "start of reasoning" and "end of reasoning" tokens, much like llama.cpp already has a concept of "end of generation" tokens. When converting a model, it could encode the <think> and </think> tags as start and end reasoning tokens respectively.

With that, it could abstract the reasoning portion of the response and either include it as standard output or separate it based on a configuration flag.

I'm not sure the reasoning tokens are encoded as special today, presumably so application layers can detect and handle them differently (?), but a more structured abstraction would be ideal.

(Aside: has anyone actually seen the leading <think> token in the wild? I've only seen the closing tag... but I've only used the distill models locally so far.)

@gnusupport
Author

I think it is now solved in the latest version; it says "Thinking" and is collapsed.

@francogrex

francogrex commented Jan 28, 2025

It's the main feature of the R1 series, not a bug...

It's not a feature if it cannot be turned off (I mean turning off the display of the think tag in the output if one chooses to do so). At least not a good 'feature'.

@ochafik
Collaborator

ochafik commented Jan 29, 2025

Quick note that with #9639 and its chat format handlers we could easily skip a model's thoughts in its custom handler (although I do love being able to inspect said thoughts).

I wonder if it would be worth returning the thoughts as a different field in the response, maybe? (Assuming that doesn't break the OAI python/js clients.) Only if the client asks for some introspection?

@MoonRide303
Contributor

@ochafik Filtering out the thinking part from the API output would be a nice feature (so we won't have to strip it manually) - and that's how both o1 and R1 models work when called via the API. I am not sure about o1, but in the case of R1, thoughts are accessible via the reasoning_content field - accessible with the standard openai-python client, see their example.

The OpenAI API also has a separate parameter for max thinking tokens (max_completion_tokens) - so that's something to consider as well.
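
For reference, a minimal non-streaming sketch of reading that field with the openai-python client (the key is a placeholder; the pattern follows the openai-python example in DeepSeek's reasoning-model docs):

from openai import OpenAI

client = OpenAI(api_key="<DeepSeek API Key>", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Hello!"}],
)

message = response.choices[0].message
# reasoning_content holds the chain of thought; content holds the final answer.
print("thoughts:", getattr(message, "reasoning_content", None))
print("answer:  ", message.content)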

@gnusupport
Author

@ochafik Filtering out the thinking part from the API output would be a nice feature (so we won't have to strip it manually) - and that's how both o1 and R1 models work when called via the API. I am not sure about o1, but in the case of R1, thoughts are accessible via the reasoning_content field - accessible with the standard openai-python client, see their example.

The feature already works well in the newest llama.cpp, so I can see "Thinking" collapsed, and I can open it to see what is going on. I would not like Thinking to disappear, as it is useful to see what is inside.

@MoonRide303
Contributor

@ochafik Filtering out the thinking part from the API output would be a nice feature (so we won't have to strip it manually) - and that's how both o1 and R1 models work when called via the API. I am not sure about o1, but in the case of R1, thoughts are accessible via the reasoning_content field - accessible with the standard openai-python client, see their example.

The feature already works well in the newest llama.cpp, so I can see "Thinking" collapsed, and I can open it to see what is going on. I would not like Thinking to disappear, as it is useful to see what is inside.

The UI part could stay the same as it is - but on the API side it would be better if it worked like the official R1 and OpenAI APIs (with the thinking part of the output delivered separately).

@ochafik
Collaborator

ochafik commented Feb 3, 2025

Quick note that with #9639 and its chat format handlers we could easily skip a model's thoughts in its custom handler (although I do love being able to inspect said thoughts).

I wonder if it would be worth returning the thoughts as a different field in the response, maybe? (Assuming that doesn't break the OAI python/js clients.) Only if the client asks for some introspection?

Drafted some implementation (returning thoughts in... a message.thoughts field 🤷‍♂️) here: #11607 (still no support for streaming, though)

If anyone has a DeepSeek API key and can report back on which field they return thoughts in, that would be helpful! (they don't seem to accept new accounts)

Also related: added tool_plan output to Command R7B (#11585); we might want to unify the two (although I'd rather wait and see what upcoming thinking models do)

@aotsukiqx

aotsukiqx commented Feb 6, 2025

Quick note that with #9639 and its chat format handlers we could easily skip a model's thoughts in its custom handler (although I do love being able to inspect said thoughts).

I wonder if it would be worth returning the thoughts as a different field in the response, maybe? (Assuming that doesn't break the OAI python/js clients.) Only if the client asks for some introspection?

Drafted some implementation (returning thoughts in... a message.thoughts field 🤷‍♂️) here: #11607 (still no support for streaming, though)

If anyone has a DeepSeek API key and can report back on which field they return thoughts in, that would be helpful! (they don't seem to accept new accounts)

Also related: added tool_plan output to Command R7B (#11585); we might want to unify the two (although I'd rather wait and see what upcoming thinking models do)

FYI:

$ curl https://api.deepseek.com/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer " \
    -d '{
      "model": "deepseek-reasoner",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
      ],
      "stream": false
    }'

{
  "id": "aad1fd87-336f-4e32-81a2-641e1f5a7a07",
  "object": "chat.completion",
  "created": 1738809094,
  "model": "deepseek-reasoner",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?",
        "reasoning_content": "Okay, the user just said \"Hello!\" I should respond in a friendly and welcoming manner. Let me make sure to keep it open-ended so they feel comfortable asking for help. Maybe something like, \"Hello! How can I assist you today?\" That sounds good. It's simple and inviting."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 71,
    "total_tokens": 84,
    "prompt_tokens_details": {"cached_tokens": 0},
    "completion_tokens_details": {"reasoning_tokens": 60},
    "prompt_cache_hit_tokens": 0,
    "prompt_cache_miss_tokens": 13
  },
  "system_fingerprint": "fp_7e73fd9a08"
}

@ochafik
Collaborator

ochafik commented Feb 6, 2025

Thanks @aotsukiqx!

FYI, #11607 should make it all work (with a --think flag to return thoughts in the reasoning_content field); testing/feedback welcome (build & test instructions are included in the PR)
