Eval bug: <think> tag with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf #11325
Comments
It's the main feature of the R1 series, not a bug...
I am not sure what the right solution to this problem would be, especially for the llama-server application as opposed to the CLI. On one hand, you do want to be able to view the contents between the thinking tags for two reasons:
Having said that, DeepSeek does recommend removing the content between the thinking tags in multi-turn conversations, as it blows up the context needlessly and makes it harder for the model to focus on the actual answer it gave. If inference software leaves this content in, then clients will have to remove it to support proper multi-turn conversations with these models, which is a lot of duplicate work across all the clients potentially using them. On the other hand, removing it in the inference software prevents you from accessing this information at all. Perhaps functionality to filter it out could sit behind a llama-cli and llama-server switch?
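(For illustration, this is roughly the kind of stripping a client would otherwise have to do itself. A minimal sketch; the helper name and regex are mine, not an existing llama.cpp or client API.)

```python
import re

# Hypothetical client-side helper: strip <think>...</think> blocks from previous
# assistant turns before sending the next request, as DeepSeek recommends for
# multi-turn conversations. Names and regex are illustrative, not an existing API.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(messages):
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            msg = {**msg, "content": THINK_BLOCK.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "<think>Greet the user back.</think>Hello! How can I help?"},
]
# Only the final answers remain in the history that gets sent back to the model.
print(strip_thinking(history))
```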
IMO you will lose performance by removing the thinking process from the conversation. If you keep the old messages as-is, llama.cpp will be able to reuse the KV cache for these tokens, and we only need to process the new tokens. And if you still want to do this:
Also, I should mention that support for jinja is on the horizon and can help achieve this by activating the jinja engine.
I am not sure what would perform best. On one hand, as you mention, it will invalidate the KV cache, forcing prompt re-processing. On the other hand, token generation also slows down with increasing context, so keeping the context shorter (by removing the thinking content) will make generation faster. This might even out in the end, depending on multiple variables.

Furthermore, the thinking content only needs to be removed from the most recent reply, since all previous replies would have already had their thinking content removed (and therefore have already been prompt-processed in their altered form). From my understanding of how the KV cache is implemented (I might be wrong here!), in a multi-turn conversation the cache would still be usable for replies 1 through N-1. Only the last reply N would have to be reprocessed (as well as your own prompt, of course), since that's the reply that gets edited by removing the thinking content. But that reply should be concise once the train of thought has been removed, and quick to process.

Having said that, the performance conversation becomes moot pretty quickly if retaining the thinking content of older replies hurts the model's accuracy in multi-turn conversations. Qwen models tend to completely derail with context shifts, and with how chatty these models are, hitting the 32k context limit is really not difficult at all in multi-turn conversations.

Ideally, functionality to remove the thinking content would be implemented in the client, as you mentioned, but I am not sure how realistic that is. For people using Open WebUI, a solution has thankfully already been implemented: https://www.reddit.com/r/LocalLLaMA/comments/1i6b65q/better_r1_experience_in_open_webui/ I would also love to give these models a try in Aider and Cline, but right now they aren't useful due to this limitation.

By the way, what is this jinja you mentioned? Can I read up on these plans somewhere? Curious to see how this fits in with these models.
I appreciate the feature; I just think the representation should be different.
What you're asking for are "new features" and not "bugs".
I haven't been able to test this myself, but apparently the API responses from R1 have "content" and "reasoning_content" fields in the body that is returned. Maybe llama-server could implement something similar so as to be compatible with clients that rely on this R1-specific format? Just thinking out loud here.
@Mushoz yes, it can be a good idea. You can make a PR or a dedicated issue if you like.
I found the documentation that mentions most of the things discussed here: https://api-docs.deepseek.com/guides/reasoning_model

How should this content and reasoning_content separation work in streaming mode, though? I have not used the DeepSeek API myself, so I am not sure how it makes this separation when the LLM's reply is streamed to the client instead of being returned in full at the end.

I haven't used this application myself yet, but using a proxy might also be a helpful stopgap solution for now: https://github.com/bold84/cot_proxy
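For the streaming question: according to the DeepSeek docs linked above, the reasoning arrives in a separate reasoning_content field on each streamed delta, before the regular content starts. A rough sketch with the openai-python client, assuming the field and model names from those docs (this describes the official R1 API, not current llama-server behaviour):

```python
from openai import OpenAI

# Sketch only: base_url, model name, and reasoning_content follow the
# DeepSeek docs linked above; llama-server does not expose this field here.
client = OpenAI(api_key="<your key>", base_url="https://api.deepseek.com")

stream = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

thoughts, answer = [], []
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Reasoning tokens stream in reasoning_content, the final reply in content.
    if getattr(delta, "reasoning_content", None):
        thoughts.append(delta.reasoning_content)
    elif delta.content:
        answer.append(delta.content)

print("thoughts:", "".join(thoughts))
print("answer:", "".join(answer))
```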
@ngxson Should adding `--jinja` take care of this? Looking at the template here: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/main/tokenizer_config.json I can see that the chat template strips the thinking content from previous assistant messages. But setting the `--jinja` flag still left the thinking tags and their content in the output.
I don't know, you can try and tell us. The jinja template is stored inside the gguf and should be picked up with `--jinja`.
I have already tried that, but as I mentioned, the thinking tags and their content were still there. Just wanted to make sure I wasn't misunderstanding anything.
@ngxson Do you believe it to be a bug that the tags are still there after adding the `--jinja` flag?
I'm not yet confident in how to handle this, as there are so few reasoning models out there! But you could imagine having a tokenizer representation for "start of reasoning" and "end of reasoning" tokens, much like llama.cpp already has a concept of "end of generation" tokens. When converting a model, it could encode the `<think>` and `</think>` tags as those tokens. With that, it could abstract the reasoning portion of the response and either include it as standard output or separate it out based on a configuration flag. I'm not sure the reasoning tokens are encoded as special today, presumably so application layers can detect and handle them differently (?), but a more structured abstraction would be ideal. (Aside: has anyone actually seen the leading `<think>` tag in the output?)
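To make that concrete, here is a purely hypothetical sketch of the proposed abstraction (none of these names exist in llama.cpp; the string markers stand in for start/end-of-reasoning tokens):

```python
from dataclasses import dataclass
from typing import Optional

# Purely hypothetical sketch of the abstraction described above; llama.cpp has no
# such API. The string markers stand in for "start/end of reasoning" tokens.
REASONING_START, REASONING_END = "<think>", "</think>"

@dataclass
class ChatOutput:
    content: str
    reasoning_content: Optional[str] = None

def split_reasoning(raw: str, separate: bool) -> ChatOutput:
    start = raw.find(REASONING_START)
    end = raw.find(REASONING_END)
    if not separate or start == -1 or end == -1:
        # Flag off, or no reasoning markers found: pass the text through as-is.
        return ChatOutput(content=raw)
    reasoning = raw[start + len(REASONING_START):end]
    answer = raw[end + len(REASONING_END):]
    return ChatOutput(content=answer.strip(), reasoning_content=reasoning.strip())

print(split_reasoning("<think>greet back</think>Hello!", separate=True))
```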
I think it is now solved with the latest version; it says "Thinking" and is collapsed.
It's not a feature if it cannot be turned off (I mean turning off the display of the think tag in the output if one chooses to do so). At least not a good 'feature'.
Quick note that with #9639 and its chat format handlers we could easily skip a model's thoughts in its custom handler (although I do love being able to inspect said thoughts). I wonder if it would be worth returning the thoughts as a different field in the response, maybe? (Assuming that doesn't break OAI python/js clients.) Only if the client asks for some introspection?
@ochafik Filtering out the thinking part from the API output would be a nice feature (so we won't have to strip it manually) - and that's how both the o1 and R1 models work when called via the API. I am not sure about o1, but in the case of R1, thoughts are accessible via the reasoning_content field - accessible with the standard openai-python client, see their example. The OpenAI API also has a separate parameter for max thinking tokens (max_completion_tokens) - so that's something to consider as well.
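The example being referred to boils down to something like this (a sketch based on the DeepSeek docs linked earlier; base_url, model name, and the reasoning_content field are as those docs describe the official R1 API, not something llama-server returns today):

```python
from openai import OpenAI

# Sketch based on the DeepSeek docs linked earlier; llama-server does not
# return reasoning_content at the time of this discussion.
client = OpenAI(api_key="<your key>", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Hello"}],
)

message = response.choices[0].message
print("thoughts:", getattr(message, "reasoning_content", None))  # chain of thought
print("answer:", message.content)                                # final reply only
```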
The feature already works well in the newest llama.cpp: I can see "Thinking" collapsed, and I can open it to see what is going on. I would not want the thinking to disappear, as it is useful to see what is inside.
The UI part could stay the same as it is - but on the API side it would be better if it worked like the official R1 and OpenAI APIs (with the thinking part of the output delivered separately).
Drafted some implementation (returning thoughts in...). If anyone has a DeepSeek API key and can report back on which field they return thoughts in, that would be helpful! (They don't seem to accept new accounts.) Also related, added...
FYI: ...
Thanks @aotsukiqx! FYI, #11607 should make it all work (w/ ...).
Name and Version
I see the `<think>` tag, which I should not see:
with DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf
Operating systems
Linux
GGML backends
CUDA
Hardware
i5 + GTX 1050 Ti 4 GB
Models
DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf
Problem description & steps to reproduce
<think>Okay, the user just sent "Hello". I should respond in a friendly and welcoming manner. Maybe say hello back and ask how I can assist them today. Keep it simple and open-ended so they can express what they need help with.</think>Hello! How can I assist you today?
First Bad Commit
No response
Relevant log output