Misc. bug: "response_format" on the OpenAI compatible "v1/chat/completions" issue #11847
Can you try this without using the --jinja flag?
Without the --jinja flag, a request without tools works fine.

Request:

curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'

Response:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}",
"tool_calls": null,
"role": "assistant"
}
}
],
"created": 1739497907,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4689-90e4dba4",
"object": "chat.completion",
"usage": {
"completion_tokens": 17,
"prompt_tokens": 10,
"total_tokens": 27
},
"id": "chatcmpl-NSUouLxi9bvdjdnqwh7o1DLc4UmoCzsV",
"timings": {
"prompt_n": 1,
"prompt_ms": 624.063,
"prompt_per_token_ms": 624.063,
"prompt_per_second": 1.6024023215604835,
"predicted_n": 17,
"predicted_ms": 10365.442,
"predicted_per_token_ms": 609.7318823529412,
"predicted_per_second": 1.6400651318101052
}
}

However, without the --jinja flag, a request that includes tools fails:

Request:

curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
},
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
]
}'

Response:

{
"error": {
"code": 500,
"message": "tools param requires --jinja flag",
"type": "server_error"
}
}

Is there any way to use both features with the OpenAI compatible chat completion API?
I think this might be a bug and I'm looking into it. If we look at how this request is processed on the server, we can start with this handler:

const auto handle_chat_completions = [&ctx_server, &params, &res_error, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
LOG_DBG("request: %s\n", req.body.c_str());
if (ctx_server.params_base.embedding) {
res_error(res, format_error_response("This server does not support completions. Start it without `--embeddings`", ERROR_TYPE_NOT_SUPPORTED));
return;
}
auto body = json::parse(req.body);
json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);
return handle_completions_impl(
SERVER_TASK_TYPE_COMPLETION,
data,
req.is_connection_closed,
res,
OAICOMPAT_TYPE_CHAT);
};

We can inspect the body from the request:

(gdb) pjson body
{
"model": "llama-2-7b-chat",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}

This looks good and there is no grammar field yet. Next we have the call to:

json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);

And if we inspect the data after this call we do see a grammar field:

(gdb) pjson data | shell jq
{
"stop": [],
"json_schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
},
"chat_format": 1,
"prompt": "<|im_start|>system\nRespond in JSON format, either with `tool_call` (a request to call tools) or with `response` reply to the user's request<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n",
"grammar": "alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \nalternative-0-tool-call-kv ::= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-1 ::= \"{\" space alternative-1-response-kv \"}\" space\nalternative-1-response ::= \"{\" space alternative-1-response-response-kv \"}\" space\nalternative-1-response-kv ::= \"\\\"response\\\"\" space \":\" space alternative-1-response\nalternative-1-response-response-kv ::= \"\\\"response\\\"\" space \":\" space string\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nroot ::= alternative-0 | alternative-1\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
"grammar_lazy": false,
"grammar_triggers": [],
"preserved_tokens": [],
"model": "llama-2-7b-chat",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "chat_response",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}
}

If we look in oaicompat_completion_params_parse we can see where this grammar is added:

// Apply chat template to the list of messages
if (use_jinja) {
...
// TODO: support mixing schema w/ tools beyond generic format.
inputs.json_schema = json_value(llama_params, "json_schema", json());
auto chat_params = common_chat_params_init(tmpl, inputs);
llama_params["chat_format"] = static_cast<int>(chat_params.format);
llama_params["prompt"] = chat_params.prompt;
llama_params["grammar"] = chat_params.grammar;
llama_params["grammar_lazy"] = chat_params.grammar_lazy;
auto grammar_triggers = json::array();
for (const auto & trigger : chat_params.grammar_triggers) {
grammar_triggers.push_back({
{"word", trigger.word},
{"at_start", trigger.at_start},
});
}
llama_params["grammar_triggers"] = grammar_triggers; And if we inspect the (gdb) p chat_params.grammar
$2 = "alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \nalternative-0-tool-call-kv ::= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-1 ::= \""... Perhaps the grammar should be conditioned on the json_schema: if (inputs.json_schema == nullptr) {
llama_params["grammar"] = chat_params.grammar;
llama_params["grammar_lazy"] = chat_params.grammar_lazy;
auto grammar_triggers = json::array();
for (const auto & trigger : chat_params.grammar_triggers) {
grammar_triggers.push_back({
{"word", trigger.word},
{"at_start", trigger.at_start},
});
}
llama_params["grammar_triggers"] = grammar_triggers;
}

I haven't gone through this code before so I'm unsure if this is the correct thing to do, but I'll open a PR with this suggestion and perhaps others can weigh in on it.
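For context, here is a minimal sketch (my own illustration under assumptions, not the actual llama.cpp server code) of the kind of check that produces the reported error once the parsed parameters end up containing both a json_schema and a grammar field:

// Hypothetical validation sketch; the function name and shape are assumptions,
// only the error message is taken from the issue report.
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>

using json = nlohmann::json;

static void check_schema_grammar_conflict(const json & params) {
    // true when response_format produced a non-empty "json_schema"
    const bool has_schema  = params.contains("json_schema") && !params.at("json_schema").empty();
    // true when the chat-template handling produced a non-empty "grammar" string
    const bool has_grammar = params.contains("grammar") && !params.at("grammar").get<std::string>().empty();
    if (has_schema && has_grammar) {
        throw std::runtime_error(
            R"(Either "json_schema" or "grammar" can be specified, but not both)");
    }
}

With the data shown above, both conditions would be true, which matches the error the original request runs into.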
This commit adds a condition to check if the json_schema is null before adding the grammar and grammar_triggers to the llama_params. The motivation for this is to prevent the server from throwing an exception, as the request would otherwise end up with both a json_schema and a grammar field. Resolves: ggerganov#11847
@danbev I tried building llama.cpp from your branch locally and tested it, but it seems that now either the tools or the response format is ignored by the model if we use the --jinja flag.

I am using the same https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B model for all of these, and the server is running with this command (except for one sample):

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa

If I add both tools and response_format:
@tulang3587 Sorry, but I think I might have misled you, and the "fix" I proposed above does not seem to be correct.
Not related to this issue, but I noticed that there is a specific template for the model you are using which might be useful.
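A rough sketch of how such a template could be passed to the server (the template file name below is a placeholder, not the actual file):

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa \
  --chat-template-file models/templates/<tool-use-template>.jinja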
@danbev I see the PR has been merged to master and I just tested the latest build (https://github.com/ggml-org/llama.cpp/releases/tag/b4739). It looks good so far, let me close this issue.
If a tool is needed, the model returns the tool_calls as expected.

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello, can you tell me the current weather in New York?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'

Response:

{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"New York, USA\"}"
},
"id": ""
}
]
}
}
],
"created": 1739930518,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 36,
"prompt_tokens": 236,
"total_tokens": 272
},
"id": "chatcmpl-TE1CyOCGjIUhugis6urAnjhkEPdMTacw",
"timings": {
"prompt_n": 17,
"prompt_ms": 2903.262,
"prompt_per_token_ms": 170.78011764705883,
"prompt_per_second": 5.855482557206342,
"predicted_n": 36,
"predicted_ms": 16253.202,
"predicted_per_token_ms": 451.4778333333333,
"predicted_per_second": 2.21494816836707
}
}

If no tool is needed, the model responds using the defined JSON schema.

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Bogotá, Colombia"
}
},
"required": [
"location"
],
"additionalProperties": false
},
"strict": true
}
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'

Response:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}"
}
}
],
"created": 1739930409,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 24,
"prompt_tokens": 224,
"total_tokens": 248
},
"id": "chatcmpl-cC7Wl3HRlTPJhhba3Lwvmp9bUmP9Fv9t",
"timings": {
"prompt_n": 222,
"prompt_ms": 30874.58,
"prompt_per_token_ms": 139.07468468468468,
"prompt_per_second": 7.190381213282901,
"predicted_n": 24,
"predicted_ms": 10396.267,
"predicted_per_token_ms": 433.1777916666667,
"predicted_per_second": 2.3085209335235426
}
}

If no tool is given, the model also responds using the defined JSON schema.

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'

Response:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today? Feel free to ask me anything you'd like help with, and I'll do my best to provide a helpful response. Whether it's a question, a task you need assistance with, or just general conversation, I'm here to help in any way I can. Don't hesitate to let me know what's on your mind!\"\n}"
}
}
],
"created": 1739930309,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 83,
"prompt_tokens": 10,
"total_tokens": 93
},
"id": "chatcmpl-FfBNMp6abV6hlXUIaXd8DU7WmN0gOEcX",
"timings": {
"prompt_n": 10,
"prompt_ms": 1561.899,
"prompt_per_token_ms": 156.1899,
"prompt_per_second": 6.402462643231093,
"predicted_n": 83,
"predicted_ms": 36502.426,
"predicted_per_token_ms": 439.788265060241,
"predicted_per_second": 2.2738214714824707
}
}

Without the --jinja flag, the response still follows the defined schema.

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "hermes-3-llama-3.1-8b",
"messages": [
{
"role": "user",
"content": "hello"
}
],
"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}'

Response:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today? Feel free to ask me anything you'd like help with, and I'll do my best to provide a helpful response. Whether it's general knowledge, specific topics, or creative writing, I'm here to help however I can.\"\n}"
}
}
],
"created": 1739930230,
"model": "hermes-3-llama-3.1-8b",
"system_fingerprint": "b4739-63e489c0",
"object": "chat.completion",
"usage": {
"completion_tokens": 63,
"prompt_tokens": 10,
"total_tokens": 73
},
"id": "chatcmpl-ArH0inP8QwjVptveK9qaEKhsBoVJWbfg",
"timings": {
"prompt_n": 1,
"prompt_ms": 565.07,
"prompt_per_token_ms": 565.07,
"prompt_per_second": 1.769692250517635,
"predicted_n": 63,
"predicted_ms": 27290.53,
"predicted_per_token_ms": 433.18301587301585,
"predicted_per_second": 2.3084930926588823
}
}
@danbev I'm not reopening this issue since this works now, but just to note that I think the other response_format variants still do not apply the defined schema; see the examples below.

This makes the model respond in JSON, but not using the defined schema.

Request:

"response_format": {
"type": "json_object",
"json_schema": {
"name": "something",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}

Response:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"text\": \"Hello! How can I assist you today?\"\n}"
}
}
],
...
}

This just returns standard text.

Request:

"response_format": {
"type": "json_schema",
"json_schema": {
"name": "something",
"strict": true,
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}
}

Response:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
...
}

This just returns standard text.

Request:

"response_format": {
"type": "json_schema",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}

Response:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
}
}
],
...
}

This request succeeds with the formatted response.

Request:

"response_format": {
"type": "json_object",
"schema": {
"type": "object",
"properties": {
"response": {
"type": "string"
}
},
"required": [
"response"
],
"additionalProperties": false
}
}

Response:

{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"response\": \"Hello! How can I assist you today?\"\n}"
}
}
],
...
}
Name and Version
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Using "response_format" to get the structured output doesn't seem to work properly when using the OpenAI compatible "v1/chat/completions" API.
It keeps returning the "Either "json_schema" or "grammar" can be specified, but not both" error message.
I've tried using several different models from HF, and this issue happens no matter which model I load.
The model that I used in the samples below is this one: https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B
Request:
Response:
I've tried changing the response_format with various values like the ones below, but it keeps returning that same error. Even using the one in the documentation ({"type": "json_object"}) returns the same error.

To add, I tried using the POST /completion API, and with the same GGUF model it is capable of returning output using the defined JSON schema:
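A hedged sketch of what such a /completion request could look like (the json_schema field name is taken from the llama-server documentation; the prompt and schema values are assumptions, not the original request):

curl --location 'http://localhost:1234/completion' \
--header 'Content-Type: application/json' \
--data '{
  "prompt": "hello",
  "json_schema": {
    "type": "object",
    "properties": {
      "response": { "type": "string" }
    },
    "required": ["response"],
    "additionalProperties": false
  }
}'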
Response:
First Bad Commit
No response
Relevant log output