
Misc. bug: "response_format" on the OpenAI compatible "v1/chat/completions" issue #11847

Closed
tulang3587 opened this issue Feb 13, 2025 · 7 comments

@tulang3587

Name and Version

>llama-server --version
version: 4689 (90e4dba4)
built with MSVC 19.42.34436.0 for x64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa

Problem description & steps to reproduce

Using "response_format" to get the structured output doesn't seem to work properly when using the OpenAI compatible "v1/chat/completions" API.
It keeps returning the "Either "json_schema" or "grammar" can be specified, but not both" error message.

I've tried several different models from HF, and this issue happens no matter which model I load.
The model used in the samples below is this one: https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B

Request:

curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "chat_response",
            "strict": true,
            "schema": {
                "type": "object",
                "properties": {
                    "response": {
                        "type": "string"
                    }
                },
                "required": [
                    "response"
                ],
                "additionalProperties": false
            }
        }
    }
}'

Response:

{
    "error": {
        "code": 400,
        "message": "Either \"json_schema\" or \"grammar\" can be specified, but not both",
        "type": "invalid_request_error"
    }
}

I've tried changing the response_format to various values like the ones below, but it keeps returning the same error.

"response_format": {
    "type": "json_schema", // either "json_schema" or "json_object" shows the same error
    "json_schema": {
        "name": "chat_response",
        "strict": true,
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    }
}
"response_format": {
    "type": "json_schema", // either "json_schema" or "json_object" shows the same error
    "schema": {
        "type": "object",
        "properties": {
            "response": {
                "type": "string"
            }
        },
        "required": [
            "response"
        ],
        "additionalProperties": false
    }
}

Even using the one in the documentation ({"type": "json_object"}) returns the same error:

{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {"type": "json_object"}
}

Additionally, I tried the POST /completions endpoint with the same GGUF model, and it is able to return output that follows the defined JSON schema:

Request:

curl --location 'http://localhost:1234/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
    "prompt": "<|im_start|>user\nhello<|im_end|>",
    "json_schema": {
        "type": "object",
        "properties": {
            "response": {
                "type": "string"
            }
        },
        "required": [
            "response"
        ],
        "additionalProperties": false
    }
}'

Response:

{
    "index": 0,
    "content": "{\n    \"response\": \"Hello! How can I assist you today?\"\n}",
    "tokens": [],
    "id_slot": 0,
    "stop": true,
    "model": "hermes-3-llama-3.1-8b",
    "tokens_predicted": 17,
    "tokens_evaluated": 6,
    "generation_settings": {
        "n_predict": -1,
        "seed": 4294967295,
        "temperature": 0.800000011920929,
        "dynatemp_range": 0.0,
        "dynatemp_exponent": 1.0,
        "top_k": 40,
        "top_p": 0.949999988079071,
        "min_p": 0.05000000074505806,
        "xtc_probability": 0.0,
        "xtc_threshold": 0.10000000149011612,
        "typical_p": 1.0,
        "repeat_last_n": 64,
        "repeat_penalty": 1.0,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "dry_multiplier": 0.0,
        "dry_base": 1.75,
        "dry_allowed_length": 2,
        "dry_penalty_last_n": 4096,
        "dry_sequence_breakers": [
            "\n",
            ":",
            "\"",
            "*"
        ],
        "mirostat": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.10000000149011612,
        "stop": [],
        "max_tokens": -1,
        "n_keep": 0,
        "n_discard": 0,
        "ignore_eos": false,
        "stream": false,
        "logit_bias": [],
        "n_probs": 0,
        "min_keep": 0,
        "grammar": "char ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nresponse-kv ::= \"\\\"response\\\"\" space \":\" space string\nroot ::= \"{\" space response-kv \"}\" space\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
        "grammar_trigger_words": [],
        "grammar_trigger_tokens": [],
        "preserved_tokens": [],
        "samplers": [
            "penalties",
            "dry",
            "top_k",
            "typ_p",
            "top_p",
            "min_p",
            "xtc",
            "temperature"
        ],
        "speculative.n_max": 16,
        "speculative.n_min": 5,
        "speculative.p_min": 0.8999999761581421,
        "timings_per_token": false,
        "post_sampling_probs": false,
        "lora": []
    },
    "prompt": "<|begin_of_text|><|im_start|>user\nhello<|im_end|>",
    "has_new_line": true,
    "truncated": false,
    "stop_type": "eos",
    "stopping_word": "",
    "tokens_cached": 22,
    "timings": {
        "prompt_n": 6,
        "prompt_ms": 1098.932,
        "prompt_per_token_ms": 183.15533333333335,
        "prompt_per_second": 5.459846469117288,
        "predicted_n": 17,
        "predicted_ms": 7322.017,
        "predicted_per_token_ms": 430.7068823529412,
        "predicted_per_second": 2.3217646175910276
    }
}

First Bad Commit

No response

Relevant log output

@danbev
Collaborator

danbev commented Feb 13, 2025

Can you try this without using the --jinja flag when starting the server?
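
For reference, that would be your original command with the flag dropped:

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 -fa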

@tulang3587
Author

tulang3587 commented Feb 14, 2025

Without the --jinja flag, it seems to work.

Request:

curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    }
}'

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "{\n  \"response\": \"Hello! How can I assist you today?\"\n}",
                "tool_calls": null,
                "role": "assistant"
            }
        }
    ],
    "created": 1739497907,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4689-90e4dba4",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 17,
        "prompt_tokens": 10,
        "total_tokens": 27
    },
    "id": "chatcmpl-NSUouLxi9bvdjdnqwh7o1DLc4UmoCzsV",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 624.063,
        "prompt_per_token_ms": 624.063,
        "prompt_per_second": 1.6024023215604835,
        "predicted_n": 17,
        "predicted_ms": 10365.442,
        "predicted_per_token_ms": 609.7318823529412,
        "predicted_per_second": 1.6400651318101052
    }
}

However, without --jinja I now can't include any tools in the request.

Request:

curl --location 'http://localhost:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Cookie: frontend_lang=en_US' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello, can you tell me the current weather in New York?"
        }
    ],
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    },
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current temperature for a given location.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City and country e.g. Bogotá, Colombia"
                        }
                    },
                    "required": [
                        "location"
                    ],
                    "additionalProperties": false
                },
                "strict": true
            }
        }
    ]
}'

Response:

{
    "error": {
        "code": 500,
        "message": "tools param requires --jinja flag",
        "type": "server_error"
    }
}

Is there any way to use both functionalities with the OpenAI-compatible chat completion API?

@danbev
Collaborator

danbev commented Feb 14, 2025

Is there any way to use both functionalities with the OpenAI-compatible chat completion API?

I think this might be a bug and I'm looking into this.

If we take a look at how this request is processed on the server, the relevant handler is:

    const auto handle_chat_completions = [&ctx_server, &params, &res_error, &handle_completions_impl](const httplib::Request & req, httplib::Response & res) {
        LOG_DBG("request: %s\n", req.body.c_str());
        if (ctx_server.params_base.embedding) {
            res_error(res, format_error_response("This server does not support completions. Start it without `--embeddings`", ERROR_TYPE_NOT_SUPPORTED));
            return;
        }

        auto body = json::parse(req.body);
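        // oaicompat_completion_params_parse converts the OpenAI-style body into
        // internal llama params; with --jinja this path also applies the chat
        // template, which (as inspected below) populates a "grammar" field.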
        json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);

        return handle_completions_impl(
            SERVER_TASK_TYPE_COMPLETION,
            data,
            req.is_connection_closed,
            res,
            OAICOMPAT_TYPE_CHAT);
    };

We can inspect the body from the request:

(gdb) pjson body
{
    "model": "llama-2-7b-chat",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "chat_response",
            "strict": true,
            "schema": {
                "type": "object",
                "properties": {
                    "response": {
                        "type": "string"
                    }
                },
                "required": [
                    "response"
                ],
                "additionalProperties": false
            }
        }
    }
}

This looks good: there is no grammar attribute in the body.

Next we have the call to:

        json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);

And if we inspect the data after this call, we do see the grammar attribute:

(gdb) pjson data | shell jq
{
    "stop": [],
    "json_schema": {
        "type": "object",
        "properties": {
            "response": {
                "type": "string"
            }
        },
        "required": [
            "response"
        ],
        "additionalProperties": false
    },
    "chat_format": 1,
    "prompt": "<|im_start|>system\nRespond in JSON format, either with `tool_call` (a request to call tools) or with `response` reply to the user's request<|im_end|>\n<|im_start|>user\nhello<|im_end|>\n<|im_start|>assistant\n",
    "grammar": "alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \nalternative-0-tool-call-kv ::= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-1 ::= \"{\" space alternative-1-response-kv \"}\" space\nalternative-1-response ::= \"{\" space alternative-1-response-response-kv \"}\" space\nalternative-1-response-kv ::= \"\\\"response\\\"\" space \":\" space alternative-1-response\nalternative-1-response-response-kv ::= \"\\\"response\\\"\" space \":\" space string\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\nroot ::= alternative-0 | alternative-1\nspace ::= | \" \" | \"\\n\" [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\n",
    "grammar_lazy": false,
    "grammar_triggers": [],
    "preserved_tokens": [],
    "model": "llama-2-7b-chat",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "chat_response",
            "strict": true,
            "schema": {
                "type": "object",
                "properties": {
                    "response": {
                        "type": "string"
                    }
                },
                "required": [
                    "response"
                ],
                "additionalProperties": false
            }
        }
    }
}

If we look in oaicompat_completion_params_parse we can see the following:

    // Apply chat template to the list of messages
    if (use_jinja) {
        ...
        // TODO: support mixing schema w/ tools beyond generic format.
        inputs.json_schema = json_value(llama_params, "json_schema", json());
        auto chat_params = common_chat_params_init(tmpl, inputs);

        llama_params["chat_format"] = static_cast<int>(chat_params.format);
        llama_params["prompt"] = chat_params.prompt;
        llama_params["grammar"] = chat_params.grammar;
        llama_params["grammar_lazy"] = chat_params.grammar_lazy;
        auto grammar_triggers = json::array();
        for (const auto & trigger : chat_params.grammar_triggers) {
            grammar_triggers.push_back({
                {"word", trigger.word},
                {"at_start", trigger.at_start},
            });
        }
        llama_params["grammar_triggers"] = grammar_triggers;

And if we inspect the chat_params we can see that the grammar attribute is there:

(gdb) p chat_params.grammar
$2 = "alternative-0 ::= \"{\" space alternative-0-tool-call-kv \"}\" space\nalternative-0-tool-call ::= \nalternative-0-tool-call-kv ::= \"\\\"tool_call\\\"\" space \":\" space alternative-0-tool-call\nalternative-1 ::= \""...

Perhaps the grammar should be conditioned on the json_schema:

        if (inputs.json_schema == nullptr) {
            llama_params["grammar"] = chat_params.grammar;
            llama_params["grammar_lazy"] = chat_params.grammar_lazy;
            auto grammar_triggers = json::array();
            for (const auto & trigger : chat_params.grammar_triggers) {
                grammar_triggers.push_back({
                    {"word", trigger.word},
                    {"at_start", trigger.at_start},
                });
            }
            llama_params["grammar_triggers"] = grammar_triggers;
        }

I haven't gone through this code before, so I'm unsure if this is the correct thing to do, but I'll open a PR with this suggestion and perhaps others can weigh in.

danbev added a commit to danbev/llama.cpp that referenced this issue Feb 14, 2025
This commit adds a condition to check if the json_schema is null before
adding the grammar and grammar_triggers to the llama_params.

The motivation for this is to prevent the server from throwing an
exception, as the request would otherwise have both a json_schema and a
grammar field.

Resolves: ggerganov#11847
@tulang3587
Author

@danbev I tried building llama.cpp from your branch locally and tested it, but it seems that now both the tools and the response_format are ignored by the model when the --jinja flag is used.

I am using the same https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B model for all of these, and the server is running with this command (except in one sample):

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa

If I add both response_format and tools:

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello, can you tell me the current weather in New York?"
        }
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "chat_response",
            "strict": true,
            "schema": {
                "type": "object",
                "properties": {
                    "response": {
                        "type": "string"
                    }
                },
                "required": [
                    "response"
                ],
                "additionalProperties": false
            }
        }
    },
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current temperature for a given location.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City and country e.g. Bogotá, Colombia"
                        }
                    },
                    "required": [
                        "location"
                    ],
                    "additionalProperties": false
                },
                "strict": true
            }
        }
    ]
}'

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The current temperature in New York is 68°F (20°C) with partly cloudy skies. The wind is blowing at 6 mph with a humidity of 50%. It feels like 65°F (18°C)."
            }
        }
    ],
    "created": 1739758241,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4714-1c9bd941",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 53,
        "prompt_tokens": 242,
        "total_tokens": 295
    },
    "id": "chatcmpl-QaAlEpc4cvVciz3MDcJFg5bfWN2XyZYT",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 569.15,
        "prompt_per_token_ms": 569.15,
        "prompt_per_second": 1.7570060616709129,
        "predicted_n": 53,
        "predicted_ms": 25060.5,
        "predicted_per_token_ms": 472.83962264150944,
        "predicted_per_second": 2.1148819855948604
    }
}

The model just hallucinates and responds without calling the tool.


If I only add response_format (with the --jinja flag):

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "chat_response",
            "strict": true,
            "schema": {
                "type": "object",
                "properties": {
                    "response": {
                        "type": "string"
                    }
                },
                "required": [
                    "response"
                ],
                "additionalProperties": false
            }
        }
    }
}'

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! How can I assist you today?"
            }
        }
    ],
    "created": 1739758800,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4714-1c9bd941",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 17,
        "prompt_tokens": 44,
        "total_tokens": 61
    },
    "id": "chatcmpl-xNEddsFUUKL6dKhNc1CLqw8fMTBWTPBt",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 464.608,
        "prompt_per_token_ms": 464.608,
        "prompt_per_second": 2.1523520903643503,
        "predicted_n": 17,
        "predicted_ms": 7200.607,
        "predicted_per_token_ms": 423.5651176470588,
        "predicted_per_second": 2.36091207310717
    }
}

The model obviously can't call any tools, but the response IS NOT using the requested format.


If I only add response_format (without the --jinja flag):

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "chat_response",
            "strict": true,
            "schema": {
                "type": "object",
                "properties": {
                    "response": {
                        "type": "string"
                    }
                },
                "required": [
                    "response"
                ],
                "additionalProperties": false
            }
        }
    }
}'

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "{ \"response\": \"Hello! How can I assist you today?\" }"
            }
        }
    ],
    "created": 1739759039,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4714-1c9bd941",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 16,
        "prompt_tokens": 10,
        "total_tokens": 26
    },
    "id": "chatcmpl-WJfSYeUk6cytAZCq1uWQ38VPkyw921Nh",
    "timings": {
        "prompt_n": 10,
        "prompt_ms": 1666.412,
        "prompt_per_token_ms": 166.6412,
        "prompt_per_second": 6.000916940108448,
        "predicted_n": 16,
        "predicted_ms": 7331.583,
        "predicted_per_token_ms": 458.2239375,
        "predicted_per_second": 2.1823390664744573
    }
}

The model obviously can't call any tools, but the response IS using the requested format.


If I only add tools:

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello, can you tell me the current weather in New York?"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current temperature for a given location.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City and country e.g. Bogotá, Colombia"
                        }
                    },
                    "required": [
                        "location"
                    ],
                    "additionalProperties": false
                },
                "strict": true
            }
        }
    ]
}'

Response:

{
    "choices": [
        {
            "finish_reason": "tool_calls",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": null,
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"location\":\"New York, USA\"}"
                        },
                        "id": ""
                    }
                ]
            }
        }
    ],
    "created": 1739758968,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4714-1c9bd941",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 36,
        "prompt_tokens": 242,
        "total_tokens": 278
    },
    "id": "chatcmpl-HFubvobm8fmhG1rBX9XXL3la1fBju80h",
    "timings": {
        "prompt_n": 209,
        "prompt_ms": 29640.877,
        "prompt_per_token_ms": 141.82237799043062,
        "prompt_per_second": 7.0510734213431,
        "predicted_n": 36,
        "predicted_ms": 16525.598,
        "predicted_per_token_ms": 459.04438888888893,
        "predicted_per_second": 2.178438565430431
    }
}

The model can perform the tool call request.

@danbev
Collaborator

danbev commented Feb 17, 2025

@tulang3587 Sorry, but I think I might have misled you; the "fix" I proposed above does not seem to be correct.
However, there is an open PR which mentions this:

Fixed & tested --jinja w/o tool call w/ grammar or json_schema

I've not had time to try it out yet, but would you be able to see if it addresses your issue?
Update: I tried out the PR, and the original "Either "json_schema" or "grammar" can be specified, but not both" error is no longer present.

Not related to this issue, but I noticed that there is a specific chat template for the model you are using, which might be useful:

--chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja
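
For example, combined with your command it would look something like this (assuming the models/templates directory from a llama.cpp checkout):

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa --chat-template-file models/templates/NousResearch-Hermes-3-Llama-3.1-8B-tool_use.jinja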

@tulang3587
Author

@danbev I see the PR has been merged into master and I just tested the latest build (https://github.com/ggml-org/llama.cpp/releases/tag/b4739). It looks good so far, so let me close this issue.

llama-server -m Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -a hermes-3-llama-3.1-8b --port 1234 --jinja -fa


If a tool is needed, the model returns the tool_call as expected.

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello, can you tell me the current weather in New York?"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current temperature for a given location.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City and country e.g. Bogotá, Colombia"
                        }
                    },
                    "required": [
                        "location"
                    ],
                    "additionalProperties": false
                },
                "strict": true
            }
        }
    ],
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    }
}'

Response:

{
    "choices": [
        {
            "finish_reason": "tool_calls",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": null,
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"location\":\"New York, USA\"}"
                        },
                        "id": ""
                    }
                ]
            }
        }
    ],
    "created": 1739930518,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4739-63e489c0",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 36,
        "prompt_tokens": 236,
        "total_tokens": 272
    },
    "id": "chatcmpl-TE1CyOCGjIUhugis6urAnjhkEPdMTacw",
    "timings": {
        "prompt_n": 17,
        "prompt_ms": 2903.262,
        "prompt_per_token_ms": 170.78011764705883,
        "prompt_per_second": 5.855482557206342,
        "predicted_n": 36,
        "predicted_ms": 16253.202,
        "predicted_per_token_ms": 451.4778333333333,
        "predicted_per_second": 2.21494816836707
    }
}

If no tool is needed, response_format works fine.

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current temperature for a given location.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City and country e.g. Bogotá, Colombia"
                        }
                    },
                    "required": [
                        "location"
                    ],
                    "additionalProperties": false
                },
                "strict": true
            }
        }
    ],
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    }
}'

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "{\n  \"response\": \"Hello! How can I assist you today?\"\n}"
            }
        }
    ],
    "created": 1739930409,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4739-63e489c0",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 24,
        "prompt_tokens": 224,
        "total_tokens": 248
    },
    "id": "chatcmpl-cC7Wl3HRlTPJhhba3Lwvmp9bUmP9Fv9t",
    "timings": {
        "prompt_n": 222,
        "prompt_ms": 30874.58,
        "prompt_per_token_ms": 139.07468468468468,
        "prompt_per_second": 7.190381213282901,
        "predicted_n": 24,
        "predicted_ms": 10396.267,
        "predicted_per_token_ms": 433.1777916666667,
        "predicted_per_second": 2.3085209335235426
    }
}

If no tool is given, response_format also works fine.

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    }
}'

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "{\n  \"response\": \"Hello! How can I assist you today? Feel free to ask me anything you'd like help with, and I'll do my best to provide a helpful response. Whether it's a question, a task you need assistance with, or just general conversation, I'm here to help in any way I can. Don't hesitate to let me know what's on your mind!\"\n}"
            }
        }
    ],
    "created": 1739930309,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4739-63e489c0",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 83,
        "prompt_tokens": 10,
        "total_tokens": 93
    },
    "id": "chatcmpl-FfBNMp6abV6hlXUIaXd8DU7WmN0gOEcX",
    "timings": {
        "prompt_n": 10,
        "prompt_ms": 1561.899,
        "prompt_per_token_ms": 156.1899,
        "prompt_per_second": 6.402462643231093,
        "predicted_n": 83,
        "predicted_ms": 36502.426,
        "predicted_per_token_ms": 439.788265060241,
        "predicted_per_second": 2.2738214714824707
    }
}

Without the --jinja flag, response_format also works fine.

Request:

curl --location 'http://127.0.0.1:1234/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "hermes-3-llama-3.1-8b",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "response_format": {
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    }
}'

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "{\n  \"response\": \"Hello! How can I assist you today? Feel free to ask me anything you'd like help with, and I'll do my best to provide a helpful response. Whether it's general knowledge, specific topics, or creative writing, I'm here to help however I can.\"\n}"
            }
        }
    ],
    "created": 1739930230,
    "model": "hermes-3-llama-3.1-8b",
    "system_fingerprint": "b4739-63e489c0",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 63,
        "prompt_tokens": 10,
        "total_tokens": 73
    },
    "id": "chatcmpl-ArH0inP8QwjVptveK9qaEKhsBoVJWbfg",
    "timings": {
        "prompt_n": 1,
        "prompt_ms": 565.07,
        "prompt_per_token_ms": 565.07,
        "prompt_per_second": 1.769692250517635,
        "predicted_n": 63,
        "predicted_ms": 27290.53,
        "predicted_per_token_ms": 433.18301587301585,
        "predicted_per_second": 2.3084930926588823
    }
}

@tulang3587
Author

@danbev I'm not reopening this issue since this works now, but just to note: I think the response_format values that can be used don't exactly match the OpenAI and llama.cpp documentation.
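
For comparison, the shape documented by OpenAI nests the schema under a json_schema wrapper, like my second variant below (field values here are taken from my own examples):

"response_format": {
    "type": "json_schema",
    "json_schema": {
        "name": "something",
        "strict": true,
        "schema": { ... }
    }
}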


This makes the model respond in JSON, but without using the defined schema.

Request:

"response_format": {
    "type": "json_object",
    "json_schema": {
        "name": "something",
        "strict": true,
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    }
}

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "{\n  \"text\": \"Hello! How can I assist you today?\"\n}"
            }
        }
    ],
    ...
}

This just returns standard text.

Request:

"response_format": {
    "type": "json_schema",
    "json_schema": {
        "name": "something",
        "strict": true,
        "schema": {
            "type": "object",
            "properties": {
                "response": {
                    "type": "string"
                }
            },
            "required": [
                "response"
            ],
            "additionalProperties": false
        }
    }
}

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! How can I assist you today?"
            }
        }
    ],
    ...
}

This just returns standard text.

Request:

"response_format": {
    "type": "json_schema",
    "schema": {
        "type": "object",
        "properties": {
            "response": {
                "type": "string"
            }
        },
        "required": [
            "response"
        ],
        "additionalProperties": false
    }
}

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! How can I assist you today?"
            }
        }
    ],
    ...
}

This request succeeds with the formatted response:

Request:

"response_format": {
    "type": "json_object",
    "schema": {
        "type": "object",
        "properties": {
            "response": {
                "type": "string"
            }
        },
        "required": [
            "response"
        ],
        "additionalProperties": false
    }
}

Response:

{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "{\n  \"response\": \"Hello! How can I assist you today?\"\n}"
            }
        }
    ],
   ...
}
