server : improvements and maintenance #4216

ggerganov · 2023-11-25T09:57:53Z

The server example has been growing in functionality, but unfortunately I feel it is not very stable at the moment and some important features are still missing. I'm creating this issue to keep track of some of these points and to try to draw more attention from the community. Some of the tasks are relatively big and would require significant effort to complete.

This is likely not a complete list of things - if you think some feature is important to be improved or supported, drop a comment.

Have a look at the issues labelled server/webui.


IridiumMaster · 2023-11-25T10:37:48Z

Would love it if the server could get lookahead decoding and contrastive search. A collection of common presets would be very helpful for fast model evaluation. The ability to edit responses and replies in the UI would be very useful for rapidly testing prompt branches if combined with batching capabilities. Would also appreciate a simple implementation of request queuing and a server interface for the model training example. Edit: Discussion link for contrastive search: #3450; other related topics and potential substitutes are mentioned in the thread.

ruped · 2023-11-25T16:33:23Z

Thanks for raising this issue and looking into the server example.

I think this #4201 could be relevant - although it sounds like the fix will be in the core code rather than in the server.

Since the addition of support for batching, llama.cpp could become a viable competitor to vllm for large scale deployments. This is also helpful for individual hobbyists who are using/building AI agents (because these may make multiple requests in parallel to the LLMs to construct answers). So I think your suggestions around improving the stability of the server example and refactoring it would be very valuable, as would focusing on throughput, particularly for batched requests (and benchmarking this against vllm).

mudler · 2023-11-25T17:16:52Z

It would be lovely to also see speculative sampling added to it - that would be a really great addition.

tobi · 2023-11-25T18:30:33Z

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

There are 100s of libraries and tools that integrate different subsets of backends and inference libraries, especially in the Python world. This doesn't make sense. We need a simple convention by which everything can interop. The solution is to use OpenAI's API as a protocol on localhost. Could there be better standards? Maybe. But this is the one we have, and it works really well.

My suggestion is that we clean up the server and treat it and the /chat/completions endpoint as the main deliverable of this repository. We can easily switch the web interface to use that as well. ./server -m ~/model should boot with the ideal default parameters read from the gguf, like context size and (if we can pull it off) chat template style.

This means that existing code only needs the api_url override to be modified to work locally.


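from openai import OpenAI

# Point the official client at the local server instead of api.openai.com.
# The client insists on an API key (argument or OPENAI_API_KEY env var); the
# value here is a dummy, assuming the server was started without key checking.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
  model="llama!",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message.content)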

This works already, at least as long as you are loading a model that conforms to ChatML and are ok with the default context size. I find that a much better vision for how LLM interop will work in the open source space. Different servers, different backends, all on the same protocol.

FSSRepo · 2023-11-26T02:31:42Z

@ggerganov

Batched decoding endpoint?

This option to generate multiple alternatives for the same prompt requires the ability to change the seed, and the truth is, I've been having a bit of a struggle with it when adding parallel decoding, as it raises questions about how the seed should be managed.

spirobel · 2023-11-26T10:57:21Z

@tobi

> Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this use case quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy-to-link library with a C API should be the main deliverable of this project.

studiotatsu · 2023-11-26T17:51:11Z

The OAI API included with the server is great, I love it.
Please include the llama_params "repeat_penalty" and "min_p".

These params are much needed. Thanks.

antcodd · 2023-11-27T06:35:37Z

I think it would be good if the OAI endpoint supported the same set of parameters and defaults as the regular endpoint, with sensible or argument-driven defaults, given that many clients won't supply all parameters.

One issue is that the seed defaults to 0 instead of -1, so every regeneration is the same if the client doesn't specify a seed.
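
To illustrate the seed behaviour being asked for, here is a minimal sketch. It assumes -1 is the sentinel for "pick a random seed", as the comment above implies; `resolve_seed` and the request dictionary shape are hypothetical names for illustration, not the server's actual code.

```python
import random

RANDOM_SEED = -1  # assumed sentinel meaning "draw a fresh seed for this request"


def resolve_seed(request: dict, default_seed: int = RANDOM_SEED) -> int:
    """Return the seed to use for one request (hypothetical helper).

    An explicit seed in the request wins. Otherwise the server-side default
    applies, and the -1 sentinel is replaced by a freshly drawn random seed.
    """
    seed = request.get("seed", default_seed)
    if seed == RANDOM_SEED:
        seed = random.SystemRandom().randrange(2**32)
    return seed
```

With a default of 0, every client that omits the seed gets the same generation; with a default of -1, each such request gets its own seed while clients that do send a seed stay reproducible.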

IridiumMaster · 2023-11-27T06:57:54Z

@tobi

> Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.
>
> LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily. This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi.

mudler · 2023-11-27T08:15:00Z

@tobi

> Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.
>
> LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.
>
> With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily.

Sorry to jump in off-topic, but you are not sacrificing any speed or capabilities with LocalAI - in the end the engine is always the same (llama.cpp, or vllm, or you name it). However, I do see the value of having a server in llama.cpp. In the end it's people's choice what suits their needs better. Also, the LocalAI server implementation is heavily based on that ;)

> This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi.

For production there are quite a few issues that are more of a blocker, IMHO, than this. We had several bugs in LocalAI with llama.cpp, which still makes it difficult to move in that direction; I hope these get addressed with this ticket. Things like #3969 are quite scary for prod users.

ruped · 2023-11-27T11:31:28Z

Just a thought as a user of llama.cpp server: I imagine it's quite common for the llama.cpp Server to be used by developers who are able to add non core functionality in their own code. (e.g. Devs create their own application or library or REST server that wraps/orchestrates llama.cpp). Naturally the llama.cpp server is very convenient for this and works with any programming language. It also has a smaller/self contained API to learn.

I think some of the following can be done in dev's own code outside of llama.cpp:

* Basic templating
* Additional interfaces (e.g. OpenAI compatibility), by setting up an intermediary server that calls the llama.cpp server
* Making batch requests (by using multiple HTTP calls to the llama.cpp server)

(Disclaimer: These are just examples, I haven't fully evaluated the pros/cons of implementing them outside of llama.cpp)
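
As a rough illustration of the last point, client-side batching could look like the sketch below. It assumes the server listens on 127.0.0.1:8080 and that its `/completion` endpoint accepts `prompt` and `n_predict` and returns the text under `content`; `complete` and `complete_many` are hypothetical helper names.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed address of a locally running llama.cpp server.
SERVER_URL = "http://127.0.0.1:8080/completion"


def complete(prompt: str, n_predict: int = 64) -> str:
    """Send one prompt to the server and return the generated text."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["content"]


def complete_many(prompts, send=complete, max_workers=4):
    """Issue one HTTP call per prompt in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

Whether the calls actually run concurrently on the server side depends on how many parallel slots it was started with; the client only controls how many requests are in flight.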

It's excellent if this project has the mission and bandwidth to provide functionalities like these. But if it sounds like it's becoming too much work or feature creep, then I imagine focusing on the bits that are impossible to do outside of llama.cpp is one of the ways to prioritise.

dongxiaolong · 2023-11-28T05:55:12Z

Hi @ggerganov. The vllm project has a PR under construction for a chat template that can be used as a reference: vllm-project/vllm#1756

ggerganov · 2023-11-28T14:04:57Z

Regarding chat templates: I see they are using something called Jinja.

We are not going to implement Jinja in llama.cpp.

The best we can do are 2 things:

* add an API endpoint for the clients to get the Jinja string and do whatever they want with it
* add hardcoded implementations of common templates, where we string match the template and if it is something we know, we call a simple C++ function to tokenize as expected

Tostino · 2023-11-28T14:24:17Z

@ggerganov If you are going to hard code templates, this server will be totally unusable for a large number of users. I am experimenting with new templates, and would really rather the models trained with them be widely supported. Hell, there are so many variations of the chat-ml template floating around with no indication which is the correct version.

I mentioned on the other ticket that there is: https://github.com/jinja2cpp/Jinja2Cpp

Maybe that can be an optional component to add support for chat templates from the tokenizer, and hard coding can be the default code-path, I understand not wanting to add additional dependencies.

Getting the jinja string in the client is not helpful as an API endpoint, unless there is a client side compatibility layer between the chat/completions and completions endpoint.

I had opened an issue for chat template support a while ago, when I started working on it for vLLM: #3810

I implemented this for vLLM, and after going through a few rounds of testing, I had to rework things and add additional parameters and CLI arguments to support the API properly.
We should very much stay on the same page for our implementations.

Here is the diff for my chat/completions endpoint changes: https://github.com/vllm-project/vllm/pull/1756/files#diff-38318677b76349044192bf70161371c88fb2818b85279d8fc7f2c041d83a9544

The important points from the vLLM pull request:

1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Update to the chat API request handling to support finishing a partial response correctly, and echoing input portions of messages (request.add_generation_prompt, and request.echo).

request.echo is an extension of the API. It exists because open source LLMs are able to finish the last role:content pair in the messages list when request.add_generation_prompt=false (which is itself an extension of the API, needed to support this HF feature), provided the template and model support that feature.

We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly.

mudler · 2023-11-28T14:31:03Z

> Regarding chat templates: I see they are using something called Jinja.
>
> We are not going to implement Jinja in llama.cpp.
>
> The best we can do are 2 things:
> * add an API endpoint for the clients to get the Jinja string and do whatever they want with it
>
> * add hardcoded implementations of common templates, where we string match the template and if it is something we know, we call a simple C++ function to tokenize as expected

My personal thoughts here, but C++ probably isn't the best language for that - templating is much easier to implement in scripting languages than in C++, and in my opinion it would undermine the maintainability and flexibility of a lean server.cpp implementation.

Just my 2c, but maybe templating fits better on top of llama-cpp-python - which might be easier to go and to maintain (while keeping the core small and extensible)?

ggerganov · 2023-11-28T15:46:09Z

@Tostino

All templates that I've seen so far are so basic that I don't understand why we need an entire scripting language to express them. Is there a more advanced use case other than a basic for loop over the messages + add prefix/suffix?

How many templates do we expect to ever have? 10s, 100s? Even if it is 1000s, I prefer to have them hardcoded instead of building jinja2cpp (it takes 10 minutes !! to just run cmake config)

Here is sample ChatML template in a few lines of C++ that we currently use (and this is not even the best way to do it):

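// json_value is the server's helper that reads a key from a JSON object,
// falling back to the given default when the key is missing.
std::string format_chatml(std::vector<json> messages)
{
    std::ostringstream chatml_msgs;

    // Wrap every message as: <|im_start|>{role}\n{content}<|im_end|>\n
    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role",    std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
                    << "<|im_end|>\n";
    }

    // Open the assistant turn so that the model generates the reply
    chatml_msgs << "<|im_start|>assistant" << '\n';

    return chatml_msgs.str();
}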

I could be missing something, but for the moment I don't see a good reason to add Jinja support. Let's see how it goes - I'm open to reconsider, but need to see some reasonable examples and use cases that justify this dependency.

> The request.echo is an extension of the api, due to the nature of Open Source LLMs being able to finish the last role:content pair in the messages list if request.add_generation_prompt=false (which is also an extension of the API due to the need to support this HF feature) and the template/model support that feature.
>
> We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly.

I think I understand the request.add_generation_prompt parameter, but I don't understand request.echo - can you clarify / give an example?

@mudler

Yes, I agree.

Tostino · 2023-11-28T16:22:46Z

The fact is, if the rest of the ecosystem standardizes on these templates being "the way" to format messages, it will proliferate to new and unexpected use cases.

python3 -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --chat-template ./examples/template_inkbot.jinja

Here is an example call using my inkbot template which uses echo:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "teknium/OpenHermes-2.5-Mistral-7B",
    "stream": false,
    "stop": ["\n<#bot#>","\n<#user#>"],
    "add_generation_prompt": false,
    "echo": true,
    "temperature": 0.0,
    "n": 1,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "general"},
	{"role": "system", "content": "You are a helpful assistant."},
	{"role": "user", "content": "Hello!"},
	{"role": "assistant", "content": "Hello, how are you?"},
	{"role": "user", "content": "Great, thank you! Now i would like some help planning my weekend in Asheville, I would like some hike suggestions. Please list out a few moderate"}	
	]
  }'

Which returns:

{"id":"cmpl-bb73e8eefb164c3194bb2b450369e1c6","object":"chat.completion","created":195778,"model":"teknium/OpenHermes-2.5-Mistral-7B","choices":[{"index":0,"message":{"role":"user","content":"Great, thank you! Now i would like some help planning my weekend in Asheville, I would like some hike suggestions. Please list out a few moderate to difficult hikes in the area."},"finish_reason":"stop"}],"usage":{"prompt_tokens":107,"total_tokens":121,"completion_tokens":14}}

vs with "echo": false:

{"id":"cmpl-86ba4dd235a84b8e9a7361b46b04ac79","object":"chat.completion","created":195723,"model":"teknium/OpenHermes-2.5-Mistral-7B","choices":[{"index":0,"message":{"role":"user","content":" to difficult hikes in the area."},"finish_reason":"stop"}],"usage":{"prompt_tokens":107,"total_tokens":121,"completion_tokens":14}}

Since the official OpenAI API for chat/completions doesn't allow you to complete an incomplete message, there was no point for them to implement echo in the chat/completions endpoint. The HF chat_template spec explicitly supports that feature with the add_generation_prompt parameter, so it made sense to implement echo for ease of use. It is an extension of the API, which is why I was calling it out though. I tried to choose the most likely behavior / keywords if OpenAI ever did expand their API to add echo.
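
The interaction between the two flags can be sketched as follows. This is only an illustration under assumptions: it uses a ChatML-style layout rather than the inkbot template from the example, and `render_chatml` and `build_response_message` are hypothetical helpers, not the vLLM implementation.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages as ChatML-style text (illustrative layout).

    With add_generation_prompt=False the last message is left open (no end
    marker), so the model continues it instead of starting a new turn.
    """
    parts = []
    for i, msg in enumerate(messages):
        is_last = i == len(messages) - 1
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}")
        if add_generation_prompt or not is_last:
            parts.append("<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def build_response_message(messages, generated, add_generation_prompt=True,
                           echo=False, response_role="assistant"):
    """Build the message object returned in choices[0].message."""
    if add_generation_prompt:
        # Normal case: the model produced a brand new turn.
        return {"role": response_role, "content": generated}
    # Continuation case: the model finished the last message of the request.
    last = messages[-1]
    content = last["content"] + generated if echo else generated
    return {"role": last["role"], "content": content}
```

With the request above, `echo=True` yields the full user sentence and `echo=False` yields only the continuation, matching the two responses shown.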

Edit:
Yeah, 10 min for a cmake is painful... Unsure what the best way forward is to be honest.
But without actual support for the chat template that the model creator defined, this isn't usable for me (and many others).

FSSRepo · 2023-11-28T17:53:00Z

In my opinion, most of these projects based on ggml have the characteristic of being very lightweight with few dependencies (header-only libraries such as httplib.h, json.hpp, stb_image.h and others), making them portable compared to having to download a 2 GB library like PyTorch and an entire Python environment that pulls in packages that will never be used.

Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project.

Tostino · 2023-11-28T19:34:48Z

> In my opinion, most of these projects based on ggml have the characteristic of being very lightweight with few dependencies (headers library: httplib.h json.hpp stb_image.h and others), making them portable compared to having to download a 2 GB library like PyTorch and the entire Python environment that downloads packages that will never be used.
>
> Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project.

Absolutely no one is advocating for a whole pytorch dependency chain. There just may be other options for running the jinja that don't bloat the dependency chain too badly, and I very much think it's worth discussing further to see if there is an acceptable solution that can be found.

Even if it's something like transpiling jinja to another language that we can directly run, or providing hooks for users to run a python interpreter and the jinja dependency to give the results back to the main cpp program. That way it can be optional, and fall back to hard coded options if unavailable.

Just some thoughts, take them for what you will, I am not a cpp dev.

FSSRepo · 2023-11-28T19:42:49Z

I would suggest something like creating a small utility that performs the functionality we are interested in using C++ (porting it).

Analyzing the Jinja2cpp library quickly, it has Boost as a dependency, which explains the long CMake configuration time. It could be beneficial to decouple that library and include only the necessary functions for Jinja2cpp to work, making it more lightweight.

psugihara · 2023-11-28T20:59:49Z

@tobi completely agree that server.cpp should be a first-class focus of this repo. My macOS app uses exactly the architecture you describe, hitting server on localhost. I would note however that iOS apps cannot include executables so server.cpp won't work in at least that case. Tangentially, it might make sense to pull some of the common completion/tokenizing/batching/parallelization functionality being added to server.cpp into the llama.cpp core so that each platform doesn't have to rewrite completion_loop, etc.

I also wanted to throw in an example of some ugly code I'd love to kill with built-in server.cpp templating. I'm guessing every server.cpp client has some version of this and I'm sure they all have slightly different bugs: https://github.com/psugihara/FreeChat/blob/main/mac/FreeChat/Models/NPC/PromptTemplates/Templates.swift

@Tostino After understanding more of the background here, I agree that ideally we'd want to support the jinja templates included in GGUFs. I didn't even know these were added to GGUF, that's so cool! Unfortunately I'm not seeing a ton of existing work in cpp besides the relatively heavyweight jinja2cpp you found as well. Implementing a minimal jinja2 parser seems out of scope for v1 of template support but perhaps a more incremental compromise could work...

* add an endpoint for retrieving the jinja template, allowing clients to skip parsing the gguf themselves if they want to run the template directly
* this endpoint could indicate whether the template is supported by server.cpp itself (server.cpp could hardcode a cpp template implementation + hash of a corresponding jinja template for example)
* when requesting a chat completion, the client could indicate whether they've already templated their input

I agree with @ggerganov that the templates are pretty trivial to implement in c++ or whatever and I'd first and foremost just like to have them all in one place (ideally llama.cpp) rather than bespoke implementations in each client. A mapping from jinja template hashes to c++ functions would be the most performant setup too, even if it's a bit ugly conceptually.
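
A minimal sketch of that hash-to-function mapping, written in Python for brevity. Everything here is an assumption for illustration: the `KNOWN_CHATML_JINJA` string is a placeholder rather than a real model template, and `describe_template` and `apply_template` are hypothetical names for what the endpoint and the completion path might do.

```python
import hashlib


def format_chatml(messages):
    """Hardcoded ChatML formatter, mirroring the C++ function shown earlier."""
    out = []
    for msg in messages:
        out.append(f"<|im_start|>{msg.get('role', 'user')}\n"
                   f"{msg.get('content', '')}<|im_end|>\n")
    out.append("<|im_start|>assistant\n")
    return "".join(out)


def template_hash(jinja_source: str) -> str:
    """Stable identifier for a Jinja template string."""
    return hashlib.sha256(jinja_source.strip().encode("utf-8")).hexdigest()


# Placeholder standing in for the Jinja string stored in a model's metadata.
KNOWN_CHATML_JINJA = "{% for message in messages %}(chatml placeholder){% endfor %}"

# Maps the hash of a known Jinja template to its hardcoded implementation.
FORMATTERS = {template_hash(KNOWN_CHATML_JINJA): format_chatml}


def describe_template(jinja_source: str) -> dict:
    """What a template endpoint could return: the raw string plus support flag."""
    return {"template": jinja_source,
            "supported": template_hash(jinja_source) in FORMATTERS}


def apply_template(jinja_source: str, messages) -> str:
    """Format messages with the hardcoded implementation, if one is known."""
    formatter = FORMATTERS.get(template_hash(jinja_source))
    if formatter is None:
        raise ValueError("unsupported chat template")
    return formatter(messages)
```

A client that sees `"supported": false` could then run the Jinja string itself and fall back to the plain completion endpoint.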

If templates are added here, I can delete my implementation in FreeChat so we'll have net 0 fragmentation :)

Tostino · 2023-11-28T21:29:49Z

> when requesting a chat completion, the client could indicate whether they've already templated their input

That isn't possible. You can template your response on the client side, but then you need to hit the legacy completion endpoint, because the payload for chat/completion doesn't support a formatted string, just a list of messages with role/content.

psugihara · 2023-11-28T21:44:54Z

> then you need to hit the legacy completion endpoint

For my use-case that would be fine. Though it does look like there are some non-standard args supported by server's chat/completion already (e.g. mirostat).

FSSRepo · 2024-04-08T19:43:54Z

@ggerganov Is there any way to do the following with llama.cpp?

Sorry for my poor use of Paint, but I wanted to convey my idea for improving the way we handle requests from different clients more efficiently and conveniently, at least for applications like chatbots. When a client sends a PDF document for processing, other clients shouldn't get stuck, and should continue to receive tokens continuously.

slaren · 2024-04-08T20:10:49Z

@FSSRepo to elaborate on what I mentioned before: the way you can do this is by splitting all the work into small batches. If the server receives a 2048 token prompt from one client, you don't process all of it in a single evaluation; for example, evaluate only the first 512 tokens. If it receives another request from another client for 256 tokens in the meantime, then in the next batch evaluate 256 tokens from the first client and 256 from the new client. And so on until all the requests have been processed, with every new batch fairly distributing the work queued from the different clients.
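
The scheme described above can be sketched as a small planning function. This is a sketch under assumptions, not the server's scheduler: `plan_batch` is a hypothetical name, and it only decides how many queued prompt tokens each client contributes to the next batch.

```python
def plan_batch(pending: dict, batch_size: int) -> dict:
    """Split a token budget fairly across clients with queued prompt tokens.

    pending maps client id -> number of prompt tokens still to evaluate.
    Returns client id -> number of tokens to evaluate in the next batch.
    """
    plan = {cid: 0 for cid, n in pending.items() if n > 0}
    budget = batch_size
    while budget > 0:
        # Clients that still have tokens left after what is already planned.
        active = [cid for cid in plan if pending[cid] - plan[cid] > 0]
        if not active:
            break
        share = max(1, budget // len(active))
        for cid in active:
            take = min(share, pending[cid] - plan[cid], budget)
            plan[cid] += take
            budget -= take
            if budget == 0:
                break
    return plan
```

Replaying the example: with one client holding 2048 tokens and a batch size of 512, the first batch takes 512 tokens from that client. Once a second client arrives with 256 tokens, the next batch takes 256 from each, and leftover budget from short prompts is handed to the clients that still have work queued.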

ggerganov · 2024-04-09T05:49:15Z

Yes, what @slaren said. The API is flexible and server implements just one possible way of batching the inputs

phymbert · 2024-04-09T08:57:58Z

> If the server receives a 2048 token prompt from one client, you don't process all of it in a single evaluation. For example evaluate the first 512 tokens only. If it receives another request from another client for 256 tokens in the meanwhile, then in the next batch evaluate 256 tokens from the first client and 256 from the new client. And so on until all the requests have been processed

Maybe we can divide the batch-size by the number of slots at command LOAD_PROMPT to determine how many max prompt tokens a slot can include in the current batch. I will have a look

c608345 · 2024-04-13T07:18:06Z

Server HTTP API CORS is broken: #6544.

jboero · 2024-04-23T15:19:45Z

I put together a PR with some themes and an example of how people can skin/re-theme it. Anyone interested in approving? Note that graphic design is NOT my specialty.

#6848

jboero · 2024-04-23T15:20:44Z

Kartoffelsaft · 2024-05-11T19:20:52Z

I noticed that the OpenAI semi-compatible API defaults the temperature to 0, but this is different than OpenAI's actual API which defaults to 1. This can make applications that assume a non-0 value default (such as Oatmeal) much more frustrating to use. Curious if there's a reason the default differs? If it's worth changing this, I've already set up a PR: #7226

jukofyork · 2024-05-22T18:31:04Z

I've thought of something, but I can't tell whether it has already been suggested, as I'm not sure what words best describe it:

--override-kv KEY=TYPE:VALUE
                            advanced option to override model metadata by key. may be specified multiple times.
                            types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false

It would be very helpful if we could somehow supply a set of overrides for the parameters that are sent via the API (possibly using a JSON file as input). I originally wanted to use the logit-bias option via the shell script I use to run the server, but realized that unless the front-ends know about all the options, it's not possible to use them.

This would also solve the person above's problem of the temperature being default of zero too and let them override it.

I think it would be pretty straightforward to implement as you could just have a JSON file as input that follows the exact same format as the API uses, and then use it to set new default values.

It's also possible this CLI option could be adapted:

  -spf FNAME, --system-prompt-file FNAME
                            set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications.

as it already loads and parses a JSON file for a few parameters....

If anybody is interested then I can probably take a look over the weekend at adding this?

It could work in 2 ways too:

* As an override to the default parameters, which can then be changed as normal by setting them in the API calls. This fixes the problem that the person above and I have.
* As a frozen parameter, to stop, say, family members doing daft stuff like setting the -ngl too high and crashing the server with OOM errors, etc. In this case the parameter sent from the API would be ignored.
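
A minimal sketch of how the two modes could combine. The file layout with `defaults` and `frozen` keys is purely hypothetical, as are the helper names; the parameter names are just examples of request fields.

```python
import json


def load_overrides(path: str) -> dict:
    """Read the (hypothetical) overrides file given on the command line."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def effective_params(builtin_defaults: dict, overrides: dict, request: dict) -> dict:
    """Merge parameters in order of increasing priority.

    built-in defaults < overridden defaults < request values < frozen values
    """
    defaults = overrides.get("defaults", {})
    frozen = overrides.get("frozen", {})
    params = {**builtin_defaults, **defaults}
    # Request values apply, except for keys the operator has frozen.
    params.update({k: v for k, v in request.items() if k not in frozen})
    params.update(frozen)
    return params
```

For example, with built-in defaults `{"temperature": 0.0, "n_predict": 128}` and an overrides file containing `{"defaults": {"temperature": 0.8}, "frozen": {"n_predict": 256}}`, a client that sends nothing gets temperature 0.8, and a client that sends `n_predict: 4096` still gets 256.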

bionicles · 2024-05-28T22:08:02Z

well, i suppose it's a breaking change and not really germane to the server per se, but i propose the leadership consider the project could be renamed something other than "llama" because that's one model from one company and this framework can surely be useful for other models besides llama from meta right?

i also think it's distasteful but impossible to avoid the implications of the way the following rules and prohibitions on learning exist 100% to stifle innovation in the AI industry:

https://ai.meta.com/llama/license/

"""
v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

... non-offensive parts omitted for brevity ...

c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.
"""

What makes us confident it's a good idea to tie the name of this project to the name of meta's open weights but not OSI approved license llm? e.g. for future options to use other AI models, are we sure we ought to sink hundreds of engineer years into a project named after a corpo ai model with explicitly monopolistic business terms?

i don't mean to mount a crusade here, just seems reasonable to rename llama.cpp to something else which sounds useful for more than just talking to llama (which produces outputs which we can't even use for other future AI !) and post this here because this is where the feedback link leads; bottom line for me personally is, I wouldn't ever use this for Llama but i'd consider it for Mistral

jukofyork · 2024-07-03T11:20:44Z

well, i suppose it's a breaking change and not really germane to the server per se, but i propose the leadership consider the project could be renamed something other than "llama" because that's one model from one company and this framework can surely be useful for other models besides llama from meta right?

i also think it's distateful but impossible to avoid the implications of the way the following rules and prohibitions on learning exist 100% to stifle innovation in the AI industry:

https://ai.meta.com/llama/license/

""" v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta **may** grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
... non-offensive parts omitted for brevity ...

c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials. """

What makes us confident it's a good idea to tie the name of this project to the name of meta's open weights but not OSI approved license llm? e.g. for future options to use other AI models, are we sure we ought to sink hundreds of engineer years into a project named after a corpo ai model with explicitly monopolistic business terms?

i don't mean to mount a crusade here, just seems reasonable to rename llama.cpp to something else which sounds useful for more than just talking to llama (which produces outputs which we can't even use for other future AI !) and post this here because this is where the feedback link leads; bottom line for me personally is, I wouldn't ever use this for Llama but i'd consider it for Mistral

jukofyork · 2024-07-03

Sorry to say it, but I think this would be a terrible idea and cause lots of confusion.

Think of it like "Hoover" or "Biro" where the default make/manufacturer became the colloquial term.

bionicles · 2024-07-03T17:19:33Z

That's fair, I realize momentum behind current naming makes renaming a nightmare. However, from a legal perspective (IANAL) if their license isn't compatible, then it's a big concern for long term use.

I wrote that fully realizing it probably wouldn't happen. please know I don't like this topic. I didn't write the llama license. Seems license incompatibility of Llama and LLM Compiler with OSI-approved projects is a big concern.

even if nobody reads the fine print or gives a crap, the fact users could get rug pulled / sued / whatever by a megacorp like meta for using llama outputs or the llama name, makes these things exponentially less valuable, even dangerous to use.

My suggestion is to be careful, take the legal stuff seriously, and consult a real lawyer about license / trademark compatibility issues, and pester meta to relicense their models so folks don't need to worry if they can share the output or use it for work stuff.

bionicles · 2024-07-29T18:08:23Z

fyi, my prior concern was fixed amidst the release of llama 3.1

ExtReMLapin · 2024-09-15T12:52:21Z

Any plans to support parallel prompt evaluation? Or am I getting this wrong?

There is discussion #3363 that talks about it, but it's a little old.

Right now, if there are 3 slots and two running requests are at the token generation step, sending a new request will kinda "freeze" the other requests (depending on the pp batch size).

Edit: the batched example uses the same prompt as input, so yeah ... this is not exactly what we need.

ggerganov · 2024-09-15T14:07:41Z

> Right now, if there are 3 slots and two running requests are at the token generation step, sending a new request will kinda "freeze" the other requests (depending on the pp batch size).

This is the expected behaviour. You can reduce the freezing effect by using a smaller batch size, but this can negatively affect the overall performance.

Some dynamic batch size adjustments could be easily implemented though to improve certain use cases.
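
To make the trade-off above concrete, here is a toy cost model in Python. It is not llama.cpp code: the function name `estimate_stall`, the fixed per-step overhead, and the per-token cost are all hypothetical values chosen for illustration. It assumes each server step processes one chunk of at most `n_batch` prompt tokens plus one token for every slot that is already generating, so a generating slot has to wait one full step between its tokens.

```python
import math
from dataclasses import dataclass


@dataclass
class StallEstimate:
    steps: int           # server steps needed to ingest the whole prompt
    worst_gap_ms: float  # longest wait between two tokens of a generating slot
    total_ms: float      # time until the new prompt is fully processed


def estimate_stall(prompt_tokens, n_batch, n_generating,
                   step_overhead_ms=5.0, ms_per_token=0.5):
    """Toy model (assumed costs, not measured): a step costs a fixed
    overhead plus a per-token cost for every token in that step."""
    if prompt_tokens <= 0 or n_batch <= 0 or n_generating < 0:
        raise ValueError("prompt_tokens and n_batch must be positive")

    steps = math.ceil(prompt_tokens / n_batch)
    remaining = prompt_tokens
    worst_gap_ms = 0.0
    total_ms = 0.0
    for _ in range(steps):
        chunk = min(remaining, n_batch)  # prompt tokens handled in this step
        remaining -= chunk
        # Each generating slot contributes one token to the same step.
        step_ms = step_overhead_ms + (chunk + n_generating) * ms_per_token
        worst_gap_ms = max(worst_gap_ms, step_ms)
        total_ms += step_ms
    return StallEstimate(steps, worst_gap_ms, total_ms)


if __name__ == "__main__":
    # Two slots are generating while a 2048-token prompt arrives.
    for n_batch in (2048, 512, 256):
        print(n_batch, estimate_stall(2048, n_batch, n_generating=2))
```

Under these assumed costs, a smaller batch shortens the longest pause seen by the generating slots but adds steps, so the new prompt takes longer overall to process. Real numbers depend on the backend and model, so treat this only as a way to reason about the direction of the effect.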

ExtReMLapin · 2024-09-15T14:12:02Z

Thanks for the answer.

ggerganov added the help wanted, refactoring, and server/webui labels Nov 25, 2023

ggerganov added this to ggml : roadmap Nov 25, 2023

ggerganov moved this to In Progress in ggml : roadmap Nov 25, 2023

ggerganov pinned this issue Nov 25, 2023

mudler mentioned this issue Nov 25, 2023

llama.cpp: infinite loop of context switch mudler/LocalAI#1333

Closed

slaren mentioned this issue Mar 31, 2024

main: port basic LLaVA (multimodal) support from llava-cli #5730

Closed

ngxson mentioned this issue Apr 1, 2024

Add OpenChat, Alpaca, Vicuna chat templates #6397

Merged

This was referenced Apr 10, 2024

Server: Add prompt processing progress endpoint? #6586

Open

server: process prompt fairly accross slots #6607

Open

SignalRT mentioned this issue Apr 15, 2024

Llava Initial approach to clear images SciSharp/LLamaSharp#664

Merged

kaizau mentioned this issue Apr 17, 2024

Proposal: An alternative to chat templates #6726

Closed

4 tasks

arthw unpinned this issue Apr 19, 2024

ngxson mentioned this issue Apr 28, 2024

Generic Chat templating code with text/json file based config; main chat updated to drive its in-prefix, in-suffix and reverse-prompt from same; chat-apply-template equivalent c-api to allow use by other codes also #6834

Draft

khimaros mentioned this issue Apr 29, 2024

Add the Command R chat format abetlen/llama-cpp-python#1382

Open

HanClinto mentioned this issue Jun 5, 2024

Feature Request: Multi session chat support #7758

Closed

4 tasks

mgroeber9110 mentioned this issue Jun 30, 2024

server : fix templates for llama2, llama3 and zephyr in new UI #8196

Closed

4 tasks

nguyenhoangthuan99 mentioned this issue Sep 9, 2024

epic: llama.cpp params are settable via API call or model.yaml janhq/cortex.cpp#1151

Closed

7 tasks

ericcurtin mentioned this issue Oct 9, 2024

No API Documentation containers/ramalama#265

Closed

ggerganov added the roadmap label Feb 4, 2025
