PoC: server handling multiple clients with custom attention mask api #3490
Conversation
Works great! Here is another demo on M1 Pro with F16 7B: server-parallel-0.mp4

@FSSRepo Can you allow pushes to your branch so I can push some fixes:
server-parallel : add "--reverse-prompt" + compiler warning fixes
I don't have any restrictions on that branch. @ggerganov, can you please test a fully offloaded model on your Mac, to confirm whether there is a bug in the faster-generation scenario?
First off, this is awesome! Thank you @FSSRepo! I am going to take the shameless opportunity here to request that support for speculative execution be considered; that would make this the first and only OSS LLM server I have come across that supports it out of the box!
Can this support the same generation parameters that are used in the existing server example?
Here is a 30B LLaMA Q4_0 serving 4 clients on M2 Ultra: server-parallel-1.mp4 (I tried 7B and 13B, but the generation is so fast that I am not able to start the requests in parallel.)
I want to add an option to cancel the streaming, but when I use AbortController in the frontend it causes an error. I'm thinking of adding a GET endpoint.
And one more example, using the original UI: server-parallel-2.mp4
Are you working on implementing that UI, or just reusing the endpoints?
No, I just noticed it works and gave it a try. I thought you implemented it.
What kind of error?
How is this working? Are the instances sharing the same weights, or does the model need to be loaded N times?
It just divides the KV cache (context size) among a number of sequences; the limit is the context size, because it is shared between the clients. The model is not reloaded, it is loaded just once.
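For readers wondering how this works under the hood, here is a minimal sketch of the idea (illustrative only, not code from this PR: the `client_slot` bookkeeping and function names are assumptions, and the C API names are the ones llama.cpp exposed around the time of this PR). Each client is assigned its own sequence id, all pending tokens are packed into a single `llama_batch`, and one `llama_decode` call advances every client while they share the same KV cache:

```cpp
#include <utility>
#include <vector>

#include "llama.h"

// Hypothetical per-client state: which sequence id it owns and how many
// tokens it has already placed in the shared KV cache.
struct client_slot {
    int id;
    int n_past = 0;
};

// Decode one pending token per client in a single batched forward pass.
// `pending` holds (slot_id, token) pairs collected from the HTTP handlers.
void decode_step(llama_context * ctx, std::vector<client_slot> & slots,
                 const std::vector<std::pair<int, llama_token>> & pending) {
    llama_batch batch = llama_batch_init(/*n_tokens*/ 512, /*embd*/ 0, /*n_seq_max*/ 1);

    for (const auto & [slot_id, tok] : pending) {
        client_slot & s = slots[slot_id];
        const int i = batch.n_tokens++;
        batch.token   [i]    = tok;
        batch.pos     [i]    = s.n_past++;  // each sequence keeps its own positions
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = slot_id;     // attribute the token to this client's sequence
        batch.logits  [i]    = true;        // we need logits for every client
    }

    llama_decode(ctx, batch);  // one forward pass serves all clients at once
    llama_batch_free(batch);

    // When a client finishes (or disconnects), its cells can be reclaimed with:
    //   llama_kv_cache_seq_rm(ctx, slot_id, -1, -1);
}
```

The important point is that the model weights are loaded once; only the context's KV cells are partitioned, which is why the total context size is the hard limit across all clients.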
I only reused the completion.js function. 😂
```js
let controller = new AbortController();

// Simplified: send the completion request and stream the result as SSE.
async function request(options) {
    const response = await fetch("http://localhost:8080/completion", {
        method: "POST",
        body: JSON.stringify(options),
        headers: {
            "Connection": "keep-alive",
            "Content-Type": "application/json",
            "Accept": "text/event-stream",
        },
        signal: controller.signal,
    });
    // ... read the event stream from response.body ...
}

function cancel() {
    if (controller) {
        /* Even after aborting, the slot is not released: the server keeps
           generating, because the stream never receives the signal that the
           connection was closed.
           Easy fix: create a stop endpoint that notifies the slot to release. */
        controller.abort(); // rejects the pending fetch with a DOMException
    }
}
```
"Note: When So that's probably normal. You can possibly try to catch the exception, like in the example there. I wouldn't think that would cause the connection not to abort properly though, so it not stopping generation on the server side is probably a different problem. |
I will fix it by adding a GET endpoint.
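A sketch of what such an endpoint could look like (the `/cancel` path, the `slot_id` parameter, and the `slot_state` struct are illustrative assumptions, not the API this PR ships), built on cpp-httplib, which the existing server example already uses:

```cpp
#include <atomic>
#include <string>
#include <vector>

#include "httplib.h"  // cpp-httplib, bundled with the existing server example

// Hypothetical per-slot flag that the generation loop checks on every token.
struct slot_state {
    std::atomic<bool> cancelled{false};
};

// The frontend calls GET /cancel?slot_id=N right after controller.abort(),
// so the server releases the slot even though the HTTP stream never
// observed the closed connection.
void register_cancel_endpoint(httplib::Server & svr, std::vector<slot_state> & slots) {
    svr.Get("/cancel", [&slots](const httplib::Request & req, httplib::Response & res) {
        if (!req.has_param("slot_id")) {
            res.status = 400;
            return;
        }
        const int id = std::stoi(req.get_param_value("slot_id"));
        if (id < 0 || id >= (int) slots.size()) {
            res.status = 404;
            return;
        }
        slots[(size_t) id].cancelled = true;  // generation loop stops and frees the slot
        res.set_content("ok", "text/plain");
    });
}
```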
@ggerganov Can you merge the latest changes of master into this branch, please? I'm afraid to make a mistake again.
This example is very nice to have, but I'm wondering if we should instead try to implement the functionality directly in the existing `server` example. I'll give this PR some more time to see if people would be interested in improving this. If not, we will probably merge it and add an item on the roadmap for the future.
I will do it; I just want you to push the latest changes of master first.
You can add this repo as a remote in your fork (e.g. `git remote add upstream https://github.com/ggerganov/llama.cpp`) and merge the latest master into your branch yourself.
@cebtenzzre Thank you so much!
@ggerganov "Do you want me to close this PR? I initially proposed this as a simple example to avoid the complexity of the server example. |
@FSSRepo Let's see if we can make the other PR work, and if so, we will close this one.
Superseded by #3677.
From #3462, I wanted to update my fork to the latest changes from the master branch, but it went wrong :(.
Hello, I know it's something no one asked for, but some of us need it. Here's a proof of concept of a server that handles multiple clients simultaneously, thanks to the new way of working with the KV cache.

Some may wonder why this proposal is reimplemented in a separate example: the current implementation of the server is quite complex, and many things could break.
Tested on:
Server.Parallel.Improvements.mp4
This is a proof of concept for now; with some feedback and assistance, we could make it more usable.
Here is the command to start the server; modify `--parallel` to set the number of slots used to process client requests.
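For reference, a hypothetical invocation (the binary name, model path, and flag values here are illustrative assumptions, not taken from this PR) could look like `./server-parallel -m models/7B/ggml-model-f16.gguf -c 4096 --parallel 4`, which would split the shared 4096-token context into 4 slots of roughly 1024 tokens each.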
Edit: Latest changes:
Note: Many people are going to want to kill me when they see how I handle multithreading without using a mutex; I never knew what that was for :(.
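For what it's worth, here is a minimal sketch of what a mutex is for in this setting (illustrative only, not code from this PR): it serializes access to the shared slot list, so the HTTP threads and the generation loop never read and write slot state at the same time.

```cpp
#include <mutex>
#include <vector>

// Hypothetical shared state: one entry per --parallel slot.
struct slot {
    bool available = true;
    // ... per-client state (prompt, sampling params, n_past, ...) ...
};

std::mutex        slots_mutex;
std::vector<slot> slots(4);

// Called from an HTTP request thread: claim a free slot, or return -1.
int acquire_slot() {
    std::lock_guard<std::mutex> lock(slots_mutex);
    for (size_t i = 0; i < slots.size(); i++) {
        if (slots[i].available) {
            slots[i].available = false;
            return (int) i;
        }
    }
    return -1;
}

// Called when a client finishes or cancels: hand the slot back.
void release_slot(int id) {
    std::lock_guard<std::mutex> lock(slots_mutex);
    slots[(size_t) id].available = true;
}
```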