server: main loop blocked, server stuck #5851

phymbert · 2024-03-03T08:52:06Z

Context

Call to following functions are blocking the main loop and the server stuck for all slots / requests in method update_slots

Global:

llama_batch_clear
llama_decode
llama_kv_cache_seq_cp

Per slot:

llama_batch_add
llama_kv_cache_seq_rm
llama_kv_cache_seq_add
llama_kv_cache_seq_div
llama_sampling_free
llama_sampling_init
llama_sampling_accept
llama_sampling_reset
llama_tokenize

If prompt is big enough, self extend or continuous batching are enabled.

Proposal

We need to separate slots state management, tokens retrieval from slots processing but keeping one batch for the whole server.

Firstly, it should be well tested and reproducible in the test server framework in a slow test with a real prompt and model (as in the passkey).

I see 3 options:

We are fine with that, let's wait for the high-level llama api with its own thread pool
Yet another threadpool (+ the http request pool). Initialized with n_slots which will call all this function asynchronously
Use the httplib request thread to call these blocking function

@ggerganov @ngxson please confirm the list of blocking method, which one must be thread safe (I meant only in the main loop).
I am welling to implement option 2 or 3, assign me back the issue if you are OK.

The text was updated successfully, but these errors were encountered:

phymbert · 2024-03-03T09:46:58Z

I would prefer to go with option #3:

assigns an idle slot to an http request
slot process images, prompts, sampling? in http request thread, call batch add (maybe a write lock is required here)
main loop call llama_decode and feed a read/write locked batch results list
slot thread consumes the batch result and writes the response
release the slot

I still have no idea how to deal with context self-extend. Is kv_cache_seq_* can be called concurrently ?

main loop should only be responsible for llama_decode.

Metrics/health might access a read-only slots state updated concurrently by each slot. I would prefer to get rid of the tasks queue at the main loop level as this approach is designed for non blocking ops.

But first, all should be well tested. We still lack multimodal tests.

ggerganov · 2024-03-03T10:36:50Z

We can think about the server architecture in the following way:

the httplib threads are "frontend"
the main loop thread is "backend"
the "frontend" and the "backend" are communicating via message/task queues

Only the backend can use the llama.h API. The frontend should never call directly llama.h functions. Any such requests should be deferred to the backend through tasks.

I'm not really sure that the current implementation is problematic. Yes, the response to a health check can get blocked and delayed, but when the client eventually receives the response, they know it is correct. While, if the response was asynchronous in some way, the information in it would be immediately outdated. Consider:

server is processing very long prompt
server receives a health check request to which it responds without waiting for the prompt processing to end (i.e. through some extra mutex locking and reading the intermediate state, or observing a cached state)
shortly after server responded that it is healthy, it crashes during the ongoing prompt processing
the client now incorrectly thinks the server is healthy

So the current implementation has the advantage that health and metrics provide up-to-date information, but the disadvantage of blocking requests.

I'm looking for ways to avoid adding more mutexes and locking since things can easily get complicated this way. So I'm wondering if we really need non-blocking requests to health and metrics endpoints

ngxson · 2024-03-03T10:39:21Z

The proposal is clear enough for me, but I prefer the first option because the high-level llamax is something we already discussed quite a long time now (but kinda lost in time). I see a real need of this such library, because it will also ease the development of other examples.

One thing I'm not sure about your 3rd option though: which call currently takes more than 1 miliseconds to process? I think our current blocking point is llama_decode, but the other calls finish almost immediately? (or maybe I miss something)

The health and metrics endpoint is mostly blocked by llama_decode I think.

Btw, one trick to have llama_decode to finish earlier (and thus not blocking the queue for too long) is to split the prompr into smaller chunks. However, that's purely a hack because it decrease performance a lot on GPU.

In my personal implementation, I break the prompt into chunks of 20 tokens each. The inference is done on an old CPU so performance lost is too small.

phymbert · 2024-03-03T10:44:34Z

We can think about the server architecture in the following way:

the httplib threads are "frontend"

the main loop thread is "backend"

the "frontend" and the "backend" are communicating via message/task queues

Only the backend can use the llama.h API. The frontend should never call directly llama.h functions. Any such requests should be deferred to the backend through tasks.

I'm not really sure that the current implementation is problematic. Yes, the response to a health check can get blocked and delayed, but when the client eventually receives the response, they know it is correct. While, if the response was asynchronous in some way, the information in it would be immediately outdated. Consider:

server is processing very long prompt

server receives a health check request to which it responds without waiting for the prompt processing to end (i.e. through some extra mutex locking and reading the intermediate state, or observing a cached state)

shortly after server responded that it is healthy, it crashes during the ongoing prompt processing

the client now incorrectly thinks the server is healthy

So the current implementation has the advantage that health and metrics provide up-to-date information, but the disadvantage of blocking requests.

I'm looking for ways to avoid adding more mutexes and locking since things can easily get complicated this way. So I'm wondering if we really need non-blocking requests to health and metrics endpoints

Thanks for your explanation. Understood and agreed for health. But is it normal prompt & image processing, and context self-extend for one slot is blocking the main loop ? can't it be done concurrently ?

ggerganov · 2024-03-03T10:53:31Z

Thanks for your explanation. Understood and agreed for health. But is it normal prompt & image processing, and context self-extend for one slot is blocking the main loop ? can't it be done concurrently ?

It can - it's all about how the llama_batch is populated. At the moment, we take each task and push all tokens into the batch. So if the task has a humongous prompt, then it will fill the batch with a long section of tokens for that task that will block the tasks for other slots. To improve this, we can be more clever about how we queue the tokens from different tasks and avoid having these long segments of single-slot tokens. It's probably as simple as randomly shuffling the data in llama_batch after it has been prepared, while respecting the token position order.

We can improve this at some point, but atm I don't think it is a huge issue

phymbert · 2024-03-03T11:04:42Z

Closing as it works as expected.

phymbert added bug Something isn't working server/webui labels Mar 3, 2024

phymbert assigned ggerganov Mar 3, 2024

phymbert closed this as completed Mar 3, 2024

phymbert mentioned this issue Mar 3, 2024

server: metrics endpoint #5850

Closed

phymbert added enhancement New feature or request and removed bug Something isn't working labels Mar 3, 2024

This was referenced Apr 10, 2024

Server: Add prompt processing progress endpoint? #6586

Open

server: process prompt fairly accross slots #6607

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: main loop blocked, server stuck #5851

server: main loop blocked, server stuck #5851

phymbert commented Mar 3, 2024 •

edited

Loading

phymbert commented Mar 3, 2024 •

edited

Loading

ggerganov commented Mar 3, 2024

ngxson commented Mar 3, 2024 •

edited

Loading

phymbert commented Mar 3, 2024

ggerganov commented Mar 3, 2024 •

edited

Loading

phymbert commented Mar 3, 2024

server: main loop blocked, server stuck #5851

server: main loop blocked, server stuck #5851

Comments

phymbert commented Mar 3, 2024 • edited Loading

Context

Proposal

phymbert commented Mar 3, 2024 • edited Loading

ggerganov commented Mar 3, 2024

ngxson commented Mar 3, 2024 • edited Loading

phymbert commented Mar 3, 2024

ggerganov commented Mar 3, 2024 • edited Loading

phymbert commented Mar 3, 2024

phymbert commented Mar 3, 2024 •

edited

Loading

phymbert commented Mar 3, 2024 •

edited

Loading

ngxson commented Mar 3, 2024 •

edited

Loading

ggerganov commented Mar 3, 2024 •

edited

Loading