server: main loop blocked, server stuck #5851

Closed · phymbert opened this issue Mar 3, 2024 · 6 comments
Labels: enhancement (New feature or request), server/webui

phymbert (Collaborator) commented Mar 3, 2024

Context

Calls to the following functions block the main loop inside update_slots, leaving the server stuck for all slots / requests:

Global:

  • llama_batch_clear
  • llama_decode
  • llama_kv_cache_seq_cp

Per slot:

  • llama_batch_add
  • llama_kv_cache_seq_rm
  • llama_kv_cache_seq_add
  • llama_kv_cache_seq_div
  • llama_sampling_free
  • llama_sampling_init
  • llama_sampling_accept
  • llama_sampling_reset
  • llama_tokenize

This happens when the prompt is large enough, or when self-extend or continuous batching is enabled.
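
For illustration, here is a heavily simplified, hypothetical stand-alone sketch of the pattern (standard library only, all names made up, not the actual server code): every per-slot step and the final decode run serially on one thread, so a single large prompt stalls every other slot and anything else waiting on that loop.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct fake_slot { int id; int n_prompt_tokens; };

// stand-in for the per-slot calls listed above
// (llama_tokenize, llama_kv_cache_seq_*, llama_batch_add, ...)
static void per_slot_work(const fake_slot & slot) {
    std::this_thread::sleep_for(std::chrono::milliseconds(slot.n_prompt_tokens / 8));
}

// stand-in for the single llama_decode call on the shared batch
static void decode_shared_batch() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}

int main() {
    const auto t_start = std::chrono::steady_clock::now();
    std::vector<fake_slot> slots = { {0, 16000}, {1, 16}, {2, 16} };

    // one iteration of the "main loop": slot 0's huge prompt delays slots 1 and 2
    // and any health/metrics request that has to wait for this loop to finish
    for (const auto & slot : slots) {
        per_slot_work(slot);
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                            std::chrono::steady_clock::now() - t_start).count();
        std::printf("slot %d ready for decode after %lld ms\n", slot.id, (long long) ms);
    }
    decode_shared_batch();
    return 0;
}
```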

Proposal

We need to separate slot state management and token retrieval from slot processing, while keeping one batch for the whole server.

First, the issue should be well tested and reproducible in the server test framework, in a slow test with a real prompt and model (as in the passkey example).

I see 3 options:

  1. Accept the current behavior and wait for the high-level llama API with its own thread pool.
  2. Add yet another thread pool (on top of the HTTP request pool), initialized with n_slots threads, which would call all these functions asynchronously (see the sketch after this list).
  3. Use the httplib request thread to call these blocking functions.
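
To make option 2 concrete, a minimal hypothetical sketch (standard library only, illustrative names, not the server's code): the per-slot blocking work fans out to one asynchronous task per slot, and the main loop only waits for completion before the single decode.

```cpp
#include <future>
#include <vector>

struct fake_slot { int id; };

// stand-in for the per-slot blocking calls listed above
// (llama_tokenize, llama_kv_cache_seq_*, llama_batch_add behind a batch lock, ...)
static void process_slot(fake_slot & slot) {
    (void) slot; // blocking per-slot work would go here
}

// stand-in for the single llama_decode call on the shared batch (main loop only)
static void decode_shared_batch() {}

int main() {
    std::vector<fake_slot> slots = { {0}, {1}, {2} };

    // fan out: one asynchronous task per slot (bounded by n_slots)
    std::vector<std::future<void>> pending;
    pending.reserve(slots.size());
    for (auto & slot : slots) {
        pending.push_back(std::async(std::launch::async, process_slot, std::ref(slot)));
    }

    // the main loop only joins the per-slot work and then decodes once
    for (auto & f : pending) {
        f.get();
    }
    decode_shared_batch();
    return 0;
}
```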

@ggerganov @ngxson please confirm the list of blocking methods, and which ones must be thread-safe (I mean within the main loop only).
I am willing to implement option 2 or 3; assign the issue back to me if you agree.

phymbert added the bug (Something isn't working) and server/webui labels on Mar 3, 2024
phymbert (Collaborator, Author) commented Mar 3, 2024

I would prefer to go with option 3 (see the sketch after the list below):

  1. Assign an idle slot to an HTTP request.
  2. The slot processes images, prompts, and possibly sampling in the HTTP request thread, then calls batch add (maybe a write lock is required here).
  3. The main loop calls llama_decode and feeds a read/write-locked batch results list.
  4. The slot thread consumes its batch result and writes the response.
  5. Release the slot.
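
A rough, hypothetical illustration of this flow using only the standard library (all types and names are made up, and error handling and self-extend are ignored):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct batch_entry  { int slot_id; int token; };
struct batch_result { int slot_id; float logit; };

static std::mutex               g_mtx;
static std::condition_variable  g_cv;
static std::vector<batch_entry> g_batch;    // one shared batch for the whole server
static std::queue<batch_result> g_results;  // the read/write-locked batch results list

// step 2: a slot/HTTP thread fills the batch (stand-in for llama_batch_add under a write lock)
static void slot_submit(int slot_id, int token) {
    {
        std::lock_guard<std::mutex> lock(g_mtx);
        g_batch.push_back({slot_id, token});
    }
    g_cv.notify_all();
}

// step 3: the main loop "decodes" the batch and feeds the results list
static void main_loop_decode_once() {
    std::unique_lock<std::mutex> lock(g_mtx);
    g_cv.wait(lock, [] { return !g_batch.empty(); });
    for (const auto & e : g_batch) {
        g_results.push({e.slot_id, 0.0f}); // stand-in for llama_decode output
    }
    g_batch.clear();
    lock.unlock();
    g_cv.notify_all();
}

// step 4: the slot thread consumes its result and writes the HTTP response
static batch_result slot_wait_result(int slot_id) {
    std::unique_lock<std::mutex> lock(g_mtx);
    g_cv.wait(lock, [&] { return !g_results.empty() && g_results.front().slot_id == slot_id; });
    batch_result r = g_results.front();
    g_results.pop();
    return r;
}

int main() {
    std::thread slot_thread([] {
        slot_submit(/*slot_id=*/0, /*token=*/42); // step 2
        (void) slot_wait_result(0);               // step 4
    });                                           // step 5: slot released when the thread returns
    main_loop_decode_once();                      // step 3
    slot_thread.join();
    return 0;
}
```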

I still have no idea how to deal with context self-extend. Can the kv_cache_seq_* functions be called concurrently?

The main loop should only be responsible for llama_decode.

Metrics/health could read a read-only slot state that each slot updates concurrently. I would prefer to get rid of the task queue at the main-loop level, since this approach is designed for non-blocking ops.

But first, everything should be well tested. We still lack multimodal tests.

ggerganov (Owner) commented Mar 3, 2024

We can think about the server architecture in the following way:

  • the httplib threads are "frontend"
  • the main loop thread is "backend"
  • the "frontend" and the "backend" are communicating via message/task queues

Only the backend can use the llama.h API. The frontend should never call llama.h functions directly. Any such requests should be deferred to the backend through tasks.
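
A minimal sketch of that split, using only the standard library (illustrative only, not the server's actual task or queue types): frontend threads never touch llama.h state, they only enqueue tasks and wait for the backend's answer.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct task {
    std::function<std::string()> work;     // runs on the backend only
    std::promise<std::string>    response; // delivered back to the frontend
};

static std::mutex              g_mtx;
static std::condition_variable g_cv;
static std::queue<task>        g_tasks;

// "frontend" (httplib handler): defer the work to the backend, block until it answers
static std::string frontend_request(std::function<std::string()> work) {
    task t;
    t.work = std::move(work);
    std::future<std::string> fut = t.response.get_future();
    {
        std::lock_guard<std::mutex> lock(g_mtx);
        g_tasks.push(std::move(t));
    }
    g_cv.notify_one();
    return fut.get(); // this frontend thread waits here for the backend's result
}

// "backend" (main loop): the only thread allowed to touch llama.h state
static void backend_loop_once() {
    std::unique_lock<std::mutex> lock(g_mtx);
    g_cv.wait(lock, [] { return !g_tasks.empty(); });
    task t = std::move(g_tasks.front());
    g_tasks.pop();
    lock.unlock();
    // e.g. run llama_decode, read slot state, build the health/metrics answer, ...
    t.response.set_value(t.work());
}

int main() {
    std::thread frontend([] {
        // e.g. a /health handler asking the backend for the current state
        std::string res = frontend_request([] { return std::string("{\"status\":\"ok\"}"); });
        (void) res;
    });
    backend_loop_once(); // one iteration of the main loop
    frontend.join();
    return 0;
}
```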

I'm not really sure that the current implementation is problematic. Yes, the response to a health check can get blocked and delayed, but when the client eventually receives the response, they know it is correct. Whereas if the response were asynchronous in some way, the information in it would be immediately outdated. Consider:

  • the server is processing a very long prompt
  • the server receives a health check request to which it responds without waiting for the prompt processing to end (i.e. through some extra mutex locking and reading of intermediate state, or by observing a cached state)
  • shortly after the server responds that it is healthy, it crashes during the ongoing prompt processing
  • the client now incorrectly thinks the server is healthy

So the current implementation has the advantage that health and metrics provide up-to-date information, but the disadvantage of blocking requests.

I'm looking for ways to avoid adding more mutexes and locking, since things can easily get complicated this way. So I'm wondering if we really need non-blocking requests to the health and metrics endpoints.

ngxson (Collaborator) commented Mar 3, 2024

The proposal is clear enough for me, but I prefer the first option, because the high-level llamax library is something we already discussed quite a while ago (though it got lost over time). I see a real need for such a library, because it would also ease the development of other examples.

One thing I'm not sure about regarding your 3rd option though: which call currently takes more than 1 millisecond to process? I think our current blocking point is llama_decode, and the other calls finish almost immediately (or maybe I'm missing something).

The health and metrics endpoints are mostly blocked by llama_decode, I think.

Btw, one trick to make llama_decode finish earlier (and thus not block the queue for too long) is to split the prompt into smaller chunks. However, that's purely a hack, because it decreases performance a lot on GPU.

In my personal implementation, I break the prompt into chunks of 20 tokens each. Inference runs on an old CPU, so the performance loss is small.
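
For reference, a sketch of how such chunking could look (assuming the llama_batch_add / llama_batch_clear helpers from llama.cpp's common.h and a prompt decoded into an initially empty sequence; the chunk size is the latency/throughput trade-off mentioned above):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

#include "common.h"
#include "llama.h"

// feed the prompt in small chunks so each individual llama_decode call returns
// quickly and other work can be serviced between chunks; returns false on error
static bool decode_prompt_chunked(llama_context * ctx, const std::vector<llama_token> & prompt,
                                  llama_seq_id seq_id, int32_t n_chunk = 20) {
    llama_batch batch = llama_batch_init(n_chunk, 0, 1);

    for (size_t i = 0; i < prompt.size(); i += n_chunk) {
        const size_t n = std::min((size_t) n_chunk, prompt.size() - i);

        llama_batch_clear(batch);
        for (size_t j = 0; j < n; ++j) {
            // request logits only for the very last prompt token
            const bool logits = (i + j == prompt.size() - 1);
            llama_batch_add(batch, prompt[i + j], (llama_pos) (i + j), { seq_id }, logits);
        }

        if (llama_decode(ctx, batch) != 0) {
            llama_batch_free(batch);
            return false;
        }
        // other slots / requests could be serviced here between chunks
    }

    llama_batch_free(batch);
    return true;
}
```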

phymbert (Collaborator, Author) commented Mar 3, 2024

> We can think about the server architecture in the following way: the httplib threads are "frontend", the main loop thread is "backend", and the two communicate via message/task queues. Only the backend can use the llama.h API. […] So the current implementation has the advantage that health and metrics provide up-to-date information, but the disadvantage of blocking requests. […]

Thanks for your explanation. Understood and agreed for health. But is it normal that prompt & image processing and context self-extend for one slot block the main loop? Can't they be done concurrently?

ggerganov (Owner) commented Mar 3, 2024

> Thanks for your explanation. Understood and agreed for health. But is it normal that prompt & image processing and context self-extend for one slot block the main loop? Can't they be done concurrently?

It can - it's all about how the llama_batch is populated. At the moment, we take each task and push all tokens into the batch. So if the task has a humongous prompt, then it will fill the batch with a long section of tokens for that task that will block the tasks for other slots. To improve this, we can be more clever about how we queue the tokens from different tasks and avoid having these long segments of single-slot tokens. It's probably as simple as randomly shuffling the data in llama_batch after it has been prepared, while respecting the token position order.
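
As an illustration of that idea, here is a hypothetical, standard-library-only sketch that uses a round-robin interleave rather than a literal shuffle (the names and types are made up): each slot contributes one token per turn, so no single slot monopolizes the batch, and within each slot the tokens keep their original position order.

```cpp
#include <cstddef>
#include <vector>

struct pending_tokens {
    int              slot_id;
    std::vector<int> tokens;  // this slot's not-yet-decoded prompt tokens, in order
};

struct batch_item {
    int slot_id;
    int token;
};

// take one token per slot in turn until the batch is full or all slots are drained
static std::vector<batch_item> build_interleaved_batch(const std::vector<pending_tokens> & slots,
                                                       size_t n_batch) {
    std::vector<batch_item> batch;
    std::vector<size_t>     next(slots.size(), 0); // per-slot read cursor

    bool progressed = true;
    while (batch.size() < n_batch && progressed) {
        progressed = false;
        for (size_t s = 0; s < slots.size() && batch.size() < n_batch; ++s) {
            if (next[s] < slots[s].tokens.size()) {
                batch.push_back({slots[s].slot_id, slots[s].tokens[next[s]++]});
                progressed = true;
            }
        }
    }
    return batch;
}
```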

We can improve this at some point, but atm I don't think it is a huge issue.

phymbert (Collaborator, Author) commented Mar 3, 2024

Closing as it works as expected.

phymbert closed this as completed on Mar 3, 2024
phymbert added the enhancement (New feature or request) label and removed the bug (Something isn't working) label on Mar 3, 2024