update_slots : failed to decode the batch #4185
Comments
It would be expected behavior if the KV cache is actually full. Since you set the context size to 2,048, that means that when generated tokens + prompt tokens add up to 2,048 the cache is full and it won't be possible to find a slot.
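For illustration, here is a minimal client-side sketch of that arithmetic, assuming a stock llama.cpp server reachable on localhost:8080 with its /tokenize and /completion endpoints and started with -c 2048 (the helper name and addresses are made up, not from this report):

import requests

SERVER = "http://localhost:8080"   # assumed server address, not from the report
CTX_SIZE = 2048                    # must match the -c value the server was started with

def safe_completion(prompt, max_new_tokens=256):
    # Ask the server how many tokens the prompt occupies.
    toks = requests.post(f"{SERVER}/tokenize", json={"content": prompt}).json()["tokens"]
    # Leave room for generation; otherwise the KV cache can fill up mid-generation.
    budget = CTX_SIZE - len(toks)
    if budget <= 0:
        raise ValueError(f"prompt alone ({len(toks)} tokens) exceeds the context size")
    n_predict = min(max_new_tokens, budget)
    resp = requests.post(f"{SERVER}/completion",
                         json={"prompt": prompt, "n_predict": n_predict})
    return resp.json()["content"]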
Hey, thanks for your response! Okay, it seems true that it only happens for a single entry in my dataset, which then means that that entry overflows the cache. But upping the context size to 4096 does not seem to allow this entry. Also, the logs show
EDIT: Okay, never mind, it seems it is actually the next entry, which didn't get logged, that is too big. Nonetheless, shouldn't the server return an error instead of hanging on this failure?
Can't argue with that. :)
@rvandernoort To avoid such errors, you should refrain from sending the system prompt every time you make a request. It is only required once, and subsequent requests will maintain the same system prompt without the need to resend it repeatedly.
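As a related, hedged illustration: if your server build supports the cache_prompt request field, a shared prefix such as an unchanging system prompt is only decoded once and reused from the KV cache on later requests. A rough sketch, assuming the same hypothetical localhost:8080 server as above and a placeholder system prompt:

import requests

SERVER = "http://localhost:8080"        # assumed address
SYSTEM = "You are a helpful assistant." # placeholder system prompt

def ask(user_prompt):
    # cache_prompt asks the server to reuse cached KV entries for the common
    # prefix (here, the system prompt) instead of re-decoding it each time.
    payload = {
        "prompt": SYSTEM + "\n" + user_prompt,
        "n_predict": 128,
        "cache_prompt": True,
    }
    return requests.post(f"{SERVER}/completion", json=payload).json()["content"]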
Thanks for your suggestion. However, I'm trying to run a dataset with high variability in the system prompt per prompt. Does this have any impact on the KV cache problem, or is it just a matter of better performance if I don't resend the system prompt? Generally, the system prompts are very small, though; I think the issue lies with large regular prompts.
Using the latest commit as of this writing: e00d2a6. After 366 successful generations, of which 5 had successful context shifts, the bug strikes:
The update_slots messages repeat infinitely as others have reported. Server was launched with:
I unfortunately did not log the failing request, so I cannot confirm this, but I have a hunch this problem occurs when the context shift needs to happen during generation rather than during prompt processing. I do not use system prompts at all!
This issue was closed because it has been inactive for 14 days since being marked as stale.
I'm getting this same issue, though I'm not yet on the latest code (no internet access for my server at the moment, so I can't test to see if more recent versions fix it). Very annoying - this is not appropriate behavior. Appropriate behavior would be to just fail the query instead of getting stuck in an infinite loop. |
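As a client-side stopgap (not a fix for the server itself), a per-request timeout at least keeps a batch run from blocking forever when the server wedges. A sketch assuming the Python requests library and the same assumed server address as above:

import requests

try:
    resp = requests.post("http://localhost:8080/completion",
                         json={"prompt": "Hello", "n_predict": 64},
                         timeout=120)  # seconds; tune to your longest expected generation
    resp.raise_for_status()
    print(resp.json()["content"])
except requests.Timeout:
    # The server never answered: log the failing prompt and move on
    # instead of hanging the whole run.
    print("request timed out; skipping this prompt")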
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.
Run 1000 prompts over an hour using the server.
Current Behavior
Please provide a detailed written description of what llama.cpp did, instead.
Around 800 requests in, the KV cache appears to fill up and the whole server application hangs indefinitely, no longer accepting any requests.
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
$ lscpu
$ uname -a
Failure Information (for bugs)
Please help provide information about the failure / bug.
So I'm running a large number of inference requests against the server. It accepts many of them, but eventually fails to find free space in the KV cache. Based on this comment I increased the model's context size to 2048, which significantly increased the number of requests it resolves before hanging again.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
./server -m /models/TinyLLama/original/ggml-model-f32.gguf -t 12 -ngl 99 -c 2048
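To approximate the workload described above, a hypothetical driver loop along these lines sends prompts sequentially to the server started with the command shown (the prompts.jsonl file and its field names are placeholders, not taken from this report):

import json
import requests

SERVER = "http://localhost:8080"  # default llama.cpp server port; adjust if needed

# prompts.jsonl is a placeholder: one JSON object per line with "system" and "prompt" keys.
with open("prompts.jsonl") as f:
    for i, line in enumerate(f, 1):
        entry = json.loads(line)
        payload = {
            "prompt": entry["system"] + "\n" + entry["prompt"],
            "n_predict": 256,
        }
        resp = requests.post(f"{SERVER}/completion", json=payload, timeout=300)
        print(f"[{i}] {len(resp.json().get('content', ''))} chars generated")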
Failure Logs
while everything before that works fine, like:
Is there something I'm not doing correctly, or is this expected behaviour? Could it be something with the system prompts (my dataset has a variety of system prompts, which get set for every request)? Should the server not flush the cache when a prompt no longer fits? If you need more info, let me know.