Enabling cache_prompt on completion request fills KV cache quickly #4989
Comments
Did you fix your problem? If you determine the maximum length of context, passing (see the sketch after these comments)

Assuming this is related to the hanging issue caused when format json is enabled, is there any way we can circumvent this?

I found this issue while searching by keyword. I tried the "Reproduction" section, but changed only the first step as follows in order to fix the issue: I downloaded the model from https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q4_0.gguf. (I stopped after

I did not see any

This issue is stale because it has been open for 30 days with no activity.

This issue was closed because it has been inactive for 14 days since being marked as stale.
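The first comment above is cut off; as a rough sketch of what the suggestion likely amounts to, assuming it refers to the example server's -c (context size) flag and the per-request n_predict parameter (the model path, port, and values here are placeholders):

```bash
# Start the example server with an explicitly chosen context size
# (placeholder model path; -c sets the maximum context length):
./server -m ./models/mistral-7b-instruct-v0.2.Q4_0.gguf -c 4096

# Cap the number of tokens generated per request so a runaway
# prediction cannot fill the KV cache on its own:
curl -s http://localhost:8080/completion -d '{
  "prompt": "Hello",
  "n_predict": 64,
  "cache_prompt": true
}'
```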
llama.cpp version: 5c99960
When running the llama.cpp example server and sending requests with cache_prompt enabled, the model will start predicting continuously and fill the KV cache. How long this takes varies with context size, but at the default context size (512) the KV cache can run out very quickly, within 3 requests.

Expected Behavior
Enabling prompt caching does not affect inference, and the request fails gracefully when the KV cache is full.
Current Behavior
Enabling cache_prompt on requests to the example server's /completion endpoint fills the KV cache quite quickly, with continuous prediction before the failure.
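For reference, a request of this shape is enough to trigger the behavior; a minimal sketch, assuming the server's default host and port, with a placeholder prompt:

```bash
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Write a short story about llamas.",
    "n_predict": 128,
    "cache_prompt": true
  }'
```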
Environment and Context
Reproduction
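The original reproduction steps are not preserved above; the following is a minimal sketch consistent with the description (placeholder model path; -c 512 matches the default context size mentioned):

```bash
# Start the example server at the default context size of 512:
./server -m ./models/model.gguf -c 512

# From another shell, send a few completion requests with
# cache_prompt enabled; per the report, the KV cache fills
# within about 3 requests:
for i in 1 2 3; do
  curl -s http://localhost:8080/completion -d '{
    "prompt": "Tell me about llamas.",
    "cache_prompt": true
  }'
done
```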
Here is the relevant logging:
Here is the prediction output on the last request before it hangs, with whitespace omitted:
Potentially related: #4185