
Enabling cache_prompt on completion request fills KV cache quickly #4989

Closed
BruceMacD opened this issue Jan 16, 2024 · 5 comments

BruceMacD (Contributor) commented Jan 16, 2024

llama.cpp version: 5c99960

When running the llama.cpp example server and sending requests with cache_prompt enabled, the model starts predicting continuously and fills the KV cache. How long this takes varies with context size, but at the default context size (512) the KV cache can run out very quickly, within 3 requests.

Expected Behavior

Enabling prompt caching does not affect inference, and requests fail gracefully when the KV cache is full.

Current Behavior

Enabling cache_prompt on requests to the example server's /completion endpoint fills the KV cache quickly, with continuous prediction before the failure.

Environment and Context

$ system_profiler SPSoftwareDataType SPHardwareDataType
Software:
    System Software Overview:
      System Version: macOS 14.1 (23B2073)
      Kernel Version: Darwin 23.1.0
Hardware:
    Hardware Overview:
      Model Name: MacBook Pro
      Model Identifier: Mac15,8
      Chip: Apple M3 Max
      Total Number of Cores: 16 (12 performance and 4 efficiency)
      Memory: 128 GB
      System Firmware Version: 10151.41.12
      OS Loader Version: 10151.41.12

Reproduction

  1. Start the llama.cpp example server with the default configuration.
./server -m mistral-7b-instruct-v0.2.Q4_0.gguf
  2. Run this Python script to send requests:
import requests


def main():
    url = "http://127.0.0.1:8080/completion"
    data = {
        "prompt": "Why is the sky blue?",
        "cache_prompt": True,
    }

    for i in range(100):
        print(f"sending request {i=}")
        with requests.post(url, json=data) as response:  # Hangs about every 5 requests on my system
            if not response.ok:
                print(response)


if __name__ == "__main__":
    main()
  3. After a few requests the KV cache will be full.
    Here is the relevant logging:
print_timings: prompt eval time =      16.85 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =    1325.03 ms /    86 runs   (   15.41 ms per token,    64.90 tokens per second)
print_timings:       total time =    1341.87 ms
slot 0 released (93 tokens in cache)
{"timestamp":1705437767,"level":"INFO","function":"log_server_request","line":2812,"message":"request","remote_addr":"127.0.0.1","remote_port":54752,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 2]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1
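
For readability: the retry lines above show the server halving n_batch while it cannot find free space in the KV cache, and only failing once n_batch reaches 1. A rough illustrative Python sketch of that back-off (hypothetical names, not the actual llama.cpp update_slots C++ code):

def find_slot_with_backoff(n_batch, kv_cache_has_space):
    # Illustrative only: mimics the retry pattern in the log above.
    # kv_cache_has_space is a hypothetical callable returning True when
    # the KV cache can hold a batch of the given size.
    while n_batch >= 1:
        if kv_cache_has_space(n_batch):
            return True
        if n_batch == 1:
            print("update_slots : failed to decode the batch, n_batch = 1, ret = 1")
            return False
        n_batch //= 2
        print(f"update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = {n_batch}")
    return False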

Here is the prediction output on the last request before it hangs, with whitespace omitted:

The sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight enters Earth's atmosphere and encounters tiny molecules of gases such as nitrogen and oxygen. These molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths. This is known as Rayleigh scattering.

*lots of white-space omitted here*

MSMSMSMSMS

Potentially related: #4185
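
Note for anyone reproducing this: the script above just hangs on the failing request. A minimal defensive variant of the request loop (same requests library; the timeout value is an arbitrary choice) at least surfaces the failure instead of blocking indefinitely:

import requests

url = "http://127.0.0.1:8080/completion"
data = {"prompt": "Why is the sky blue?", "cache_prompt": True}

for i in range(100):
    print(f"sending request {i=}")
    try:
        # timeout is arbitrary; without it the client blocks while the
        # server keeps predicting until the KV cache fills
        response = requests.post(url, json=data, timeout=60)
    except requests.Timeout:
        print(f"request {i} timed out (server likely stuck predicting)")
        break
    if not response.ok:
        print(f"request {i} failed: {response.status_code} {response.text}")
        break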

hiro4bbh commented

Did you fix your problem?
I think you would encounter the same problem even with cache_prompt = False.
I suspect the current implementation ignores this option, because llama_kv_cache_find_slot seems to always be called.
Refer: https://github.com/ggerganov/llama.cpp/blob/5c999609013a30c06e6fd28be8db5c2074bcc196/llama.cpp#L6244

If you know the maximum context length you need, passing -c to the server command line seems to solve this problem.
For example, I pass -c 8192 to the server and use n_predict = 2048 in each completion request.
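Concretely, with the script from the Reproduction section, the request body would look something like this (just a sketch of my own settings, not a general recommendation; n_predict caps generation per request so repeated completions stay well inside the -c 8192 window):

data = {
    "prompt": "Why is the sky blue?",
    "cache_prompt": True,
    # cap generated tokens per request; 2048 is just the value I use
    "n_predict": 2048,
}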

sampriti026 commented Jan 19, 2024

Assuming that this is related to the hanging issue caused when format json is enabled, is there any way we can circumvent this?

hiro4bbh commented

I found this issue while searching for the keyword failed to decode the batch (because I encountered this message in my use case).
I did not expect it to be related to "when format json is enabled," so if I have misunderstood this issue, I am sorry for the confusion.

I tried the "Reproduction" section, changing only the first step as follows in order to fix the issue:

./server -m mistral-7b-instruct-v0.2.Q4_0.gguf -c 8192

I downloaded the model from https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q4_0.gguf.
Then I got the following results:

% python3 ./snippet.py 
sending request i=0
sending request i=1
sending request i=2
sending request i=3
...
sending request i=25
sending request i=26
sending request i=27
sending request i=28
sending request i=29

(I stopped after i=29. If I should try more times, please tell me)

% ./server -m mistral-7b-instruct-v0.2.Q4_0.gguf -c 8192
...
print_timings: prompt eval time =     326.72 ms /     7 tokens (   46.67 ms per token,    21.43 tokens per second)
print_timings:        eval time =   11813.43 ms /   242 runs   (   48.82 ms per token,    20.49 tokens per second)
print_timings:       total time =   12140.15 ms
slot 0 released (249 tokens in cache)
{"timestamp":1705661384,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"127.0.0.1","remote_port":64617,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 1]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
...
print_timings: prompt eval time =      59.87 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =    9167.27 ms /   141 runs   (   65.02 ms per token,    15.38 tokens per second)
print_timings:       total time =    9227.14 ms
slot 0 released (148 tokens in cache)
{"timestamp":1705661760,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"127.0.0.1","remote_port":49778,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 21]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
...
print_timings: prompt eval time =      74.93 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =   20041.94 ms /   281 runs   (   71.32 ms per token,    14.02 tokens per second)
print_timings:       total time =   20116.87 ms
slot 0 released (288 tokens in cache)
{"timestamp":1705661881,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"127.0.0.1","remote_port":50226,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 29]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits

print_timings: prompt eval time =      67.74 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =  441379.86 ms /  5122 runs   (   86.17 ms per token,    11.60 tokens per second)
print_timings:       total time =  441447.61 ms
slot 0 released (5129 tokens in cache)
{"timestamp":1705662322,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"127.0.0.1","remote_port":50307,"status":200,"method":"POST","path":"/completion","params":{}}

I did not see any failed to decode the batch error.

github-actions bot commented

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Mar 18, 2024
github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 3, 2024