
Enabling cache_prompt on completion request fills KV cache quickly #4989

Closed
BruceMacD opened this issue Jan 16, 2024 · 5 comments

BruceMacD (Contributor) commented Jan 16, 2024

llama.cpp version: 5c99960

When running the llama.cpp example server and sending requests with cache_prompt enabled, the model starts predicting continuously and fills the KV cache. How long this takes varies with context size, but at the default context size (512) the KV cache can run out very quickly, within 3 requests.

Expected Behavior

Enabling prompt caching does not affect inference, and requests fail gracefully when the KV cache is full.

Current Behavior

Enabling cache_prompt on requests to the example server's /completion endpoint fills the KV cache quickly, with continuous prediction before the failure.

Environment and Context

$ system_profiler SPSoftwareDataType SPHardwareDataType
Software:
    System Software Overview:
      System Version: macOS 14.1 (23B2073)
      Kernel Version: Darwin 23.1.0
Hardware:
    Hardware Overview:
      Model Name: MacBook Pro
      Model Identifier: Mac15,8
      Chip: Apple M3 Max
      Total Number of Cores: 16 (12 performance and 4 efficiency)
      Memory: 128 GB
      System Firmware Version: 10151.41.12
      OS Loader Version: 10151.41.12

Reproduction

  1. Start the llama.cpp example server with the default configuration.
./server -m mistral-7b-instruct-v0.2.Q4_0.gguf
  2. Run this Python script to send requests:
import requests


def main():
    url = "http://127.0.0.1:8080/completion"
    data = {
        "prompt": "Why is the sky blue?",
        "cache_prompt": True,
    }

    for i in range(100):
        print(f"sending request {i=}")
        with requests.post(url, json=data) as response:  # Hangs about every 5 requests on my system
            if not response.ok:
                print(response)


if __name__ == "__main__":
    main()
  3. After a few requests the KV cache will be full.
    Here is the relevant logging:
print_timings: prompt eval time =      16.85 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =    1325.03 ms /    86 runs   (   15.41 ms per token,    64.90 tokens per second)
print_timings:       total time =    1341.87 ms
slot 0 released (93 tokens in cache)
{"timestamp":1705437767,"level":"INFO","function":"log_server_request","line":2812,"message":"request","remote_addr":"127.0.0.1","remote_port":54752,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 2]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1
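
For readability: the retry lines above show the server halving n_batch while it cannot find free space in the KV cache, and only failing once n_batch reaches 1. A rough illustrative Python sketch of that back-off (hypothetical names, not the actual llama.cpp update_slots C++ code):

def find_slot_with_backoff(n_batch, kv_cache_has_space):
    # Illustrative only: mimics the retry pattern in the log above.
    # kv_cache_has_space is a hypothetical callable returning True when
    # the KV cache can hold a batch of the given size.
    while n_batch >= 1:
        if kv_cache_has_space(n_batch):
            return True
        if n_batch == 1:
            print("update_slots : failed to decode the batch, n_batch = 1, ret = 1")
            return False
        n_batch //= 2
        print(f"update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = {n_batch}")
    return False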

Here is the prediction output on the last request before it hangs, with whitespace omitted:

The sky appears blue because of a phenomenon called Rayleigh scattering, which occurs when sunlight enters Earth's atmosphere and encounters tiny molecules of gases such as nitrogen and oxygen. These molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths. This is known as Rayleigh scattering.

*lots of white-space omitted here*

MSMSMSMSMS

Potentially related: #4185
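
Note for anyone reproducing this: the script above just hangs on the failing request. A minimal defensive variant of the request loop (same requests library; the timeout value is an arbitrary choice) at least surfaces the failure instead of blocking indefinitely:

import requests

url = "http://127.0.0.1:8080/completion"
data = {"prompt": "Why is the sky blue?", "cache_prompt": True}

for i in range(100):
    print(f"sending request {i=}")
    try:
        # timeout is arbitrary; without it the client blocks while the
        # server keeps predicting until the KV cache fills
        response = requests.post(url, json=data, timeout=60)
    except requests.Timeout:
        print(f"request {i} timed out (server likely stuck predicting)")
        break
    if not response.ok:
        print(f"request {i} failed: {response.status_code} {response.text}")
        break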

hiro4bbh commented

Did you fix your problem?
I think you would encounter the same problem even with cache_prompt = False.
I suspect the current implementation ignores this option, because llama_kv_cache_find_slot seems to always be called.
Refer: https://github.com/ggerganov/llama.cpp/blob/5c999609013a30c06e6fd28be8db5c2074bcc196/llama.cpp#L6244

If you know the maximum context length you need, passing -c to the server command line seems to solve this problem.
For example, I pass -c 8192 to the server and use n_predict = 2048 in each completion request.
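Concretely, with the script from the Reproduction section, the request body would look something like this (just a sketch of my own settings, not a general recommendation; n_predict caps generation per request so repeated completions stay well inside the -c 8192 window):

data = {
    "prompt": "Why is the sky blue?",
    "cache_prompt": True,
    # cap generated tokens per request; 2048 is just the value I use
    "n_predict": 2048,
}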

sampriti026 commented Jan 19, 2024

Assuming that this is related to the hanging issue caused when format json is enabled, is there any way we can circumvent this?

hiro4bbh commented

I found this issue while searching for the keyword failed to decode the batch (because I encountered this message in my use case).
I did not expect it to be related to "when format json is enabled," so if I have misunderstood this issue, I am sorry for the confusion.

I tried the "Reproduction" section, changing only the first step as follows in order to fix the issue:

./server -m mistral-7b-instruct-v0.2.Q4_0.gguf -c 8192

I downloaded the model from https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q4_0.gguf.
Then I got the following results:

% python3 ./snippet.py 
sending request i=0
sending request i=1
sending request i=2
sending request i=3
...
sending request i=25
sending request i=26
sending request i=27
sending request i=28
sending request i=29

(I stopped after i=29. If I should try more times, please tell me)

% ./server -m mistral-7b-instruct-v0.2.Q4_0.gguf -c 8192
...
print_timings: prompt eval time =     326.72 ms /     7 tokens (   46.67 ms per token,    21.43 tokens per second)
print_timings:        eval time =   11813.43 ms /   242 runs   (   48.82 ms per token,    20.49 tokens per second)
print_timings:       total time =   12140.15 ms
slot 0 released (249 tokens in cache)
{"timestamp":1705661384,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"127.0.0.1","remote_port":64617,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 1]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
...
print_timings: prompt eval time =      59.87 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =    9167.27 ms /   141 runs   (   65.02 ms per token,    15.38 tokens per second)
print_timings:       total time =    9227.14 ms
slot 0 released (148 tokens in cache)
{"timestamp":1705661760,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"127.0.0.1","remote_port":49778,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 21]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits
...
print_timings: prompt eval time =      74.93 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =   20041.94 ms /   281 runs   (   71.32 ms per token,    14.02 tokens per second)
print_timings:       total time =   20116.87 ms
slot 0 released (288 tokens in cache)
{"timestamp":1705661881,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"127.0.0.1","remote_port":50226,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 29]
slot 0 : in cache: 7 tokens | to process: 0 tokens
slot 0 : kv cache rm - [7, end)
slot 0 : we have to evaluate at least 1 token to generate logits

print_timings: prompt eval time =      67.74 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =  441379.86 ms /  5122 runs   (   86.17 ms per token,    11.60 tokens per second)
print_timings:       total time =  441447.61 ms
slot 0 released (5129 tokens in cache)
{"timestamp":1705662322,"level":"INFO","function":"log_server_request","line":2818,"message":"request","remote_addr":"127.0.0.1","remote_port":50307,"status":200,"method":"POST","path":"/completion","params":{}}

I did not see any failed to decode the batch error.

github-actions bot commented

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Mar 18, 2024
github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 3, 2024