
Improved KV cache loading performance for Vulkan, resulting in a 20x acceleration #11815

Open
idales wants to merge 1 commit into master

Conversation

@idales commented Feb 12, 2025

Improved KV cache loading performance for Vulkan, resulting in a 20x acceleration

In the current implementation, when loading the KV cache, the ggml_backend_tensor_set function is called repeatedly for each layer. For the Vulkan backend, each call takes 1-2 ms, but due to the large number of calls, loading the entire cache can take around 0.5 seconds, depending on the model being used. This is particularly noticeable in applications where prompts need to be changed frequently and quickly.
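
For illustration, here is a minimal sketch of the repeated-transfer pattern being described, assuming the transposed per-layer cache layout discussed later in this thread; names such as n_rows, kv_size, cell_start, cell_count and elt_size are illustrative, not the exact identifiers used in llama.cpp:

```cpp
#include <cstdint>
#include "ggml-backend.h"

// Sketch only: restore one layer of a transposed cache tensor row by row.
// Every row sits at a large stride in the destination tensor, so each row
// becomes its own small host->device transfer, and each transfer pays the
// per-call overhead (~1-2 ms on Vulkan, as measured above).
static void restore_layer_rowwise(ggml_tensor * dst, const uint8_t * src,
                                  size_t n_rows, size_t kv_size,
                                  size_t cell_start, size_t cell_count,
                                  size_t elt_size) {
    for (size_t ir = 0; ir < n_rows; ++ir) {
        const size_t dst_off   = (ir*kv_size + cell_start)*elt_size;
        const size_t row_bytes = cell_count*elt_size;
        ggml_backend_tensor_set(dst, src, dst_off, row_bytes);
        src += row_bytes;
    }
}
```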

Optimization Idea:
The proposed optimization involves assembling the KV cache for each layer in memory and then loading it into the backend with a single call to ggml_backend_tensor_set. In my case, this approach resulted in a 20x improvement in KV cache loading speed.
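
A minimal sketch of the proposed staging approach, reusing the illustrative names above and assuming the whole per-layer tensor is assembled host-side before the upload:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include "ggml-backend.h"

// Sketch only: assemble the layer's contents in system memory, in the
// tensor's own layout, then push everything to the backend with a single
// ggml_backend_tensor_set call instead of n_rows small ones.
static void restore_layer_staged(ggml_tensor * dst, const uint8_t * src,
                                 size_t n_rows, size_t kv_size,
                                 size_t cell_start, size_t cell_count,
                                 size_t elt_size) {
    std::vector<uint8_t> staging(ggml_nbytes(dst));  // host copy of the layer tensor
    for (size_t ir = 0; ir < n_rows; ++ir) {
        const size_t dst_off   = (ir*kv_size + cell_start)*elt_size;
        const size_t row_bytes = cell_count*elt_size;
        memcpy(staging.data() + dst_off, src, row_bytes);
        src += row_bytes;
    }
    // one large host->device copy per layer
    ggml_backend_tensor_set(dst, staging.data(), 0, staging.size());
}
```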

When running the CI pipeline locally, I encountered the following error after approximately 1.5 hours:

1.24.132.134 I perplexity: 80.88 seconds per pass - ETA 5.38 minutes
[1]inf,terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost
./ci/run.sh: line 614: 1312784 Aborted                 (core dumped) ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4

real    1m46,938s
user    0m1,617s
sys     0m0,911s

However, the same error occurs when running the original, unmodified code. I am unsure if there are specific tests for cache loading, but I have tested the modified code in my application and did not encounter any issues.

@0cc4m (Collaborator) commented Feb 15, 2025

Interesting. There is probably higher overhead on Vulkan because async memory transfer operations are not implemented. I don't think this can be merged in this form, but we should definitely discuss possible solutions (@slaren).

@slaren (Member) commented Feb 16, 2025

The problem in this case is not that the KV cache is loaded layer by layer; this PR does not change that, and it still requires one call to ggml_backend_tensor_set per layer. Without flash attention, the V cache is transposed, which results in n_embd_v transfers per layer to restore it, and that is what this change may improve: it builds a complete copy of the V cache in memory first and then transfers it at once. However, when the portion of the cache being loaded is small compared to the total size of the cache, this results in a much larger amount of data being transferred to the GPU, so I don't think this change would be universally faster, in addition to increasing memory usage.
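
A back-of-envelope comparison of the per-layer data volume, using made-up but plausible numbers rather than measurements from this PR:

```cpp
// Illustrative numbers only: restoring 64 cells of a 4096-cell F16 cache
// with n_embd_v = 4096.
#include <cstddef>

constexpr size_t n_embd_v   = 4096;
constexpr size_t kv_size    = 4096;  // total cells in the cache
constexpr size_t cell_count = 64;    // cells actually being restored
constexpr size_t elt        = 2;     // bytes per F16 element

constexpr size_t row_by_row = n_embd_v*cell_count*elt;  // 512 KiB, in n_embd_v small copies
constexpr size_t staged     = n_embd_v*kv_size*elt;     // 32 MiB, in one large copy
// staged / row_by_row == kv_size / cell_count == 64x more data in this example
static_assert(staged / row_by_row == kv_size / cell_count, "ratio check");
```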

Supporting flash attention would solve this problem by allowing the use of a non-transposed V cache, then a single ggml_backend_tensor_set call per layer would be enough. So I believe it would be preferable to work in that direction instead. Using async memory transfers could also help, but it would also require loading the entire cache into system memory first, increasing memory usage.
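
For comparison, a hypothetical sketch of what a non-transposed V cache would allow, again with illustrative names rather than actual llama.cpp identifiers:

```cpp
#include <cstdint>
#include "ggml-backend.h"

// Sketch only: with a non-transposed layout the restored cells form one
// contiguous region per layer, so a single transfer per layer moves only the
// data that is actually needed, with no host-side staging copy.
static void restore_layer_contiguous(ggml_tensor * dst, const uint8_t * src,
                                     size_t n_embd_v, size_t cell_start,
                                     size_t cell_count, size_t elt_size) {
    const size_t off   = cell_start*n_embd_v*elt_size;
    const size_t bytes = cell_count*n_embd_v*elt_size;
    ggml_backend_tensor_set(dst, src, off, bytes);
}
```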
