
Improved KV cache loading performance for Vulkan, resulting in a 20x acceleration #11815

Open
idales wants to merge 1 commit into master

Conversation

@idales commented Feb 12, 2025

Improved KV cache loading performance for Vulkan, resulting in a 20x acceleration

In the current implementation, when loading the KV cache, the ggml_backend_tensor_set function is called repeatedly for each layer. For the Vulkan backend, each call takes 1-2 ms, but due to the large number of calls, loading the entire cache can take around 0.5 seconds, depending on the model being used. This is particularly noticeable in applications where prompts need to be changed frequently and quickly.
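
For illustration, here is a minimal sketch of the repeated-transfer pattern being described, assuming the transposed per-layer cache layout discussed later in this thread; names such as n_rows, kv_size, cell_start, cell_count and elt_size are illustrative, not the exact identifiers used in llama.cpp:

```cpp
#include <cstdint>
#include "ggml-backend.h"

// Sketch only: restore one layer of a transposed cache tensor row by row.
// Every row sits at a large stride in the destination tensor, so each row
// becomes its own small host->device transfer, and each transfer pays the
// per-call overhead (~1-2 ms on Vulkan, as measured above).
static void restore_layer_rowwise(ggml_tensor * dst, const uint8_t * src,
                                  size_t n_rows, size_t kv_size,
                                  size_t cell_start, size_t cell_count,
                                  size_t elt_size) {
    for (size_t ir = 0; ir < n_rows; ++ir) {
        const size_t dst_off   = (ir*kv_size + cell_start)*elt_size;
        const size_t row_bytes = cell_count*elt_size;
        ggml_backend_tensor_set(dst, src, dst_off, row_bytes);
        src += row_bytes;
    }
}
```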

Optimization Idea:
The proposed optimization involves assembling the KV cache for each layer in memory and then loading it into the backend with a single call to ggml_backend_tensor_set. In my case, this approach resulted in a 20x improvement in KV cache loading speed.
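
A minimal sketch of the proposed staging approach, reusing the illustrative names above and assuming the whole per-layer tensor is assembled host-side before the upload:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include "ggml-backend.h"

// Sketch only: assemble the layer's contents in system memory, in the
// tensor's own layout, then push everything to the backend with a single
// ggml_backend_tensor_set call instead of n_rows small ones.
static void restore_layer_staged(ggml_tensor * dst, const uint8_t * src,
                                 size_t n_rows, size_t kv_size,
                                 size_t cell_start, size_t cell_count,
                                 size_t elt_size) {
    std::vector<uint8_t> staging(ggml_nbytes(dst));  // host copy of the layer tensor
    for (size_t ir = 0; ir < n_rows; ++ir) {
        const size_t dst_off   = (ir*kv_size + cell_start)*elt_size;
        const size_t row_bytes = cell_count*elt_size;
        memcpy(staging.data() + dst_off, src, row_bytes);
        src += row_bytes;
    }
    // one large host->device copy per layer
    ggml_backend_tensor_set(dst, staging.data(), 0, staging.size());
}
```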

When running the CI pipeline locally, I encountered the following error after approximately 1.5 hours:

1.24.132.134 I perplexity: 80.88 seconds per pass - ETA 5.38 minutes
[1]inf,terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Device::waitForFences: ErrorDeviceLost
./ci/run.sh: line 614: 1312784 Aborted                 (core dumped) ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4

real    1m46,938s
user    0m1,617s
sys     0m0,911s

However, the same error occurs when running the original, unmodified code. I am unsure if there are specific tests for cache loading, but I have tested the modified code in my application and did not encounter any issues.

@0cc4m (Collaborator) commented Feb 15, 2025

Interesting. There is probably higher overhead on Vulkan because async memory transfer operations are not implemented. I don't think this can be merged in this form, but we should definitely discuss possible solutions (@slaren).

@slaren (Member) commented Feb 16, 2025

The problem in this case is not that the KV cache is loaded layer by layer; this PR does not change that, and it still requires one call to ggml_backend_tensor_set per layer. Without flash attention, the V cache is transposed, which results in n_embd_v transfers per layer to restore it, and that is what this change may improve: it builds a complete copy of the V cache in memory first and then transfers it at once. However, when the portion of the cache being loaded is small compared to the total size of the cache, this results in a much larger amount of data being transferred to the GPU, so I don't think this change would be universally faster, in addition to increasing memory usage.
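
A back-of-envelope comparison of the per-layer data volume, using made-up but plausible numbers rather than measurements from this PR:

```cpp
// Illustrative numbers only: restoring 64 cells of a 4096-cell F16 cache
// with n_embd_v = 4096.
#include <cstddef>

constexpr size_t n_embd_v   = 4096;
constexpr size_t kv_size    = 4096;  // total cells in the cache
constexpr size_t cell_count = 64;    // cells actually being restored
constexpr size_t elt        = 2;     // bytes per F16 element

constexpr size_t row_by_row = n_embd_v*cell_count*elt;  // 512 KiB, in n_embd_v small copies
constexpr size_t staged     = n_embd_v*kv_size*elt;     // 32 MiB, in one large copy
// staged / row_by_row == kv_size / cell_count == 64x more data in this example
static_assert(staged / row_by_row == kv_size / cell_count, "ratio check");
```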

Supporting flash attention would solve this problem by allowing the use of a non-transposed V cache, then a single ggml_backend_tensor_set call per layer would be enough. So I believe it would be preferable to work in that direction instead. Using async memory transfers could also help, but it would also require loading the entire cache into system memory first, increasing memory usage.
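
For comparison, a hypothetical sketch of what a non-transposed V cache would allow, again with illustrative names rather than actual llama.cpp identifiers:

```cpp
#include <cstdint>
#include "ggml-backend.h"

// Sketch only: with a non-transposed layout the restored cells form one
// contiguous region per layer, so a single transfer per layer moves only the
// data that is actually needed, with no host-side staging copy.
static void restore_layer_contiguous(ggml_tensor * dst, const uint8_t * src,
                                     size_t n_embd_v, size_t cell_start,
                                     size_t cell_count, size_t elt_size) {
    const size_t off   = cell_start*n_embd_v*elt_size;
    const size_t bytes = cell_count*n_embd_v*elt_size;
    ggml_backend_tensor_set(dst, src, off, bytes);
}
```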
