
llama : save and restore kv cache for single seq id #6341

Merged — 34 commits, Apr 8, 2024

Conversation

kaetemi (Collaborator) commented Mar 27, 2024:

See #5843

This adds llama_get_seq_size, llama_copy_seq_data, and llama_set_seq_data functions to save and restore the kv cache of a single sequence id.

On the server side this adds /slot/save and /slot/restore endpoints, which save the kv cache for the given slot, along with the token cache, to a file. It also adds a /slot/erase endpoint that simply wipes the kv cache for a slot.

This works so far, but still needs test cases, and perhaps a path parameter for the server so that the functionality can be enabled while restricted to a specified folder (done). It might also be missing some special cases that I'm not aware of.

martindevans (Contributor):

Thanks very much for working on this! ❤️

Would it be worth adding a header with some metadata? E.g. a magic number (to check it's the right file type), a file format version (in case the format is ever tweaked in the future), and some kind of model check (so you can verify that the model you're about to load it into is compatible). None of this has to be used at the moment, but I've often regretted designing persistent file formats without reserving space for things like that.

Looking at the functions from the point of view of a developer trying to integrate this with an external llama.cpp wrapper (LLamaSharp), my main concern is that there's no way to load a sequence without risking a crash of the process (GGML_ASSERT). For example, if a sequence is saved with one model and then loaded with another, it would currently crash because of these checks:

GGML_ASSERT(n_layer == n_layer_ref);
GGML_ASSERT(n_embd_v_gqa == n_embd_v_gqa_ref);

Can this be modified to fail in a more graceful way? e.g. return an error code.

compilade self-requested a review March 27, 2024 15:14
kaetemi (Collaborator, Author) commented Mar 27, 2024:

For example if a sequence is saved with one model and then loaded with another it would crash at the moment because of these checks:

GGML_ASSERT(n_layer == n_layer_ref);
GGML_ASSERT(n_embd_v_gqa == n_embd_v_gqa_ref);

Can this be modified to fail in a more graceful way? e.g. return an error code.

Feel free to make the necessary adjustments to the code. :)

martindevans (Contributor):

I'll give it a go. I'm not very familiar with C++, but hopefully just adapting it to return some error codes should be easy enough!

kaetemi (Collaborator, Author) commented Mar 27, 2024:

I'll give it a go. I'm not very familiar with C++, but hopefully just adapting it to return some error codes should be easy enough!

I'm already returning 0 as the failure value for the case where there's no available space in the kv cache, so maybe just do that everywhere for the format reference value checks. It might also be useful to output some LOG_ERROR messages.

martindevans (Contributor):

@kaetemi here are my proposed changes: martindevans@62370b0

kaetemi (Collaborator, Author) commented Mar 27, 2024:

@kaetemi here are my proposed changes: martindevans@62370b0

size_t is unsigned, though; I'm not sure returning negative values works out here.

martindevans (Contributor) commented Mar 27, 2024:

Oops, I'm used to Rust with isize and usize 🤦‍♂️

I'll change all of the errors to simply return 0 as you suggested.

martindevans (Contributor):

@kaetemi here's a new proposed set of changes: martindevans@b182f8f

llama.h Outdated
Comment on lines 626 to 643
LLAMA_API size_t llama_get_seq_size(
struct llama_context * ctx,
llama_seq_id seq_id);

LLAMA_API size_t llama_copy_seq_data(
struct llama_context * ctx,
uint8_t * dst,
llama_seq_id seq_id);

// Copy the sequence data (originally copied with `llama_copy_seq_data`) into a sequence.
// Returns:
// - Positive: Ok
// - Zero: Failed to load
LLAMA_API size_t llama_set_seq_data(
struct llama_context * ctx,
const uint8_t * src,
llama_seq_id dest_seq_id);

ggerganov (Owner):

It would be useful to update the names of the state management API - either in this PR or in another. For example:

    LLAMA_API size_t llama_state_get_size(const struct llama_context * ctx);

    LLAMA_API size_t llama_state_get_data(
            struct llama_context * ctx,
                         uint8_t * dst);

    LLAMA_API size_t llama_state_set_data(
            struct llama_context * ctx,
                   const uint8_t * src);

    LLAMA_API bool llama_state_load_file(
            struct llama_context * ctx,
                      const char * path_session,
                     llama_token * tokens_out,
                          size_t   n_token_capacity,
                          size_t * n_token_count_out);

    LLAMA_API bool llama_state_save_file(
            struct llama_context * ctx,
                      const char * path_session,
               const llama_token * tokens,
                          size_t   n_token_count);

    LLAMA_API size_t llama_state_seq_get_size(
            struct llama_context * ctx,
                    llama_seq_id   seq_id);

    LLAMA_API size_t llama_state_seq_get_data(
            struct llama_context * ctx,
                         uint8_t * dst,
                    llama_seq_id   seq_id);

    LLAMA_API size_t llama_state_seq_set_data(
            struct llama_context * ctx,
                   const uint8_t * src,
                    llama_seq_id   seq_id);

kaetemi (Collaborator, Author) commented Mar 28, 2024:

Yeah, that looks a lot more orderly. I'll change that. (done)

ggerganov (Owner):
Would be useful to deprecate the old functions using the DEPRECATED macro in llama.h and update the README section "Recent API changes" to help 3rd party devs

kaetemi (Collaborator, Author) commented Mar 29, 2024:

Ok, thanks. I'll do that and update the docs (done).

phymbert (Collaborator) left a comment:

Thanks for bringing this to the server; please add related test scenarios in the server test framework.

@@ -3519,6 +3767,12 @@ int main(int argc, char ** argv) {
svr->Post("/v1/embeddings", handle_embeddings);
svr->Post("/tokenize", handle_tokenize);
svr->Post("/detokenize", handle_detokenize);
if (!sparams.slot_save_path.empty()) {
// only enable slot endpoints if slot_save_path is set
svr->Post("/slot/save", handle_slot_save);
phymbert (Collaborator) commented Mar 29, 2024:

To be RESTful-compliant, the path should be /slots/{slot-id}?action=save: verbs must not appear in the path, and resources are named in the plural form.

kaetemi (Collaborator, Author) commented Mar 29, 2024:

Alright, I'll look into adding server test cases (done), and adjust the endpoints as well since that's the preferred style (done).

kaetemi (Collaborator, Author) commented Mar 30, 2024:

Not a backdoor. Promise. 😎

phymbert (Collaborator) left a comment:

LGTM, thanks! Please wait for @ggerganov's approval.

llama.cpp Outdated
Comment on lines 15282 to 15287
if (n_layer != n_layer_ref) {
return 0;
}
if (n_embd_v_gqa != n_embd_v_gqa_ref) {
return 0;
}
Collaborator:

It would be good to print an error explaining the reason why the state cannot be set.

llama.cpp Outdated
Comment on lines 15325 to 15332
size_t k_size_row_ref;
memcpy(&k_size_row_ref, inp, sizeof(k_size_row_ref));
inp += sizeof(k_size_row_ref);
const size_t k_size_row = ggml_row_size(kv_self.k_l[il]->type, n_embd_k_gqa);
if (k_size_row != k_size_row_ref) {
llama_kv_cache_seq_rm(kv_self, dest_seq_id, -1, -1);
return 0;
}
Collaborator:

What is the purpose of saving the row size in the state? If the goal is to check that the KV data types are the same, why not export the tensor type instead?

kaetemi (Collaborator, Author) commented Mar 31, 2024:

Seems safer for parsing to keep the actual value used to calculate the data length in a binary format. (Also keeps it easier for any third-party tool to blindly parse through it.)

Collaborator:

The type size is not unique. For example, q4_0 and iq4_nl have the same size.

kaetemi (Collaborator, Author):

Sure, we can make it stricter and add the type as well. My concern here is mainly on avoiding any chance of buffer overflows by being explicit on the sizes. (I'm also considering external tools that might want to splice together or trim sequences, which could just treat the data as a black box and only need to know the data length.)

Comment on lines +15369 to +15376
llama_file file(filepath, "wb");

file.write_u32(LLAMA_STATE_SEQ_MAGIC);
file.write_u32(LLAMA_STATE_SEQ_VERSION);

// save the prompt
file.write_u32((uint32_t)n_token_count);
file.write_raw(tokens, sizeof(llama_token) * n_token_count);
Collaborator:

llama_file throws on failure, and we should avoid passing these exceptions to the user. Instead, the exceptions should be caught and an error code should be returned to the user (the same issue exists in llama_save_session_file).

kaetemi (Collaborator, Author):

(done)

kaetemi requested a review from ggerganov April 3, 2024 19:38
README.md (outdated; resolved)
ggerganov (Owner):

Thanks, let's merge after resolving the conflicts

kaetemi (Collaborator, Author) commented Apr 4, 2024:

Thanks, let's merge after resolving the conflicts

Resolved.

github-actions bot commented Apr 4, 2024:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 538 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8693.12ms p(90)=23947.26ms fails=0, finish reason: stop=538 truncated=0
  • Prompt processing (pp): avg=234.67tk/s p(90)=694.1tk/s total=207.07tk/s
  • Token generation (tg): avg=96.47tk/s p(90)=257.58tk/s total=131.15tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feature/save-restore-seq commit=bf94e9f788da5acd17c7744889f26ccc958ec914

[Charts omitted: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 538 iterations".]

examples/server/server.cpp (outdated; resolved)
ggerganov requested a review from ngxson April 4, 2024 16:05
ngxson (Collaborator) left a comment:

LGTM. Thanks!

kaetemi requested a review from ggerganov April 7, 2024 19:20
ggerganov merged commit beea6e1 into ggerganov:master Apr 8, 2024
61 of 62 checks passed
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
* llama : save and restore kv cache for single seq id

* remove trailing whitespace

* respond error in case there's no space in the kv cache

* add kv seq save restore to test case

* add --slot-save-path arg to enable save restore and restrict save location

* Returning 0 for some cases, instead of asserting.

* cleanup error cases

* rename sequence state functions

* rename state get set functions

* add previous function names back in with DEPRECATED notice

* update doc

* adjust endpoints to preferred style

* fix restoring zero cell count

* handle seq rm return value

* unused param

* keep in the size check

* fix return types

* add server test case for slot save restore

* cleanup

* add cake

* cleanup style

* add special

* removing a whole sequence never fails

* move sequence state file functionality from server to llama to match session api and add version tags

* catch exceptions on save as well

* error log messages

* check types for stricter restore

* update server doc

* readme : update API changes date

* strict filename validation

* move include, reject bom as well

* also reject empty filename

* reject whitespace and trailing dot

---------

Co-authored-by: Martin Evans <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
7 participants