
llama : save and restore kv cache for single seq id #6341

Merged — 34 commits, Apr 8, 2024

Conversation

kaetemi (Collaborator) commented Mar 27, 2024:

See #5843

This adds llama_get_seq_size, llama_copy_seq_data, and llama_set_seq_data functions to save and restore the kv cache of a single sequence id.

On the server side this adds /slot/save and /slot/restore endpoints, which save the kv cache for the given slot, along with the token cache, to a file. It also adds a /slot/erase endpoint that simply wipes the kv cache for a slot.

This works so far, but still needs test cases, and perhaps a path parameter for the server so that the functionality can be enabled while restricted to a specified folder (done). It might also be missing some special cases that I'm not aware of.

martindevans (Contributor):

Thanks very much for working on this! ❤️

Would it be worth adding a header with some metadata? E.g. a magic number (to check it's the right file type), a file format version (in case the format is ever tweaked in the future), and some kind of model check (so you can verify that the model you're about to load it into is compatible). None of this has to be used at the moment, but I've often regretted designing persistent file formats without reserving space for things like that.

Looking at the functions from the point of view of a developer trying to integrate this with an external llama.cpp wrapper (LLamaSharp), my main concern is that there's no way to load a sequence without risking a crash of the process (GGML_ASSERT). For example, if a sequence is saved with one model and then loaded with another, it would currently crash because of these checks:

GGML_ASSERT(n_layer == n_layer_ref);
GGML_ASSERT(n_embd_v_gqa == n_embd_v_gqa_ref);

Can this be modified to fail in a more graceful way? e.g. return an error code.

compilade self-requested a review March 27, 2024 15:14
kaetemi (Collaborator, Author) commented Mar 27, 2024:

For example if a sequence is saved with one model and then loaded with another it would crash at the moment because of these checks:

GGML_ASSERT(n_layer == n_layer_ref);
GGML_ASSERT(n_embd_v_gqa == n_embd_v_gqa_ref);

Can this be modified to fail in a more graceful way? e.g. return an error code.

Feel free to make the necessary adjustments to the code. :)

martindevans (Contributor):

I'll give it a go. I'm not very familiar with C++, but hopefully just adapting it to return some error codes should be easy enough!

kaetemi (Collaborator, Author) commented Mar 27, 2024:

I'll give it a go. I'm not very familiar with C++, but hopefully just adapting it to return some error codes should be easy enough!

I'm already returning 0 as the failure value for the case where there's no available space in the kv cache, so maybe just do that everywhere for the format reference value checks. It might also be useful to output some LOG_ERROR messages.

martindevans (Contributor):

@kaetemi here are my proposed changes: martindevans@62370b0

kaetemi (Collaborator, Author) commented Mar 27, 2024:

@kaetemi here are my proposed changes: martindevans@62370b0

size_t is unsigned, though; I'm not sure returning negative values works out here.

martindevans (Contributor) commented Mar 27, 2024:

Oops, I'm used to Rust with isize and usize 🤦‍♂️

I'll change all of the errors to simply return 0 as you suggested.

martindevans (Contributor):

@kaetemi here's a new proposed set of changes: martindevans@b182f8f

llama.h Outdated
Comment on lines 626 to 643
LLAMA_API size_t llama_get_seq_size(
struct llama_context * ctx,
llama_seq_id seq_id);

LLAMA_API size_t llama_copy_seq_data(
struct llama_context * ctx,
uint8_t * dst,
llama_seq_id seq_id);

// Copy the sequence data (originally copied with `llama_copy_seq_data`) into a sequence.
// Returns:
// - Positive: Ok
// - Zero: Failed to load
LLAMA_API size_t llama_set_seq_data(
struct llama_context * ctx,
const uint8_t * src,
llama_seq_id dest_seq_id);

ggerganov (Owner):

It would be useful to update the names of the state management API - either in this PR or in another. For example:

    LLAMA_API size_t llama_state_get_size(const struct llama_context * ctx);

    LLAMA_API size_t llama_state_get_data(
            struct llama_context * ctx,
                         uint8_t * dst);

    LLAMA_API size_t llama_state_set_data(
            struct llama_context * ctx,
                   const uint8_t * src);

    LLAMA_API bool llama_state_load_file(
            struct llama_context * ctx,
                      const char * path_session,
                     llama_token * tokens_out,
                          size_t   n_token_capacity,
                          size_t * n_token_count_out);

    LLAMA_API bool llama_state_save_file(
            struct llama_context * ctx,
                      const char * path_session,
               const llama_token * tokens,
                          size_t   n_token_count);

    LLAMA_API size_t llama_state_seq_get_size(
            struct llama_context * ctx,
                    llama_seq_id   seq_id);

    LLAMA_API size_t llama_state_seq_get_data(
            struct llama_context * ctx,
                         uint8_t * dst,
                    llama_seq_id   seq_id);

    LLAMA_API size_t llama_state_seq_set_data(
            struct llama_context * ctx,
                   const uint8_t * src,
                    llama_seq_id   seq_id);

kaetemi (Collaborator, Author) commented Mar 28, 2024:

Yeah, that looks a lot more orderly. I'll change that. (done)

ggerganov (Owner):
Would be useful to deprecate the old functions using the DEPRECATED macro in llama.h and update the README section "Recent API changes" to help 3rd party devs

kaetemi (Collaborator, Author) commented Mar 29, 2024:

Ok, thanks. I'll do that and update the docs (done).

phymbert (Collaborator) left a comment:

Thanks for bringing this to the server; please add related test scenarios in the server test framework.

@@ -3519,6 +3767,12 @@ int main(int argc, char ** argv) {
svr->Post("/v1/embeddings", handle_embeddings);
svr->Post("/tokenize", handle_tokenize);
svr->Post("/detokenize", handle_detokenize);
if (!sparams.slot_save_path.empty()) {
// only enable slot endpoints if slot_save_path is set
svr->Post("/slot/save", handle_slot_save);
phymbert (Collaborator) commented Mar 29, 2024:

To be RESTful-compliant, the path should be /slots/{slot-id}?action=save: verbs must not appear in the path, and resources are named in the plural form.

kaetemi (Collaborator, Author) commented Mar 29, 2024:

Alright, I'll look into adding server test cases (done), and adjust the endpoints as well since that's the preferred style (done).

kaetemi (Collaborator, Author) commented Mar 30, 2024:

Not a backdoor. Promise. 😎

phymbert (Collaborator) left a comment:

LGTM, thanks! Please wait for @ggerganov's approval.

llama.cpp Outdated
Comment on lines 15282 to 15287
if (n_layer != n_layer_ref) {
return 0;
}
if (n_embd_v_gqa != n_embd_v_gqa_ref) {
return 0;
}
Collaborator:

It would be good to print an error explaining the reason why the state cannot be set.

llama.cpp Outdated
Comment on lines 15325 to 15332
size_t k_size_row_ref;
memcpy(&k_size_row_ref, inp, sizeof(k_size_row_ref));
inp += sizeof(k_size_row_ref);
const size_t k_size_row = ggml_row_size(kv_self.k_l[il]->type, n_embd_k_gqa);
if (k_size_row != k_size_row_ref) {
llama_kv_cache_seq_rm(kv_self, dest_seq_id, -1, -1);
return 0;
}
Collaborator:

What is the purpose of saving the row size in the state? If the goal is to check that the KV data types are the same, why not export the tensor type instead?

kaetemi (Collaborator, Author) commented Mar 31, 2024:

Seems safer for parsing to keep the actual value used to calculate the data length in a binary format. (Also keeps it easier for any third-party tool to blindly parse through it.)

Collaborator:

The type size is not unique. For example, q4_0 and iq4_nl have the same size.

kaetemi (Collaborator, Author):

Sure, we can make it stricter and add the type as well. My concern here is mainly on avoiding any chance of buffer overflows by being explicit on the sizes. (I'm also considering external tools that might want to splice together or trim sequences, which could just treat the data as a black box and only need to know the data length.)

Comment on lines +15369 to +15376
llama_file file(filepath, "wb");

file.write_u32(LLAMA_STATE_SEQ_MAGIC);
file.write_u32(LLAMA_STATE_SEQ_VERSION);

// save the prompt
file.write_u32((uint32_t)n_token_count);
file.write_raw(tokens, sizeof(llama_token) * n_token_count);
Collaborator:

llama_file throws on failure, and we should avoid passing these exceptions to the user. Instead, the exceptions should be caught and an error code should be returned to the user (the same issue exists in llama_save_session_file).

kaetemi (Collaborator, Author):

(done)

kaetemi requested a review from ggerganov April 3, 2024 19:38
README.md (outdated; resolved)
ggerganov (Owner):

Thanks, let's merge after resolving the conflicts

kaetemi (Collaborator, Author) commented Apr 4, 2024:

Thanks, let's merge after resolving the conflicts

Resolved.

github-actions bot commented Apr 4, 2024:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 538 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8693.12ms p(90)=23947.26ms fails=0, finish reason: stop=538 truncated=0
  • Prompt processing (pp): avg=234.67tk/s p(90)=694.1tk/s total=207.07tk/s
  • Token generation (tg): avg=96.47tk/s p(90)=257.58tk/s total=131.15tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feature/save-restore-seq commit=bf94e9f788da5acd17c7744889f26ccc958ec914

[Charts omitted: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 538 iterations".]

examples/server/server.cpp (outdated; resolved)
ggerganov requested a review from ngxson April 4, 2024 16:05
ngxson (Collaborator) left a comment:

LGTM. Thanks!

kaetemi requested a review from ggerganov April 7, 2024 19:20
ggerganov merged commit beea6e1 into ggerganov:master Apr 8, 2024
61 of 62 checks passed
tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
* llama : save and restore kv cache for single seq id

* remove trailing whitespace

* respond error in case there's no space in the kv cache

* add kv seq save restore to test case

* add --slot-save-path arg to enable save restore and restrict save location

* Returning 0 for some cases, instead of asserting.

* cleanup error cases

* rename sequence state functions

* rename state get set functions

* add previous function names back in with DEPRECATED notice

* update doc

* adjust endpoints to preferred style

* fix restoring zero cell count

* handle seq rm return value

* unused param

* keep in the size check

* fix return types

* add server test case for slot save restore

* cleanup

* add cake

* cleanup style

* add special

* removing a whole sequence never fails

* move sequence state file functionality from server to llama to match session api and add version tags

* catch exceptions on save as well

* error log messages

* check types for stricter restore

* update server doc

* readme : update API changes date

* strict filename validation

* move include, reject bom as well

* also reject empty filename

* reject whitespace and trailing dot

---------

Co-authored-by: Martin Evans <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
7 participants