cache wrong code #34232

Closed
4 tasks
mdy666 opened this issue Oct 18, 2024 · 16 comments · Fixed by #34746
@mdy666

mdy666 commented Oct 18, 2024

System Info

Although this method is not very useful, it is slightly wrong:
[image]

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

nan

Expected behavior

fix

@mdy666 mdy666 added the bug label Oct 18, 2024
@zucchini-nlp
Member

Hey! Can you pls elaborate on what is wrong in this method? It is used when we do beam search and contrastive generation afaik

cc @gante

@mdy666
Author

mdy666 commented Oct 18, 2024

> Hey! Can you pls elaborate on what is wrong in this method? It is used when we do beam search and contrastive generation afaik
>
> cc @gante

Maybe it should be "value_cache" rather than "key_cache", but I don't know this code well.

@zucchini-nlp
Member

Oh right, didn't notice it! Yes, that needs to be fixed, and it's weird that no tests caught it. Feel free to open a PR if you are willing to 😄 and tag @gante for review. If you don't have bandwidth, we'll make sure to fix it soon.

Thanks for reporting!
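
To make the kind of mix-up discussed above concrete, here is a minimal sketch using a hypothetical `ToyCache` class. It is not the transformers implementation; the class, its method names, and the string placeholders are assumptions made for illustration only. It shows how rebuilding values from `key_cache` instead of `value_cache` silently corrupts the cache, and why a test only notices when keys and values actually differ.

```python
# Hypothetical toy cache, for illustration only (not the transformers code).
class ToyCache:
    def __init__(self, key_cache, value_cache):
        # One entry per layer; each entry holds one row per batch sample.
        self.key_cache = key_cache
        self.value_cache = value_cache

    def batch_select_buggy(self, indices):
        for layer in range(len(self.key_cache)):
            old_keys = self.key_cache[layer]
            self.key_cache[layer] = [old_keys[i] for i in indices]
            # Bug: the values are rebuilt from the keys.
            self.value_cache[layer] = [old_keys[i] for i in indices]

    def batch_select_fixed(self, indices):
        for layer in range(len(self.key_cache)):
            old_keys = self.key_cache[layer]
            old_values = self.value_cache[layer]
            self.key_cache[layer] = [old_keys[i] for i in indices]
            self.value_cache[layer] = [old_values[i] for i in indices]


def make_cache():
    # A single layer with two samples; keys and values are distinguishable.
    return ToyCache(key_cache=[["k0", "k1"]], value_cache=[["v0", "v1"]])


buggy = make_cache()
buggy.batch_select_buggy([1, 0])

fixed = make_cache()
fixed.batch_select_fixed([1, 0])

print(buggy.value_cache)  # values were overwritten with keys
print(fixed.value_cache)  # values were reordered as intended
```

If a test fills keys and values with identical placeholder data, both variants produce the same result, which is one way such a bug could go unnoticed.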

@mearcstapa-gqz

mearcstapa-gqz commented Oct 18, 2024

@zucchini-nlp Hi, I'd like to use this thread to ask a somewhat related question. I want to extend the "Re-use Cache to continue generation" tutorial (https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation) to the batched case, but the model gives erroneous output. From preliminary debugging, I suspect the default left padding is the cause: the cache positions are not aligned correctly. (I'm not sure whether the bug from this issue contributes as well.) Is there any existing code I can refer to? Thanks!

@zucchini-nlp
Member

@mearcstapa-gqz do you mean using a batched cache from the pre-fill stage in batched generation, or using one shared pre-fill prompt and continuing generation with multiple texts at once? Please share your minimal code and I'll see what the error might be, as expanding to batched generation should be straightforward unless I am missing anything.

@mearcstapa-gqz

mearcstapa-gqz commented Oct 18, 2024

@zucchini-nlp Thanks! On second look, I noticed that the example provided at https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation is indeed batched. I got it wrong when I saw "max_batch_size=1" among the StaticCache arguments. The example code uses a for-loop over the prompts.

My use case is the same as the example code. I have
texts=[f"<|im_start|>system\n{SOME_SHARED_SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{SOME_SHARED_USER_PREFIX}{query}<|im_end|>\n<|im_start|>assistant\n" for query in queries]
and I want to cache the f"<|im_start|>system\n{SOME_SHARED_SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{SOME_SHARED_USER_PREFIX}" part.

I will try to debug it myself first; if that fails, I will provide minimal code and ask for help again.

@mearcstapa-gqz

mearcstapa-gqz commented Oct 21, 2024

@zucchini-nlp
The example code uses a for-loop over the prompts. I can't figure out how to set up past_key_values so that it works like normal batch inference. Here is a minimal example.

import os
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache, StaticCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left" 
# Curiously, setting tokenizer.padding_side = "right" yields coherent(? but if I switch my model to "Qwen/Qwen2-VL-2B-Instruct", padding_side right produces gibberish also) result for both get_output(inputs) and get_output(inputs, past_key_values=copy.deepcopy(prompt_cache)). 
# But there's a warning "A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer." 
# What are the implications??
# https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side

INITIAL_PROMPT = '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n'
prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"]

inputs = tokenizer([INITIAL_PROMPT + prompt + '<|im_end|>\n<|im_start|>assistant\n' for prompt in prompts], return_tensors="pt", padding=True).to("cuda")


def get_output(inputs, past_key_values=None):
    generated_ids = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20)

    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_texts = tokenizer.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    for o in output_texts:
        print(o)

get_output(inputs) # normal batch inference


prompt_cache = DynamicCache()
inputs_initial_prompt = tokenizer([INITIAL_PROMPT] * 2, return_tensors="pt").to("cuda")

with torch.no_grad():
    prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values

get_output(inputs, past_key_values=copy.deepcopy(prompt_cache)) # incoherent output, how to set past_key_values properly?

@zucchini-nlp
Member

Hmm you're right, in case we want to do batching the padding will not be set correctly because the initial prompt has no padding while the subsequent calls will be padded on the left. So we'll end up with sequences as follows:

INITIAL_PROMPT [PAD] [PAD] [PAD] [PAD] INPUT-TEXT

I don't see an easy way to overcome this unless we start supporting nested tensors. Also cc @gante, if you have any ideas, or maybe we add this to our TODO list
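
The layout problem described above can be sketched with plain token lists. This is only an illustration: the `left_pad` helper and the word-level "tokens" are hypothetical stand-ins for what a real tokenizer would produce. It compares the layout of normal left-padded batching with the layout that results from an unpadded cached prefix followed by left-padded continuations.

```python
PAD = "[PAD]"


def left_pad(rows, pad=PAD):
    """Left-pad every row to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [[pad] * (width - len(row)) + row for row in rows]


# Hypothetical word-level tokens standing in for real token ids.
prefix = ["<sys>", "You", "are", "helpful"]
suffixes = [["Write", "a", "blog", "post"], ["Capital", "?"]]

# Normal batched inference: the full sequences are padded on the left.
full_batch = left_pad([prefix + suffix for suffix in suffixes])

# Cached prefix (no padding) followed by separately left-padded suffixes.
cached_layout = [prefix + row for row in left_pad(suffixes)]

for row in full_batch:
    print(" ".join(row))
print()
for row in cached_layout:
    print(" ".join(row))
```

For the shorter prompt, the padding ends up before the prefix in the first layout and between the prefix and the new text in the second, so the two layouts put the same tokens at different positions.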

@mearcstapa-gqz

@zucchini-nlp
May I ask why something like this won't work?

Actually, I tried to make the input look like
INITIAL_PROMPT [PAD] [PAD] [PAD] [PAD] INPUT-TEXT
with


import torch
from transformers import BatchFeature

# Assumes `processor`, `model`, and `inputs_initial_prompt` are already defined.
texts = [f"{query}<|im_end|>\n<|im_start|>assistant\n" for query in ["Help me to write a blogpost about travelling.", "What is the capital of France?"]]

inputs = processor(
    text=texts, images=None, padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)

# Prepend the unpadded initial prompt to the left-padded continuations.
inputs = BatchFeature(data={
    'input_ids': torch.concat([inputs_initial_prompt.input_ids, inputs.input_ids], -1),
    'attention_mask': torch.concat([inputs_initial_prompt.attention_mask, inputs.attention_mask], -1)
})

Is it because the attention_mask passed is actually generated inside the model?

@zucchini-nlp
Member

Please see: #25420 (comment) for why padding-side/batching matters when generating

@lzl-mt

lzl-mt commented Nov 8, 2024

I have a question: the KV cache is already computed for INITIAL_PROMPT, so why does the current input still need to include INITIAL_PROMPT? Wouldn't this lead to computing INITIAL_PROMPT twice? I am referring to
inputs_initial_prompt = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
in https://huggingface.co/docs/transformers/kv_cache#re-use-cache-to-continue-generation
Thx! @zucchini-nlp @mearcstapa-gqz

@mayankagarwals
Contributor

Hi 👋
would like to raise a PR for this issue if it's still open

@zucchini-nlp
Member

@lzl-mt it comes from the way generate() works: currently we crop the input ids and remove all previous ids that are already in the cache. So nothing will be computed twice :)

@mayankagarwals sorry, didn't see your comment. Since the person who opened the issue was off for a few weeks, I opened the fix myself in #34746
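
The cropping described above can be pictured with a small sketch. This is a toy model of the idea, not the actual generate() code: the `crop_to_uncached` helper and the integer ids are hypothetical, and it assumes the cache covers exactly the first `cache_len` positions of every row.

```python
def crop_to_uncached(input_ids, cache_len):
    """Keep, for each row, only the ids not yet covered by the cache."""
    return [row[cache_len:] for row in input_ids]


# Hypothetical ids: the first three belong to the already-cached prompt.
cached_prompt = [101, 102, 103]
full_input = [cached_prompt + [7, 8, 9]]

new_ids = crop_to_uncached(full_input, cache_len=len(cached_prompt))
print(new_ids)  # only these ids would still need a forward pass
```

Under that assumption, the prompt ids are passed in again but are dropped before the forward pass, so only the new ids are processed.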

@lzl-mt

lzl-mt commented Nov 20, 2024

> @lzl-mt it comes from the way generate() works: currently we crop the input ids and remove all previous ids that are already in the cache. So nothing will be computed twice :)
>
> @mayankagarwals sorry, didn't see your comment. Since the person who opened the issue was off for a few weeks, I opened the fix myself in #34746

Thank you for your response. But what is the benefit of doing this? The past KV cache already contains the historical information, so we could directly input only the current prompt. Why concatenate the current user's input with the historical prompt and then trim it? Consider, for example, a multi-turn conversation. Thx!

@zucchini-nlp
Member

cc @gante for that question when you come back from vacation

@gante
Member

gante commented Jan 10, 2025

Hi folks 👋

I can confirm that we do NOT support passing the cache without the corresponding input_ids at the moment. Yes, technically the cache has all KV data, but to support all sets of arguments (cache + new ids; cache + all ids; all ids) the cache would need to know which input_ids are present in it, which is not the case at the moment. The alternative would be to disallow passing cache + all ids, but we don't want to break backward compatibility.

Adding the feature is in our roadmap :)
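
The ambiguity mentioned above can be illustrated with a toy sketch. The `possible_readings` helper and the integer ids are hypothetical and only model the reasoning; they are not part of transformers. Given only a cache length and a list of input ids, both "cache + new ids" and "cache + all ids" can be valid readings of the same call.

```python
def possible_readings(cache_len, input_ids):
    """List the ways a (cache, input_ids) pair could be interpreted,
    mapping each reading to the ids that would still need processing."""
    readings = {"cache + new ids": list(input_ids)}
    if len(input_ids) > cache_len:
        # The first cache_len ids might already be in the cache.
        readings["cache + all ids"] = list(input_ids[cache_len:])
    return readings


# Hypothetical call: a cache covering 3 tokens and 5 input ids.
readings = possible_readings(cache_len=3, input_ids=[11, 12, 13, 14, 15])
for name, to_process in readings.items():
    print(name, "->", to_process)
```

Without a record of which ids the cache already holds, the two readings cannot be told apart from the lengths alone, which matches the reason given above for not yet supporting a cache passed without its corresponding input_ids.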

6 participants