-
Notifications
You must be signed in to change notification settings - Fork 27.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache wrong code #34232
Comments
Hey! Can you pls elaborate on what is wrong in this method? It is used when we do beam search and constrastive generation afaik cc @gante |
maybe it should be "value_cache" rather than "key_cache", but i don't know it well |
Oh right, didn't notice it! Yes, that needs to be fixed and weird we didn't catch any tests failing. Feel free to open a PR if you are willing to 😄 and tag @gante for review. If you don't have bandwidth, we'll make sure to fix it soon. Thanks for reporting! |
@zucchini-nlp Hi I want to use this thread to ask a somewhat related question. I want to basically to extend the "Re-use Cache to continue generation" tutorial https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation to batched case. But the model gives erroneous output. By preliminary debugging, I suspect it's because of the default left padding used, so that the cache positions are not aligned correctly. (not sure whether the bug from issue contributes as well or not ) I want to know that are there any existing code that I can refer to? Thanks! |
@mearcstapa-gqz you mean use a batched cache from pre-fill stage in batched generation or use one same pre-fill prompt but continue generate with multiple texts at once? Please share your minimal code and I'll see what might be the error, as expanding to batched generation should be straightforward unless I am missing anything |
@zucchini-nlp Thanks! On second look, I noticed that the example provided https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation My use case is the same as the example code. I have I would try to debug my self then, should it fails, I would provide a minimal code and ask for help again. |
@zucchini-nlp
|
Hmm you're right, in case we want to do batching the padding will not be set correctly because the initial prompt has no padding while the subsequent calls will be padded on the left. So we'll end up with sequences as follows:
I don't see an easy way to overcome this unless we start supporting nested tensors. Also cc @gante , if you have any ideas or maybe we add this to out TODO list |
@zucchini-nlp Acutally I tried to make the input look like
Is it because the attention_mask passed is actually generated inside the model? |
Please see: #25420 (comment) for why padding-side/batching matters when generating |
I have a question: why is the KV cache already computed for INITIAL_PROMPT, but the current input still needs to append INITIAL_PROMPT? Wouldn't this lead to calculating INIT twice? |
Hi 👋 |
@lzl-mt it is from the way @mayankagarwals sorry, didn't see your comment. Since the person who opened the issue was off for a few weeks, I opened the fix myself in #34746 |
Thank you for your response. But what is the benefit of doing this? The post KV cache already contains the historical information, so we can directly input the current prompt. Why concatenate current user‘s input with the historical prompt and then trim it? e.g., considering the multi-turn conversation. Thx! |
cc @gante for that question when you come back from vacation |
Hi folks 👋 I can confirm that we do NOT support passing the cache without the corresponding Adding the feature is in our roadmap :) |
System Info
Although this method is un-useful, but it's little wrong
Who can help?
No response
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
nan
Expected behavior
fix
The text was updated successfully, but these errors were encountered: