Sliding window + chunking input for mistral model #1524
This seems to be a duplicate of #1516. To fully support Mistral, we would need to slightly modify the KV cache so that its length on the seq_len dimension stays capped at sliding_window once the sequence grows longer than sliding_window. (We need to remove the first item.)
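The cache truncation described in this comment can be sketched roughly as follows. This is only an illustration in plain Python, not the CTranslate2 implementation: `append_to_cache` is a hypothetical helper, and the cache is modelled as a list with one entry per token position on the seq_len dimension.

```python
from typing import List


def append_to_cache(cache: List[int], new_items: List[int], sliding_window: int) -> List[int]:
    """Append new per-token entries, then keep only the last `sliding_window` entries.

    Hypothetical helper for illustration: a real KV cache holds key/value
    tensors, but the truncation along the sequence dimension is the same idea.
    """
    if sliding_window <= 0:
        raise ValueError("sliding_window must be positive")
    combined = cache + new_items
    # Slicing from the end drops the oldest entries once the cache is full.
    # When the cache is still shorter than the window, nothing is dropped.
    return combined[-sliding_window:]


# Decode six tokens one at a time with a window of four positions.
cache: List[int] = []
for position in range(6):
    cache = append_to_cache(cache, [position], sliding_window=4)

print(cache)  # [2, 3, 4, 5]
```

While the sequence is shorter than the window the cache grows normally, which is consistent with short prompts already working without this change.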
Thank you for your remark. I will update the KV cache as soon as possible. When the prompt is shorter than the sliding window, it works perfectly.
When will this be added to CTranslate2?
It lacks two things:
For reference, we do the same here in OpenNMT-py:
Hello @vince62s, thank you for your help. This is still work in progress on my side. For the sliding window and the rolling buffer cache, I am waiting for the hardware needed to test them. Additionally, I think that to fully support Mistral I also need to implement chunking for very large inputs, but I don't clearly understand the idea of chunking in this case: with an input of 10000 tokens and a window size of 4096, after the first two layers, where we can compute attention over the cache and over a chunk of size 4096, what query size should we use for the next layers? Would the query be the remaining (10000 - 2 * 4096) tokens, or the last 4096 tokens? Thank you in advance.
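To make the numbers in this question concrete, here is a small sketch of one possible reading of chunking, under two stated assumptions that do not come from this thread: the prompt is split along the token dimension using the window size as the chunk size, and a query at position i sees keys i - window + 1 through i. The helpers `chunk_ranges` and `key_range_for_chunk` are hypothetical names, and the sketch says nothing about what this PR implements.

```python
from typing import List, Tuple


def chunk_ranges(n_tokens: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, n_tokens) into consecutive half-open ranges of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, n_tokens))
        for start in range(0, n_tokens, chunk_size)
    ]


def key_range_for_chunk(start: int, end: int, window: int) -> Tuple[int, int]:
    """Half-open range of key positions needed by the queries in [start, end).

    Assumption: a query at position i attends to keys i - window + 1 .. i,
    so the earliest key needed is the one visible to the first query.
    """
    first_key = max(0, start - window + 1)
    return first_key, end


chunks = chunk_ranges(10000, 4096)
for start, end in chunks:
    keys = key_range_for_chunk(start, end, window=4096)
    print(f"queries {start}..{end - 1} ({end - start} tokens), keys {keys[0]}..{keys[1] - 1}")
```

Under these assumptions the last chunk holds the remaining 10000 - 2 * 4096 = 1808 tokens as queries, while its keys come from the cached previous positions plus the chunk itself.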
I am working on it too.
OK, fully reviewed offline with @minhthuc2502.
We could add two small things, but later:
I tested with examples/llama2/chat.py and it works perfectly fine. EDIT: we are still having some issues with long contexts; working on it. Closes #1501.
Please don't put "user questions" in PR threads; use the issues for that or, better, the forum. In the end the answer is: usage will be the same as for other models.
Mistral is similar to Llama. This PR adds a converter for Mistral, so we can run inference with Mistral without any other modification, provided the prompt length is < 4096 tokens.