MistralAttention: where is the sliding window? #29777
Comments
I think the sliding window trick is based on masking?
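(For concreteness, a minimal sketch of what such a mask could look like, purely illustrative and not the library's actual code: each query position may only attend to the previous `window` positions, itself included.)

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: key j is visible from query i
    # iff i - window < j <= i (causal AND within the window).
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, seq_len)
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=6, window=3)
# Row i has at most 3 True entries: positions i-2, i-1, i.
```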
Thanks, I see. But wouldn't this throw away any computational efficiency gains expected from using a sliding window in the first place?
I have the same question. I think the sliding window has two aspects:

1. the attention mask, which keeps each token from attending to positions outside its window;
2. the compute and memory savings from actually skipping the out-of-window positions (e.g. via a rolling KV cache).

Just my guess above.
Yes, this would throw away the gains, and it is pretty much expected, as the best way to use the sliding window is with Flash Attention. Closing as expected, feel free to discuss! 🤗
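(As a sketch of that Flash Attention route, assuming the flash-attn package, where flash_attn_func takes a window_size tuple of left/right context; exact version requirements and off-by-one semantics should be checked against the flash-attn docs.)

```python
import torch
from flash_attn import flash_attn_func  # assumes the flash-attn package is installed

batch, seq_len, n_heads, head_dim = 1, 8192, 8, 64
q = torch.randn(batch, seq_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# window_size=(left, right): each query sees at most `left` earlier tokens
# (plus itself) and `right` later ones. The kernel skips out-of-window
# blocks entirely, so the efficiency gains are real, unlike post-hoc masking.
out = flash_attn_func(q, k, v, causal=True, window_size=(4095, 0))
```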
Hi @ArthurZucker, interesting - so the non-flash path would still produce correct sliding-window results, as long as the mask encodes the window?
It should if the mask is correctly passed, yeah. The new sdpa codepath has the sliding window handled as well.
@ArthurZucker Were you referring to this PR, pytorch/pytorch#114823? It does not use the sliding_window param explicitly, but it can handle the sliding window mask in the sdpa function. Am I right?
Sorry, I meant the new SDPA codepath in transformers, which is not merged yet. Yes, as you say, it handles the mask.
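(To make the mask-based route concrete: a sliding window can be expressed purely as a boolean mask passed to PyTorch's scaled_dot_product_attention. A sketch follows; note this still builds a full seq_len x seq_len mask, so it recovers correctness, not the compute savings.)

```python
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, head_dim, window = 1, 8, 16, 64, 4
q = torch.randn(batch, n_heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# Boolean mask, True = position may be attended to; broadcasts over batch/heads.
i = torch.arange(seq_len).unsqueeze(1)  # query index
j = torch.arange(seq_len).unsqueeze(0)  # key index
sliding_causal = (j <= i) & (j > i - window)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=sliding_causal)
```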
Hi,

I'm trying to understand the implementation of Mistral's attention in MistralAttention: https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L195
It is my understanding that it should always be using local window attention. In MistralFlashAttention2 this is very obvious, with config.sliding_window being used. However, I'm not sure where the sliding window is used in the base MistralAttention without flash attention: the forward pass simply computes attention over all positions, which I understand as full self-attention.
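(Roughly, the eager path boils down to a dense product over all positions; a simplified sketch of the shape of that computation, not the exact file contents:)

```python
import math
import torch

def eager_attention(q, k, v, attention_mask=None):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    # Full QK^T over every pair of positions: O(seq_len^2) compute and
    # memory regardless of the window, so any sliding window can only
    # show up through `attention_mask` (additive, -inf outside the window).
    attn_weights = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(q.size(-1))
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = torch.softmax(attn_weights, dim=-1)
    return torch.matmul(attn_weights, v)
```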
Is the sliding window only used when running with Flash Attention, or am I missing something?
Thanks!