MistralAttention: where is the sliding window #29777

Closed
fteufel opened this issue Mar 21, 2024 · 9 comments

Comments

fteufel (Contributor) commented Mar 21, 2024

Hi,

I'm trying to understand the implementation of Mistral's attention in MistralAttention.
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L195
My understanding is that it should always use local (sliding-window) attention. In MistralFlashAttention2 this is explicit, since config.sliding_window is used there.

However, I'm not sure where the sliding window is used in the base MistralAttention without flash attention:

```python
class MistralAttention(nn.Module):
    """
    Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
    and "Generating Long Sequences with Sparse Transformers".
    """
```

but the forward pass simply reads

```python
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
```

which I understand as full self-attention.
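
For concreteness, here is a minimal sketch (not the transformers implementation) of the additive band mask a sliding window would impose on these scores; this eager path would only respect the window if the attention_mask passed in already encodes it:

```python
import torch

# Illustrative only: an additive causal mask restricted to a sliding window.
# Position i may attend to position j only if j <= i and i - j < window.
def sliding_window_causal_mask(seq_len: int, window: int, dtype=torch.float32):
    idx = torch.arange(seq_len)
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    mask = torch.zeros(seq_len, seq_len, dtype=dtype)
    mask.masked_fill_(~allowed, torch.finfo(dtype).min)
    return mask  # would be added to attn_weights before the softmax

print(sliding_window_causal_mask(5, window=3))
```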

Is the sliding window only used when running with Flash Attention, or am I missing something?
Thanks!

amyeroberts (Collaborator)

cc @ArthurZucker @younesbelkada

PenutChen (Contributor) commented Mar 22, 2024

fteufel (Contributor, Author) commented Mar 22, 2024

Thanks, I see. But wouldn't this throw away any computational efficiency gains expected from using a sliding window in the first place?

PenutChen (Contributor)

I have the same question. I think the sliding window has two aspects:

  1. From the attention-mask perspective, it acts as a token-level sliding window that limits each token's view of the context.
  2. From the KV-cache perspective, truncating the cache to the window can improve computational efficiency.

This is just my guess; a rough sketch of the second point is below.
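
A purely hypothetical illustration of the KV-cache point (the function name and shapes are made up for the example, not the transformers cache API):

```python
import torch

# Hypothetical sketch: during generation, keep only the last `window` positions
# of the key/value cache, so per-step attention cost stays bounded by the window
# instead of growing with the full sequence length.
def append_and_trim_kv(k_cache, v_cache, k_new, v_new, window: int):
    # caches: (batch, heads, seq, head_dim); k_new/v_new: one new decoding step
    k_cache = torch.cat([k_cache, k_new], dim=2)[:, :, -window:]
    v_cache = torch.cat([v_cache, v_new], dim=2)[:, :, -window:]
    return k_cache, v_cache
```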

ArthurZucker (Collaborator) commented Mar 25, 2024

Yes, this throws away the gains, and that is expected: the best way to use sliding_window is through the SDPA or flash-attention API, unless a rotating buffer is used.
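
For intuition, a rotating (ring) buffer for a sliding-window KV cache could look like the following; the class and its layout are purely illustrative, not the transformers cache API:

```python
import torch

# Illustrative rotating (ring) buffer for a sliding-window KV cache; the name
# and layout are hypothetical, not the transformers cache API.
class RotatingKVBuffer:
    def __init__(self, batch, heads, window, head_dim, dtype=torch.float32):
        self.window = window
        self.pos = 0  # total tokens written so far
        self.k = torch.zeros(batch, heads, window, head_dim, dtype=dtype)
        self.v = torch.zeros(batch, heads, window, head_dim, dtype=dtype)

    def append(self, k_new, v_new):
        # k_new/v_new: (batch, heads, 1, head_dim) for one decoding step
        slot = self.pos % self.window
        self.k[:, :, slot] = k_new[:, :, 0]
        self.v[:, :, slot] = v_new[:, :, 0]
        self.pos += 1
        n = min(self.pos, self.window)
        # Once full, entries come back in ring order; this is fine for the
        # softmax as long as positional information (e.g. RoPE) was applied
        # to the keys before they were cached.
        return self.k[:, :, :n], self.v[:, :, :n]
```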

Closing as expected, feel free to discuss! 🤗

fteufel (Contributor, Author) commented Mar 25, 2024

Hi @ArthurZucker, interesting. So SDPA actually exploits the local-window structure of the attention mask in the backend?

ArthurZucker (Collaborator)

It should, if the mask is passed correctly. The new SDPA path has a sliding_window argument anyway. I'm not sure the mask was prepared correctly before; the important PR is #29407.

ehuaa (Contributor) commented Mar 27, 2024

> It should, if the mask is passed correctly. The new SDPA path has a sliding_window argument anyway. I'm not sure the mask was prepared correctly before; the important PR is #29407.

@ArthurZucker Did you mean this PR, pytorch/pytorch#114823? It does not use a sliding_window parameter explicitly, but it can handle the sliding-window mask inside the SDPA function, am I right?
So if we pass the right mask through _prepare_4d_causal_attention_mask_for_sdpa, as you mentioned in #29407, we can use the local-window feature of Mistral. But I think we could still gain some computational efficiency from local attention even without the rotating buffer, because of the sparsity of the sliding-window attention mask.
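
As a sketch of that idea (illustrative only, not the transformers code path), one can hand a boolean sliding-window causal mask directly to torch.nn.functional.scaled_dot_product_attention; whether a given backend actually skips the masked-out blocks, and thus realizes the sparsity savings, depends on the kernel:

```python
import torch
import torch.nn.functional as F

# Sketch: SDPA with a boolean sliding-window causal mask (True = may attend).
def sdpa_with_sliding_window(q, k, v, window: int):
    seq = q.shape[-2]
    idx = torch.arange(seq, device=q.device)
    allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)

q = k = v = torch.randn(1, 8, 16, 64)
print(sdpa_with_sliding_window(q, k, v, window=4).shape)  # torch.Size([1, 8, 16, 64])
```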

ArthurZucker (Collaborator)

Sorry, I meant the new SDPA code path in transformers, which is not merged yet. Yes, as you say, it handles the mask.
