[Draft] support qk head_dim different from vo head_dim #980
Support query/key head_dim different from value head_dim; fixes issue #753 and issue #952.
Recently, DeepSeek-V2 proposed a new attention mechanism called MLA (Multi-head Latent Attention), which uses low-rank key-value joint compression to eliminate the inference-time key-value cache bottleneck and thereby supports efficient inference. MLA uses query/key head_dim=192 and value head_dim=128, but FlashAttention does not support this combination. Although it can be worked around by padding the value head_dim from 128 to 192, doing so increases global memory usage and hurts performance.
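For illustration, a minimal sketch of the padding workaround described above, assuming the standard `flash_attn_func` interface with `(batch, seqlen, nheads, head_dim)` tensors; the shapes, dtype, and `causal` flag here are only illustrative:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

batch, seqlen, nheads = 2, 1024, 16
q = torch.randn(batch, seqlen, nheads, 192, dtype=torch.float16, device="cuda")
k = torch.randn(batch, seqlen, nheads, 192, dtype=torch.float16, device="cuda")
v = torch.randn(batch, seqlen, nheads, 128, dtype=torch.float16, device="cuda")

# Workaround without this PR: pad value head_dim 128 -> 192 so all head dims match,
# then drop the padded channels from the output. The padding wastes global memory
# and bandwidth.
v_padded = F.pad(v, (0, 192 - 128))          # zero-pad the last dimension
out_padded = flash_attn_func(q, k, v_padded, causal=True)
out = out_padded[..., :128]                  # discard the padded channels
```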
To expand the versatility of FlashAttention, I modified the code to support this capability. To keep compilation time reasonable, only this one combination is added; other combinations can be implemented by users as needed.
Compared with padding the value head_dim from 128 to 192, using query/key head_dim=192 and value head_dim=128 directly saves global memory and improves performance (the forward pass speeds up by about 15%, the backward pass by about 5%).
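With this patch applied, the padded call can be replaced by a direct call with mismatched head dims; a hedged sketch under the same assumptions as the example above:

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads = 2, 1024, 16
q = torch.randn(batch, seqlen, nheads, 192, dtype=torch.float16, device="cuda")
k = torch.randn(batch, seqlen, nheads, 192, dtype=torch.float16, device="cuda")
v = torch.randn(batch, seqlen, nheads, 128, dtype=torch.float16, device="cuda")

# No padding or output slicing needed: the output inherits value head_dim=128.
out = flash_attn_func(q, k, v, causal=True)
assert out.shape == (batch, seqlen, nheads, 128)
```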