[Draft] support qk head_dim different from vo head_dim #980
Support query/key head_dim different from value head_dim; fixes issue #753 and issue #952.
Recently, DeepSeek-V2 proposed a new attention mechanism called MLA (Multi-head Latent Attention), which uses low-rank key-value joint compression to eliminate the inference-time key-value cache bottleneck and thereby supports efficient inference. MLA uses query/key head_dim=192 and value head_dim=128, but FlashAttention does not support this combination. Although it can be worked around by padding the value head_dim from 128 to 192, doing so increases global memory usage and hurts performance.
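For illustration, a minimal sketch of the padding workaround described above, assuming the standard `flash_attn_func` interface with `(batch, seqlen, nheads, head_dim)` tensors; the shapes, dtype, and `causal` flag here are only illustrative:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

batch, seqlen, nheads = 2, 1024, 16
q = torch.randn(batch, seqlen, nheads, 192, dtype=torch.float16, device="cuda")
k = torch.randn(batch, seqlen, nheads, 192, dtype=torch.float16, device="cuda")
v = torch.randn(batch, seqlen, nheads, 128, dtype=torch.float16, device="cuda")

# Workaround without this PR: pad value head_dim 128 -> 192 so all head dims match,
# then drop the padded channels from the output. The padding wastes global memory
# and bandwidth.
v_padded = F.pad(v, (0, 192 - 128))          # zero-pad the last dimension
out_padded = flash_attn_func(q, k, v_padded, causal=True)
out = out_padded[..., :128]                  # discard the padded channels
```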
To expand the versatility of FlashAttention, I modified the code to support this capability. To keep compilation time reasonable, only this one combination is added; other combinations can be implemented by users as needed.
Compared with padding the value head_dim from 128 to 192, using query/key head_dim=192 and value head_dim=128 directly saves global memory and improves performance (the forward pass speeds up by about 15%, the backward pass by about 5%).
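With this patch applied, the padded call can be replaced by a direct call with mismatched head dims; a hedged sketch under the same assumptions as the example above:

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads = 2, 1024, 16
q = torch.randn(batch, seqlen, nheads, 192, dtype=torch.float16, device="cuda")
k = torch.randn(batch, seqlen, nheads, 192, dtype=torch.float16, device="cuda")
v = torch.randn(batch, seqlen, nheads, 128, dtype=torch.float16, device="cuda")

# No padding or output slicing needed: the output inherits value head_dim=128.
out = flash_attn_func(q, k, v, causal=True)
assert out.shape == (batch, seqlen, nheads, 128)
```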