Hi, I found that the latest flash3 only supports head sizes of 64 or 128. Are you planning to include more?

Torch's SDPA doesn't require V to have the same head dimension as the other inputs; its docs even give them distinct dimensions, E and Ev, because by the time V is multiplied in, the head dimension has already been contracted away and only an L x L attention matrix remains.
In [23]: qk=torch.randn(4, 4, 4, 8).bfloat16().cuda()
In [24]: v=torch.randn(4, 4, 4, 16).bfloat16().cuda()
In [25]: F.scaled_dot_product_attention(qk, qk, v).shape
Out[25]: torch.Size([4, 4, 4, 16])
The same holds for xformers: their docs use K and Kv. However, flash attention 2 (2.4.2) requires the head dimensions to match (as documented, all tensors must have the same headdim per head; the error message uses a different name than the documentation).

Can this be relaxed to allow a different head_size for V, or does the implementation depend on the head dimensions matching?

Maintainer's reply:

While it's theoretically possible, we don't plan to do that. The reason is that we're already templating on the head dimension (32, 64, 96, 128, 160, 192, 224, 256). If V could have a different head dimension we'd need to increase the number of templates by 8x, and compilation time would increase by 8x.
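To make the combinatorics in that reply concrete, here is a small illustrative snippet (plain Python, not code from the flash-attention repository; the names HEAD_DIMS, same_dim and mixed_dim are invented for the illustration):

HEAD_DIMS = [32, 64, 96, 128, 160, 192, 224, 256]

# One kernel template per supported head dimension today.
same_dim = [(d, d) for d in HEAD_DIMS]
# If V could have its own head dimension, every (qk_headdim, v_headdim)
# pair would need its own instantiation.
mixed_dim = [(dq, dv) for dq in HEAD_DIMS for dv in HEAD_DIMS]

print(len(same_dim), len(mixed_dim))  # 8 64 -> 8x more templates to compile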
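Not discussed in this thread, but as a hedged aside: when a kernel insists on matching head dimensions, a smaller V head dimension can often be emulated by zero-padding V up to the Q/K head dimension and slicing the output, since the padded columns only ever accumulate zeros. A minimal sketch follows; it uses PyTorch's scaled_dot_product_attention as a stand-in for such a kernel, and the helper name attention_with_narrow_v is made up for the example:

import torch
import torch.nn.functional as F

def attention_with_narrow_v(q, k, v, attn_fn=F.scaled_dot_product_attention):
    # q, k: (..., L, d); v: (..., L, dv) with dv <= d.
    # attn_fn stands in for an attention kernel that requires matching head dims.
    d, dv = q.shape[-1], v.shape[-1]
    v_padded = F.pad(v, (0, d - dv))   # zero-pad the head (last) dimension
    out = attn_fn(q, k, v_padded)      # output has head dimension d
    return out[..., :dv]               # the padded columns are exactly zero

q = torch.randn(4, 4, 4, 8)
k = torch.randn(4, 4, 4, 8)
v = torch.randn(4, 4, 4, 6)
print(attention_with_narrow_v(q, k, v).shape)  # torch.Size([4, 4, 4, 6])

Of course this pays the compute and memory cost of the full head dimension, which is exactly what native support for a smaller Ev would avoid.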