Hey
Following the addition of `torch.nn.functional.scaled_dot_product_attention` (#26572), there is a lot of duplicated code between the `xxxAttention`, `xxxFlashAttention2` and `xxxSdpaAttention` classes. The main difference between the classes lies in the attention computation itself; the rest (Q, K, V computation, cross-attention and cache logic, etc.) is the same.
Wouldn't it be simpler to offload the attention computation to a new shared file? That would keep the modeling files cleaner, simplify the use of these optimizations for older models, and ease the addition of new attention variants in the future, if there are any.
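For illustration, here's a minimal sketch of what such a shared module could look like. Everything in it (the module name, function names, and the registry) is a hypothetical sketch of the proposal, not an existing transformers API:

```python
# hypothetical shared module, e.g. attention_utils.py (name is an assumption)
import torch
import torch.nn.functional as F


def eager_attention(query, key, value, attention_mask=None, dropout_p=0.0):
    # plain matmul + softmax attention, as currently duplicated in each xxxAttention class
    scale = query.size(-1) ** -0.5
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) * scale
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = F.dropout(attn_weights, p=dropout_p)
    return torch.matmul(attn_weights, value)


def sdpa_attention(query, key, value, attention_mask=None, dropout_p=0.0):
    # delegates to the fused PyTorch kernel added in #26572
    return F.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, dropout_p=dropout_p
    )


# registry the modeling files would look implementations up in (hypothetical)
ATTENTION_FUNCTIONS = {
    "eager": eager_attention,
    "sdpa": sdpa_attention,
    # "flash_attention_2": ...,  # would wrap the flash-attn kernel the same way
}
```

A single `xxxAttention` class per model could then keep the Q, K, V projections and cache handling, and swap only the score computation via something like `ATTENTION_FUNCTIONS[attn_implementation](query, key, value, ...)`, instead of three near-identical subclasses per model.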
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.