Hey
Following the addition of `torch.nn.functional.scaled_dot_product_attention` (#26572), there is a lot of duplicated code between the `xxxAttention`, `xxxFlashAttention2` and `xxxSdpaAttention` classes. The main difference between the classes lies in the attention computation itself; the rest (Q, K, V computation, cross-attention and cache logic, etc.) is the same.
Wouldn't it be simpler to offload the attention computation to a new shared file? That would keep the modeling files cleaner, simplify the use of these optimizations for older models, and ease the addition of new attention variants in the future, if there are any.
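For illustration, here's a minimal sketch of what such a shared module could look like. Everything in it (the module name, function names, and the registry) is a hypothetical sketch of the proposal, not an existing transformers API:

```python
# hypothetical shared module, e.g. attention_utils.py (name is an assumption)
import torch
import torch.nn.functional as F


def eager_attention(query, key, value, attention_mask=None, dropout_p=0.0):
    # plain matmul + softmax attention, as currently duplicated in each xxxAttention class
    scale = query.size(-1) ** -0.5
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) * scale
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = F.dropout(attn_weights, p=dropout_p)
    return torch.matmul(attn_weights, value)


def sdpa_attention(query, key, value, attention_mask=None, dropout_p=0.0):
    # delegates to the fused PyTorch kernel added in #26572
    return F.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, dropout_p=dropout_p
    )


# registry the modeling files would look implementations up in (hypothetical)
ATTENTION_FUNCTIONS = {
    "eager": eager_attention,
    "sdpa": sdpa_attention,
    # "flash_attention_2": ...,  # would wrap the flash-attn kernel the same way
}
```

A single `xxxAttention` class per model could then keep the Q, K, V projections and cache handling, and swap only the score computation via something like `ATTENTION_FUNCTIONS[attn_implementation](query, key, value, ...)`, instead of three near-identical subclasses per model.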
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.