
Add FA2 and sdpa support for SigLIP #31499

Merged: 35 commits merged into huggingface:main on Jul 8, 2024
Conversation

@qubvel (Member) commented Jun 19, 2024

What does this PR do?

Add Flash Attention 2 and SDPA (torch.nn.functional.scaled_dot_product_attention) attention implementations for the SigLIP model.

Fixes #31138
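For context, a minimal usage sketch (an illustration based on the standard transformers loading API; the checkpoint name and device handling here are assumptions, not part of this PR):

import torch
from transformers import SiglipModel

# Select the attention backend via `attn_implementation`: "eager", "sdpa", or "flash_attention_2".
# Flash Attention 2 requires a CUDA device, the flash-attn package, and half precision.
model = SiglipModel.from_pretrained(
    "google/siglip-so400m-patch14-384",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
).to("cuda")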

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qubvel qubvel marked this pull request as draft June 19, 2024 16:56
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) commented:

Nice! We can probably combine this with CLIP cc @sayakpaul

For reference, there's an FA2 SigLIP implementation for IDEFICS2, but I'm not sure how much testing was done on the equivalence between the eager and FA2 classes.

@qubvel (Member, Author) commented Jun 20, 2024

@amyeroberts there are some discrepancies with the attention mask; I am digging deeper into the equivalence testing.

@qubvel qubvel changed the title Add FA2 support for SigLIP Add FA2 and sdpa support for SigLIP Jun 20, 2024
@qubvel qubvel force-pushed the siglip-fa2-support branch from c593ac3 to a12367b Compare June 24, 2024 13:37
@qubvel (Member, Author) commented Jun 24, 2024

Running the tests locally:

Flash Attention

RUN_SLOW=1 python -m pytest --verbose -m flash_attn_test \
    tests/models/siglip/test_modeling_siglip.py
[Screenshot: Flash Attention test results, 2024-06-24]

SDPA

RUN_SLOW=1 python -m pytest --verbose \
    tests/models/siglip/test_modeling_siglip.py -k "sdpa"
[Screenshot: SDPA test results, 2024-06-24]

@qubvel qubvel marked this pull request as ready for review June 24, 2024 14:55
@qubvel (Member, Author) commented Jun 24, 2024

@amyeroberts @molbap please review if you have time

@amyeroberts (Collaborator) left a comment:

Thanks for adding this!

  • Main comment is about the attention selection: we should instead be instantiating the model components with `from_config` (or possibly `_from_config`?) and passing in `attn_implementation=config._attn_implementation`.
  • Could we extend this to CLIP and add both at the same time?
  • There should be a section added to the model doc page + benchmarks showing some example timings/speedups for SDPA and FA2, e.g. like here for Mistral.

@@ -543,6 +825,33 @@ def _init_weights(self, module):
module.bias.data.zero_()
module.weight.data.fill_(1.0)

@classmethod
def _autoset_attn_implementation(
Collaborator comment:

c.f. #31203 (comment)

I let this slip in for IDEFICS2, but it never should have been included

@@ -55,6 +67,178 @@
from transformers import SiglipProcessor


class SiglipModelTesterMixin(ModelTesterMixin):
Collaborator comment:

cc @ydshieh for comments / opinions on this mixin structure within the model's testing file

@ydshieh (Collaborator) commented Jun 25, 2024:

IIRC, it's just to override test_eager_matches_sdpa_inference (which is a large block) in ModelTesterMixin.

So we don't really need this new class. However, there are 3 or more model test classes in this file, so it's nice to have SiglipModelTesterMixin and just override with a mini block like:

    def test_eager_matches_sdpa_inference(self, torch_dtype: str):
        super().test_eager_matches_sdpa_inference(
            torch_dtype=torch_dtype,
            logit_keys=("pooler_output", "last_hidden_state"),
            use_attention_mask_options=(False,),
        )

@require_torch_gpu
@mark.flash_attn_test
@slow
@is_flaky()
Collaborator comment:

Is it flaky for siglip? Do we know why? I know we have this decorator for the common tests, but not for the model-specific implementations

@qubvel (Member, Author) replied:

I didn't notice it being flaky for SigLIP; however, I don't know how much of that is hardware / CUDA / FA2 version specific, so I decided to also mark it flaky, as in the initial common implementation. I will remove it to make it consistent with the other model-specific tests.

@ydshieh ydshieh self-assigned this Jun 24, 2024
@qubvel (Member, Author) commented Jun 25, 2024

@amyeroberts

Main comment is about the attention selection: we should instead be instantiating the model components with `from_config` (or possibly `_from_config`?) and passing in `attn_implementation=config._attn_implementation`.

I tried to make it work with _from_config; however, the internal model components do not inherit from PreTrainedModel, they are just plain nn.Module and don't have a _from_config method. I changed the inheritance, but that led to other issues: the models then have to be included in the docs and some fields have to be specified.

I'm also not sure we can change the internal model components, for example change SiglipVisionTransformer(nn.Module) to SiglipVisionModel(SiglipPreTrainedModel). This would lead to incompatible checkpoints.

See implementation 669c537

Do you have any thoughts on that?

P.S. I am looking at #30390 with similar questions 👀

Could we extend this to CLIP and add both at the same time?

I hope we can, I will check this!

P.S. Given the work done in #30390, it will probably be better to continue that work rather than merge both PRs.

There should be a section added to the model doc page + benchmarks showing some example timings/speedups for SDPA and FA2, e.g. like here for Mistral.

Addressed in d41955d

[Speedup plot: FA2 / SDPA vs. eager inference time]

@sayakpaul (Member) commented:

Regarding the CLIP part: #30390. Feel free to cherry-pick commits if that helps. I got stuck on some Flax tests that I never had time to resolve.

@ydshieh (Collaborator) commented Jun 25, 2024

I tried to make it work with _from_config; however, the internal model components do not inherit from PreTrainedModel, they are just plain nn.Module and don't have a _from_config method

Hi, regarding this part: even if an internal component is just an nn.Module, we can still pass the config to its __init__, just like what is done in GemmaDecoderLayer. Hope this helps.

class GemmaDecoderLayer(nn.Module):
    def __init__(self, config: GemmaConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = GEMMA_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)


atols = {
("cpu", False, torch.float32): 1e-5,
("cpu", False, torch.bfloat16): 3e-2,
@qubvel (Member, Author):

The maximum allowed diff for bfloat16 is increased from 1e-2 to 3e-2 compared to the common model test.

@qubvel qubvel force-pushed the siglip-fa2-support branch from baf5b7b to 23457a2 Compare June 26, 2024 10:07
@qubvel (Member, Author) commented Jun 26, 2024

I changed the attention implementation propagation as follows (commit 23457a2):

  1. Initialize the *Model class instead of the *Transformer module to utilize the _from_config method.
  2. Use only the submodule of the *Model to maintain the overall model structure and weights for backward compatibility.
# First, initialize the text and vision models with proper attention implementation
text_model = SiglipTextModel._from_config(text_config, attn_implementation=config._attn_implementation)
vision_model = SiglipVisionModel._from_config(vision_config, attn_implementation=config._attn_implementation)

# Second, get the text and vision submodules (for backward compatibility)
self.text_model = text_model.text_model
self.vision_model = vision_model.vision_model

With this approach, the underlying modules will exhibit the same behavior for the attn_implementation setting.

The disadvantage of this method is that the post_init() method is called twice: once for each *Model and again for the parent model.

@amyeroberts please let me know what you think.

@molbap (Contributor) left a comment:

Did a small review before heading off, so posting it now: mostly typos/minor suggestions :)

@amyeroberts (Collaborator) left a comment:

🔥 🔥 🔥 🔥 🔥 🔥

Thanks for adding and iterating on this! Re the double call to post_init: it's not ideal, but I think it should be fine, as on the second call all the layers should already be marked as initialized.

@@ -786,7 +1069,7 @@ def forward(

# note: SigLIP's text model does not use a causal mask, unlike the original CLIP model.
# expand attention_mask
if attention_mask is not None:
if attention_mask is not None and not self._use_flash_attention_2:
Collaborator comment:

Do we need special attention_mask preparation for the SDPA case?

@qubvel (Member, Author) replied:

I guess we don't need it. I additionally tested with the following code: with the same attention mask preparation, the outputs matched for both the eager and SDPA implementations.

import torch

from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask

from transformers import SiglipConfig
from transformers.models.siglip.modeling_siglip import SiglipAttention, SiglipSdpaAttention

torch.manual_seed(235093093)

dtype = torch.float16
device = "cuda"

# Configure
config = SiglipConfig()

hidden_size = 6
num_attention_heads = 1
seq_len = 5
batch_size = 1

config.vision_config.hidden_size = hidden_size
config.vision_config.num_attention_heads = num_attention_heads

# Eager attention
attention = SiglipAttention(config.vision_config)
attention = attention.to(dtype).to(device)

# SDPA attention
attention_sdpa = SiglipSdpaAttention(config.vision_config)
attention_sdpa.load_state_dict(attention.state_dict())
attention_sdpa = attention_sdpa.to(dtype).to(device)

# Prepare inputs
dummy_input = torch.rand(
    [batch_size, seq_len, hidden_size], dtype=dtype, device=device,
)
dummy_attention_mask = torch.ones(
    [batch_size, seq_len], dtype=dtype, device=device,
)

# padding
dummy_attention_mask[:1, -2:] = 0
print("Dummy attention mask:\n", dummy_attention_mask)

# Prepare attention mask
dummy_attention_mask_eager = _prepare_4d_attention_mask(
    dummy_attention_mask, dummy_input.dtype
)  # shape: (batch_size, 1, seq_len, seq_len) = (1, 1, 5, 5)

# the same for SDPA
dummy_attention_mask_sdpa = dummy_attention_mask_eager

with torch.no_grad():
    attn_output, attn_weights = attention(dummy_input, dummy_attention_mask_eager)
    attn_output_sdpa, attn_weights_sdpa = attention_sdpa(dummy_input, dummy_attention_mask_sdpa)

print("\nEager:\n", attn_output)
print("\nSDPA:\n", attn_output_sdpa)
print("\nDiff:\n", attn_output - attn_output_sdpa)

diff_with_sdpa = torch.abs(attn_output - attn_output_sdpa).max()
print("\nDiff with SDPA:", diff_with_sdpa)
Output:

Dummy attention mask:
 tensor([[1., 1., 1., 0., 0.]], device='cuda:0', dtype=torch.float16)

Eager:
 tensor([[[ 0.0267,  0.3291,  0.3442,  0.6152, -0.1914,  0.1541],
         [ 0.0174,  0.3289,  0.3398,  0.6138, -0.1968,  0.1560],
         [ 0.0283,  0.3301,  0.3447,  0.6162, -0.1914,  0.1542],
         [ 0.0176,  0.3289,  0.3401,  0.6138, -0.1968,  0.1559],
         [ 0.0228,  0.3293,  0.3423,  0.6147, -0.1941,  0.1550]]],
       device='cuda:0', dtype=torch.float16)

SDPA:
 tensor([[[ 0.0267,  0.3291,  0.3442,  0.6152, -0.1914,  0.1541],
         [ 0.0174,  0.3289,  0.3398,  0.6138, -0.1968,  0.1560],
         [ 0.0283,  0.3301,  0.3447,  0.6162, -0.1914,  0.1542],
         [ 0.0176,  0.3289,  0.3401,  0.6138, -0.1968,  0.1559],
         [ 0.0228,  0.3293,  0.3423,  0.6147, -0.1941,  0.1550]]],
       device='cuda:0', dtype=torch.float16)

Diff:
 tensor([[[0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0.]]], device='cuda:0', dtype=torch.float16)

Diff with SDPA: tensor(0., device='cuda:0', dtype=torch.float16)


## Expected speedups

Below is an expected speedup diagram that compares inference time between the native implementation in transformers using the `google/siglip-so400m-patch14-384` checkpoint in `float16` precision and the Flash Attention 2 / SDPA versions of the model, using different batch sizes.
Collaborator comment:

❤️
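A minimal sketch of how such a timing comparison could be measured (this is not the script used to produce the plot above; the batch sizes, iteration count, and use of random pixel values are illustrative assumptions):

import time
import torch
from transformers import SiglipVisionModel

def time_vision_tower(attn_implementation: str, batch_size: int, n_iters: int = 20) -> float:
    # Load only the vision tower of the SigLIP checkpoint with the requested attention backend.
    model = SiglipVisionModel.from_pretrained(
        "google/siglip-so400m-patch14-384",
        attn_implementation=attn_implementation,
        torch_dtype=torch.float16,
    ).to("cuda").eval()
    # Random images at the checkpoint's native 384x384 resolution.
    pixel_values = torch.randn(batch_size, 3, 384, 384, dtype=torch.float16, device="cuda")
    with torch.no_grad():
        model(pixel_values=pixel_values)  # warmup
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(pixel_values=pixel_values)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

for batch_size in (1, 4, 16):
    eager = time_vision_tower("eager", batch_size)
    sdpa = time_vision_tower("sdpa", batch_size)
    print(f"batch={batch_size}: eager {eager * 1e3:.1f} ms, sdpa {sdpa * 1e3:.1f} ms, speedup {eager / sdpa:.2f}x")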

@qubvel qubvel merged commit a177821 into huggingface:main Jul 8, 2024
23 checks passed
@lucasjinreal commented:

Thanks for the work!

Which version on PyPI will support this feature?

@qubvel (Member, Author) commented Jul 8, 2024

Hi @lucasjinreal, it's going to be included in the next release, most probably 4.43.0. You can try it now by installing transformers from source:

pip install -U git+https://github.com/huggingface/transformers.git

@lucasjinreal commented:

I will wait for 4.43; my server is unable to access the network. Is there an estimated time for when 4.43 will be out?

@amyeroberts (Collaborator) commented Jul 8, 2024

@lucasjinreal Just to confirm, you can't access GitHub from your server? If it's just a matter of internet access, you would also need it to install the latest release from PyPI.

We typically release on a monthly schedule. You can see the list of releases here. The next minor release will probably be in 2-3 weeks.

@lucasjinreal commented:

Thanks, I will try updating in a few weeks.

Successfully merging this pull request may close these issues.

Add siglip flashattention support?
7 participants