[GPTNeoX] Flex Attention + Refactor #34896
Conversation
A collection of comments which partially show the issues I listed above
@slow
def test_lm_generate_flex_attn_gptneox(self):
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m-deduped")
    for checkpointing in [True, False]:
        model = GPTNeoXForCausalLM.from_pretrained(
            "EleutherAI/pythia-410m-deduped", attn_implementation="flex_attention"
        )

        if checkpointing:
            model.gradient_checkpointing_enable()
        else:
            model.gradient_checkpointing_disable()
        model.to(torch_device)

        inputs = tokenizer("My favorite food is", return_tensors="pt").to(torch_device)
        # The hub repo. is updated on 2023-04-04, resulting in poor outputs.
        # See: https://github.com/huggingface/transformers/pull/24193
        expected_output = "My favorite food is a good old-fashioned, old-fashioned, old-fashioned.\n\nI'm not sure"

        output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=20)
        output_str = tokenizer.batch_decode(output_ids)[0]

        self.assertEqual(output_str, expected_output)
Would love to have common tests in the future (instead)
if (
    self.training
    and self.config.attention_dropout > 0
    and self.config._attn_implementation == "flex_attention"
):
    logger.warning_once(
        f"Setting `attention_type` to `eager` because `dropout` is not supported in {attention_type}"
    )
    attention_type = "eager"
No dropout in flex attn
that's a great catch! but we can add it to the score-mod no?
Sadly not. In the case of the head mask the order of ops doesn't matter, since we completely turn the head to all zeros, but dropout depends on the correct distribution being computed first and only then turns off some values.
The order of ops is: dropout(softmax(score_mod(Q, K)))
--> we would introduce unwanted behaviour.
Edit: llama for ref
transformers/src/transformers/models/llama/modeling_llama.py
Lines 339 to 340 in 0b5b5e6
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
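For illustration, a minimal eager-attention sketch of the ordering constraint described above (shapes and the dropout probability are arbitrary placeholders, not the PR's code):

import torch
import torch.nn.functional as F

# Eager attention: the scores can be modified before the softmax (which is what a
# flex attention score_mod does), but dropout has to act on the softmax output,
# so it cannot be folded into the score_mod itself.
q, k, v = (torch.randn(1, 4, 8, 16) for _ in range(3))  # (batch, heads, seq, head_dim)

scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)  # a score_mod would act here
probs = F.softmax(scores, dim=-1)                        # correct distribution first...
probs = F.dropout(probs, p=0.1, training=True)           # ...then zero out some entries
out = probs @ v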
CI failures seem unrelated - flaky tests (e.g. XLM, Qwen2VL)
Possible TODO -> fall back to eager when using a head mask in FA2/SDPA + add head mask support in flex attention (should be possible via score mod). Edit: Added now
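For context, a minimal, generic sketch of the score_mod hook that flex attention exposes (a causal bias here, purely illustrative and not the PR's head-mask code; assumes a PyTorch version that ships torch.nn.attention.flex_attention, i.e. 2.5+):

import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 1, 4, 8, 16
query, key, value = (torch.randn(B, H, S, D) for _ in range(3))

def causal_mod(score, b, h, q_idx, kv_idx):
    # score_mod receives one raw attention score plus its (batch, head, q, kv)
    # indices and returns the modified score *before* the softmax.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

attn_output = flex_attention(query, key, value, score_mod=causal_mod)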
Yes, I was testing it for Gemma as well; it needed a transpose at the end too. If you don't mind, could you check the pull request I did for Gemma? It seems I keep failing some tests. Also, Gemma 2 now supports new stuff in the configuration, which confused me a lot. And model.config._attn_implementation is not really implemented correctly - for example, it does not actually use the attention implementation you choose. Still working on the Gemma flex attention PR; it might help with the docs as well.
@dame-cell I'll take a look tomorrow! I'm busted for today :) But as a quick tip: to get the loading handled correctly, look into my changes in the utils folder and modeling_utils. With those changes, loading should be handled correctly. Tbh, that's one of the main reasons why I think it might be better to split this into several PRs and get loading etc. right first before we start adding. Edit: One last thing to change would be to add
Hmmm, ohh I get it, I see - thanks for letting me know 😀
Feel free to ping me once you feel like this is ready! 🤗
I think it should be ready @ArthurZucker - just found something on the fly a min ago, should be good to go 😄 Edit: the CI failure doesn't seem related to this PR
Thanks for improving the API - _check_and_enable_flex_attn was missing from my initial PR!
that's a great catch! but we can add it to the score-mod no?
I'll take a look tomorrow or the day after :)
Co-authored-by: Arthur <[email protected]>
I think the PR is slowly getting wrapped up; some things that should be made into separate PR(s) imo:
- Common tests
- Docs
- Tracking when attn implementation is manually changed within the config
# PEFT possibly silently casts tensors to fp32, this potentially reconverts to correct dtype or is a no op
query_states, key_states, value_states = fa_peft_integration_check(
    query_states, key_states, value_states, target_dtype
)
New peft check within the FA2 interface.
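Conceptually, such a check boils down to something like the simplified stand-in below (an assumed sketch, not the library's actual implementation):

import torch

def fa_peft_integration_check_sketch(q, k, v, target_dtype):
    # If PEFT silently upcast the activations to float32, cast them back to the
    # dtype flash attention expects; otherwise this is a no-op.
    if target_dtype is not None and q.dtype == torch.float32:
        q, k, v = q.to(target_dtype), k.to(target_dtype), v.to(target_dtype)
    return q, k, v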
Nice
Nice, we can remove it from all of the other modeling code 👀
I'd leave that to a separate PR, this PR is big enough already :p
@ArthurZucker updated per review, looking forward to the next round ;)
score_mod=causal_mod,
enable_gqa=True,
scale=norm_factor,
return_lse=output_attentions,
I think we could drop the output attentions here, always return both and let the remaining forward (we call from) handle it.
Looked into the torch code, and they also always return both but make an if/else to return just one or both so there shouldn't be any downside imo.
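A self-contained sketch of that pattern (placeholder shapes and a no-op score_mod; the real call site passes the model's own mods and scaling):

import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 1, 4, 8, 16
query, key, value = (torch.randn(B, H, S, D) for _ in range(3))

# Always request the log-sum-exp; the calling forward decides whether to expose it.
attn_output, lse = flex_attention(
    query,
    key,
    value,
    score_mod=lambda score, b, h, q_idx, kv_idx: score,  # identity mod for the sketch
    scale=1.0 / D**0.5,
    return_lse=True,
)
attn_weights = lse.to(value.dtype)  # the lse comes back in float32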
Okay if torch code does that, it makes sense (no additional computation).
Let's add a comment as to why we do it and good for me!
Added a comment!
Very nice!
Left a few small comments but almost ready to go!
target_dtype = None
if self.config._attn_implementation == "flash_attention_2":
    input_dtype = value.dtype
    if input_dtype == torch.float32:
        if torch.is_autocast_enabled():
            target_dtype = torch.get_autocast_gpu_dtype()
        # Handle the case where the model is quantized
        elif hasattr(self.config, "_pre_quantization_dtype"):
            target_dtype = self.config._pre_quantization_dtype
        else:
            target_dtype = self.query_key_value.weight.dtype
This is quite heavy! Would be cool if we manage to only do it in the flash attention forward function (passing just the config, for example, would be enough to do so).
The problem is that we are dependent on the model's weights in the last case, i.e. self.query_key_value.weight.dtype - I moved it to a separate function, but it definitely should be deprecated at some point.
Just a list of TODOs (separate PRs):
- Deprecate FA peft integration checks
- Deprecate RoPE reconversion
- Docs
- Common Tests
- Track when attn implementation is manually changed
Otherwise, I think this PR is good now!
# lse is returned in float32
attn_weights = attn_weights.to(value.dtype)
Reconvert to correct dtype
sounds good!
# Flash Attention 2 specific PEFT check
target_dtype=self._fa_peft_dtype_check(value),
Like I said in another comment, this is dependent on the weights, so it's hard to move it into the FA forward without passing additional info (like the weights' dtype).
def _fa_peft_dtype_check(self, value):
    """
    PEFT can silently cast the dtype to float32 - this method returns the target dtype to which
    FA should convert back (if necessary). For now, we cannot move this to the forward pass
    itself due to the dependency on some of its own weights (last case).
    """
    target_dtype = None
    if self.config._attn_implementation == "flash_attention_2":
        input_dtype = value.dtype
        if input_dtype == torch.float32:
            if torch.is_autocast_enabled():
                target_dtype = torch.get_autocast_gpu_dtype()
            # Handle the case where the model is quantized
            elif hasattr(self.config, "_pre_quantization_dtype"):
                target_dtype = self.config._pre_quantization_dtype
            else:
                target_dtype = self.query_key_value.weight.dtype
    return target_dtype
The new function I mentioned.
thanks
@ArthurZucker Hopefully the last round 🤞
LGTM now thanks a lot for the refactor!
* gpt neox flex attention + refactor
* some formatting
* small fix on dropout
* add assertion on flex attn test
* flaky ci :(
* add head mask support
* style
* handle dtype, replace torch where
* fixup flex with output attns
* code review and several other fixes
* Update src/transformers/modeling_utils.py
Co-authored-by: Arthur <[email protected]>
* style
* remove unnecessary comment
* remove incorrect comment
* make flex attn check more agnostic tor versions and centralized
* change peft input dtype check to value since q and k could be affected by other stuff like RoPE
* i forgor
* flaky
* code review and small fixes
* Update src/transformers/models/gpt_neox/modeling_gpt_neox.py
Co-authored-by: Arthur <[email protected]>
---------
Co-authored-by: Arthur <[email protected]>
What does this PR do?
Adds flex attention and the refactor according to #34809
However, I discovered several issues in the current version of gemma2 (#34282):
- model.config._attn_implementation = ... should be tracked somewhere and checked for sanity as done the first time - for now it silently overwrites and could cause some ugly errors (tested by changing to flash attention 2 while not having fa2 installed); see the sketch below.

So tbh, I'm not sure whether to split this PR into several ones, e.g. a gemma fix, general loading, general tests, docs, and then subsequent models, or not.
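A quick reproduction sketch of that behaviour (the checkpoint name is just an example from this PR's tests):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m-deduped")

# The assignment below is accepted without the sanity checks that run at load
# time - e.g. it succeeds even if flash-attn is not installed - which can then
# surface as confusing errors later instead of failing here.
model.config._attn_implementation = "flash_attention_2"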
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker