Llama: SDPA FA2 path + static cache fix #30437
Conversation
cc @younesbelkada @fxmarty (SDPA + FA2 changes)
@@ -533,10 +533,13 @@ def forward(
        cache_position: Optional[torch.LongTensor] = None,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        if output_attentions:
            # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
            # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is
120 char limit OCD :D
It is not in `make style`, is it?
Line 5 in a98c417: # Never enforce `E501` (line length violations).
LGTM! Thanks!
Okay very interesting and partially related to #30442.
Before merging, let's test that this does not affect compiled performance, as indexing can be costly.
As Arthur wrote, the current state of the PR adds a slowdown on the eager path. I'm exploring an alternative path: first standardize our static cache to behave like our other caches (living outside the model as a stand-alone object), then force the generation of the full mask if a static cache is used (which fixes this issue).
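For illustration, a minimal sketch of that alternative (hypothetical helper name, not the actual diff): skip the SDPA `is_causal` fast path whenever a `StaticCache` is in use, so the full mask is always built.

# Hypothetical sketch (not the actual transformers code): always materialize the
# full causal mask when a StaticCache is used, instead of taking the SDPA
# `is_causal` fast path that leaves the mask as None.
import torch
from transformers.cache_utils import StaticCache

def _update_causal_mask(self, attention_mask, input_tensor, past_key_values):
    using_static_cache = isinstance(past_key_values, StaticCache)
    if self.config._attn_implementation == "sdpa" and not using_static_cache:
        # Fast path: returning None lets SDPA dispatch to Flash Attention 2.
        if attention_mask is None or bool(torch.all(attention_mask == 1)):
            return None
    # Slow-but-correct path: build the full (batch, 1, q_len, kv_len) float mask so
    # the padded static-cache slots are explicitly masked out.
    return self._build_full_causal_mask(attention_mask, input_tensor, past_key_values)  # hypothetical helper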
Thanks a lot for noticing; indeed, I should have tested static without compile. It is very good to add a test for it.
I am wondering: couldn't this change cause issues with CUDA graph capture, adding dynamicity in the tensor shapes? Making the capture slower?
@@ -1073,6 +1080,7 @@ def _update_causal_mask(
        if self.config._attn_implementation == "sdpa":
            # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument,
            # in order to dispatch on Flash Attention 2.
            breakpoint()
remove
# `torch==2.2` will throw an error on this test (as in other compilation tests), but torch==2.1.2 and torch>2.2
# work as intended. See https://github.com/pytorch/pytorch/issues/121943
You can `skipTest` on the torch version.
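For illustration, a minimal version of that suggestion, to be placed at the top of the compile test method (the version bounds come from the comment above; `self.skipTest` and `packaging.version.parse` are standard APIs):

# Sketch: skip the compile test on torch==2.2.x, where it is known to fail
# (see https://github.com/pytorch/pytorch/issues/121943); 2.1.2 and >2.2 work.
import torch
from packaging import version

torch_version = version.parse(torch.__version__)
if version.parse("2.2.0") <= torch_version < version.parse("2.3.0"):
    self.skipTest("torch==2.2 errors out on this compilation test; 2.1.2 and >2.2 work as intended")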
# Dynamic Cache
generated_ids = model.generate(**inputs, max_new_tokens=NUM_TOKENS_TO_GENERATE, do_sample=False)
dynamic_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT_COMPLETION, dynamic_text)

# Static Cache
generated_ids = model.generate(
    **inputs, max_new_tokens=NUM_TOKENS_TO_GENERATE, do_sample=False, cache_implementation="static"
)
static_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT_COMPLETION, static_text)

# Static Cache + compile
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
generated_ids = model.generate(
    **inputs, max_new_tokens=NUM_TOKENS_TO_GENERATE, do_sample=False, cache_implementation="static"
)
static_compiled_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT_COMPLETION, static_compiled_text)
Shouldn't the static cache tests be somewhere like test_modeling_common.py and test every supported model?
NUM_TOKENS_TO_GENERATE = 40
EXPECTED_TEXT_COMPLETION = {
    7: [
Note you might get different results from A10 to T4, hence this dict. This change can cause the push-important-models test + slow tests to fail 😢
You can now SSH into our runners and get the value of the generations for each device type.
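For context, a sketch of how such a dict is typically consumed in the slow tests (the exact lookup in the test file may differ, and the strings below are placeholders): the expectations are keyed by the GPU's major compute capability.

# Sketch: expected generations differ between GPU families, so the test keys the
# expectations by the device's major compute capability (7 = T4, 8 = A10/A100).
import torch

NUM_TOKENS_TO_GENERATE = 40
EXPECTED_TEXT_COMPLETION = {
    7: ["<expected completion on capability-7 runners, e.g. T4>"],   # placeholder, not a real output
    8: ["<expected completion on capability-8 runners, e.g. A10>"],  # placeholder, not a real output
}

major, _minor = torch.cuda.get_device_capability(0)
expected_text = EXPECTED_TEXT_COMPLETION[major]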
(closed in favor of #30476, a much better long-term solution)
What does this PR do?
Problem
The recently enabled SDPA FA2 path doesn't pass the `attn_mask` (causal mask) argument. As such, when the static cache is used, `S` (see the SDPA docs for terminology) is the full cache length as opposed to the sequence length. Therefore, the inferred mask in SDPA is incorrect, resulting in bad numerical values.

PR that introduced the issue: #30317. The issue was not caught in our llama testing suite because we didn't have a test for the static cache WITHOUT compilation: causal mask is `None` = problem described above triggered.
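To make the failure mode concrete, here is a small standalone sketch (toy shapes and names, not the modeling code): during decoding the query length is 1 while the static cache keys/values are padded to the full cache length, so passing no mask lets SDPA attend over the empty cache slots.

# Toy illustration of the bug: keys/values are padded to the static cache length.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, D = 1, 1, 8
valid, S = 5, 16                        # 5 real cache entries, cache padded to S=16
q = torch.randn(B, H, 1, D)             # one decoded token, q_len = 1
k = torch.zeros(B, H, S, D)
v = torch.zeros(B, H, S, D)
k[..., :valid, :] = torch.randn(B, H, valid, D)
v[..., :valid, :] = torch.randn(B, H, valid, D)

# Correct: an explicit mask that hides the empty cache slots.
mask = torch.zeros(1, 1, 1, S)
mask[..., valid:] = float("-inf")
ref = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# What the FA2 fast path effectively did with a static cache: no mask at all,
# so the softmax also spreads probability over the empty (zero) slots.
bad = F.scaled_dot_product_attention(q, k, v, attn_mask=None)

print(torch.allclose(ref, bad))  # False -> the "bad numerical values" described above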
Solution

There were two possible paths here:
1. Stop letting the causal mask be `None` when the static cache is used (Fix attn mask for static cache #30414);
2. Keep the `None` causal mask (and thus the FA2 dispatch) and index the static cache so that only the filled positions take part in the attention (this PR).

I went with 2, as it saves us tons of masked computations :) I've also ensured the static cache without compilation is numerically tested in the llama test file.

Fixes #30417
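One possible shape of such an indexing-based fix, sketched with assumed variable names (`key_states`, `value_states`, `cache_position`) and not necessarily matching the PR's actual diff:

# Hypothetical illustration of option 2: crop the zero-padded static cache before SDPA
# so that S matches the filled length, keeping attn_mask=None and therefore the
# Flash Attention 2 dispatch inside SDPA.
import torch
import torch.nn.functional as F

def sdpa_with_cropped_static_cache(query_states, key_states, value_states, cache_position):
    # key/value come back from the static cache with length max_cache_len;
    # cache_position holds the positions written during this forward pass.
    kv_len = int(cache_position[-1]) + 1
    key_states = key_states[:, :, :kv_len, :]
    value_states = value_states[:, :, :kv_len, :]
    return F.scaled_dot_product_attention(
        query_states,
        key_states,
        value_states,
        attn_mask=None,
        # only the prefill pass (q_len > 1) needs the causal structure
        is_causal=query_states.shape[2] > 1,
    )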
Slow tests ran locally: `llama`, `gemma`, `cohere`, `test_cache_utils.py`.