Add a static cache that offloads to the CPU or other device #32161
Conversation
Force-pushed from 87541c5 to 8b862ab
Force-pushed from 386f231 to 3d413cb
Already looks great! IMO it would be nice to add a snippet of how to use it in the doc of the class, and once #32150 is merged, also mark this as compatible with torch.compile (while the non-static version won't be).
Really like the findings you posted on the issue, they will be useful for everyone I think!
Also it would be interesting to test this / showcase the potential for huge beam search!
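A minimal sketch of the kind of usage snippet being requested, assuming the cache ends up exposed through `generate`'s `cache_implementation` argument (the model name here is just an example, not from the PR):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
# The offloaded static cache keeps the preallocated per-layer key/value
# buffers on the CPU and streams each layer's buffers to the GPU on demand.
out = model.generate(**inputs, max_new_tokens=20, cache_implementation="offloaded_static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```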
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Very very cool! 🔥
As Arthur wrote, this PR is now missing the complementary parts:
- An example in the docstring
- Some benchmarks in the PR, for future reference
- An integration test to prevent regressions (like the ones here); see the sketch below
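For the integration test item, a hedged sketch of the sort of regression check that could work (the test name, tiny model, and exact-match criterion are assumptions, not the PR's actual test):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_offloaded_static_cache_matches_static():
    model_id = "hf-internal-testing/tiny-random-LlamaForCausalLM"
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("Hello there", return_tensors="pt")

    # Greedy decoding so the comparison is deterministic.
    ref = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                         cache_implementation="static")
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                         cache_implementation="offloaded_static")
    # Offloading should change where the K/V tensors live, not what is generated.
    assert torch.equal(ref, out)
```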
src/transformers/cache_utils.py
Outdated

    # For backwards compatibility.
    self._seen_tokens = 0
Suggested change:

    # For backwards compatibility.
    self._seen_tokens = 0
this one is unused throughout the code
See comment below.
src/transformers/cache_utils.py
Outdated
def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: | ||
"""Returns the sequence length of the cached states that were seen by the model.""" | ||
|
||
return self._seen_tokens |
let's use the fn from the static cache (since we want to remove `self._seen_tokens`)
That would be a performance degradation compared to the current integer update of `self._seen_tokens`. Is there a plan to remove `get_seq_length`? In that case, I would remove it then. Otherwise, this will be a lot slower since it will have to do a (synced) CPU operation on the offloaded cache. Let me know what you think.
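For context, a rough sketch of the two implementations under discussion (method bodies paraphrased, not copied from the PR; the names are labels, both would really be called `get_seq_length`):

```python
# (1) Integer counter, as this PR keeps it: O(1), no tensor ops involved.
def get_seq_length_counter(self, layer_idx=0):
    return self._seen_tokens  # plain Python int, bumped on every update()

# (2) StaticCache-style: count the non-empty key rows in the preallocated
#     buffer. On an offloaded cache that buffer lives on the CPU, so every
#     call becomes a synchronous CPU reduction, hence the concern above.
def get_seq_length_static(self, layer_idx=0):
    return (self.key_cache[layer_idx][0, 0].any(dim=-1)).sum()
```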
Yeah, `get_seq_length` is deprecated in favor of `cache_positions`, which should not need CPU operations as you only use them on device.
Okay, but shouldn't we just remove the `get_seq_length` methods, including all necessary variables (`_seen_tokens`), once it gets removed from the API? Since there are still quite a few usages throughout the codebase.
What I meant was that `get_seq_length` will be much less performant in the meantime if I switch it over to the `StaticCache` implementation. Which, if you want that in the meantime, I'm fine with as well.
Ah sorry, no need to use the one from static. And actually, yeah, we should probably just prevent users from using it -> no offloading for them?
Let's go with keeping `_seen_tokens` for now, and add a comment saying `# TODO @gante remove this`.
Done. 👍
> we should probably just prevent users from using it -> no offloading for them?

The method `get_seq_length` works fine, but it's still used internally. Hence my hesitation to remove it / revert it to the slower `StaticCache` version.
Sounds good!
Arthur and I will handle the deprecation of both on all cache types + internal usage afterwards 🤗
@ArthurZucker Please review and merge when happy.
Force-pushed from 3d413cb to b470b63
@ArthurZucker @gante Thanks for reviewing! Made some fixes and added unit tests in a new commit. Also added the performance test results to the PR description.
Force-pushed from fd41a90 to 890db71
LGTM 🤗
Thank you for the cool feature and for iterating with us!
docs/source/en/kv_cache.md
Outdated

    @@ -238,6 +238,24 @@ For more examples with Static Cache and JIT compilation, take a look at [StaticC
    "Hello, my name is [Your Name], and I am a [Your Profession] with [Number of Years] of"
    ```

    Like [`~OffloadedCache`] exists for offloading a "DynamicCache", there is also an offloaded static cache. Just
Suggested change:

    ### Offloaded Static Cache

    Like [`~OffloadedCache`] exists for offloading a "DynamicCache", there is also an offloaded static cache. Just
I think it deserves a subsection of its own 🤗
Will do. Also, I will add it to the overview table.
Done! Although I do wonder what "Initialization Recommended" means in the overview table?
Whether it should be initialized outside `generate` and passed to `generate` or not. cc @zucchini-nlp on this!
Some cache classes are recommended to be initialized outside generation, e.g. `StaticCache` with compilation had some issues when we initialized the cache while compiling.
Also, some cache types are not handled automatically by our API, e.g. `SinkCache`, so the user has no option but to initialize it and pass it in as `past_key_values`.
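A small sketch of what "initialize outside and pass in" looks like in practice (the model name and hyperparameters here are illustrative, not from the thread):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("The key to life is", return_tensors="pt").to(model.device)

# SinkCache is not created automatically by `generate`, so the user must
# construct it up front and hand it in via `past_key_values`.
past_key_values = SinkCache(window_length=256, num_sink_tokens=4)
out = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20)
```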
Force-pushed from 890db71 to dc8d226
Well, sorry for the late review. Very very nice! Let's make sure the slow tests pass and it's good to go. Can you try to run them locally? 🤗
    self.dtype = dtype if dtype is not None else torch.float32

    # Some models define a custom `head_dim` != config.hidden_size // config.num_attention_heads
    head_dim = config.head_dim if hasattr(config, "head_dim") else config.hidden_size // config.num_attention_heads
Suggested change:

    - head_dim = config.head_dim if hasattr(config, "head_dim") else config.hidden_size // config.num_attention_heads
    + head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
@gerbenvv I can also merge like this and commit later if you are busy!
Yeah, trying to run the tests now. I'll try running just the file.
Hmm, the tests have literally crashed the whole machine. Are they supposed to use all the GPUs on the machine? I am struggling a bit to run this.
Then I tried to fix those errors by passing the token & installing; that ran successfully, and those were the ones that I have changed.
Okay, making progress ;-) New output: no tests regarding the static, dynamic, or offloaded caches are failing. I think this should be good enough to get this merged, right?
Yep let's go! 🔥
…ace#32161)
* Add a static cache that offloads to the CPU or other device
* Fix PR comments, add unit-tests
What does this PR do?
This PR adds a static cache that offloads to another device.
Fixes #32179
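For reviewers skimming, a rough sketch of the per-layer offloading idea (illustrative only; the class name is made up, and the PR's actual implementation handles streams, devices, and cache updates in more detail):

```python
import torch

class TinyOffloadedStaticCache:
    """Toy illustration: static K/V buffers live on the CPU and are copied
    to the compute device one layer at a time during the forward pass."""

    def __init__(self, num_layers: int, shape: tuple, device: str = "cuda"):
        self.device = torch.device(device)
        # Preallocated (static) buffers on the CPU; pinned memory lets the
        # host-to-device copies run asynchronously and overlap with compute.
        self.key_cache = [torch.zeros(shape, pin_memory=True) for _ in range(num_layers)]
        self.value_cache = [torch.zeros(shape, pin_memory=True) for _ in range(num_layers)]

    def fetch_layer(self, idx: int):
        # Ideally issued on a side CUDA stream so that layer i+1's K/V
        # transfer overlaps with layer i's attention computation.
        k = self.key_cache[idx].to(self.device, non_blocking=True)
        v = self.value_cache[idx].to(self.device, non_blocking=True)
        return k, v
```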
Who can review?
@ArthurZucker @gante @n17s
Performance tests
Performance tested it with:
torch.compile(model)
and I am getting a throughput of about 535 tokens/s (the non-offloaded static cache OOMs).
Also with:
which gets 10.6 tokens/s (12.8 tokens/s static).
And with:
which does 98.8 tokens/s (106.8 tokens/s static).
And with:
which does 939.5 tokens/s (995.6 tokens/s static).
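The exact generation configs behind each measurement were not preserved above; as a hedged sketch, a throughput measurement of this kind generally has the following shape (model, prompt, and lengths are placeholders):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, cache_implementation="offloaded_static")
torch.cuda.synchronize()  # make sure all GPU work is done before timing
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```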