Cleaner Cache dtype and device extraction for CUDA graph generation for quantizers compatibility #29079

Merged · 6 commits · Feb 27, 2024

Conversation

BlackSamorez (Contributor):

What does this PR do?

As of now, this PR fixes a small problem preventing one from using CUDA graph generation from #28937 with quantized models.

In the long run, it would be great to have compiled generation actually working for GPTQ, AQLM, and other quantization methods.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts (Collaborator):

cc @younesbelkada

@younesbelkada (Contributor) left a comment:

Thanks a lot for fixing this!
For retrieving the correct device, the fix sounds correct.
However, for the dtype, I am afraid this might lead to some bugs / unexpected behaviours 😭 Many users perform text generation after calling utility methods such as prepare_model_for_kbit_training (using PEFT), in which case we sometimes cast the layer norms to FP32. This is quite a niche use case though. I propose to be on the safe side and retrieve the dtype similarly to what we do here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L451-L457 - can you let me know if applying that logic here would fix CUDA graph generation for quantized models?
Also, can you elaborate a bit on the original issue, i.e. what you are trying to achieve and what bug you get?
Thanks!

@BlackSamorez (Contributor, Author) commented Feb 22, 2024:

@younesbelkada The error I get on main is quite simple:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[8], line 22
     19     return new_token
     21 with torch.no_grad():
---> 22     model._setup_cache(StaticCache, BS, max_cache_len=max_cache_length)
     24     ### PREFILL
     25     # input_pos = torch.arange(seq_length, device=device)
     26     cache_position = torch.arange(seq_length, device=device)

File ~/AQLM/.conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:817, in LlamaPreTrainedModel._setup_cache(self, cache_cls, max_batch_size, max_cache_len)
    814     self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
    816 for layer in self.model.layers:
--> 817     weights = layer.self_attn.o_proj.weight
    818     layer.self_attn.past_key_value = cache_cls(
    819         self.config, max_batch_size, max_cache_len, device=weights.device, dtype=weights.dtype
    820     )

File ~/AQLM/.conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1706, in Module.__getattr__(self, name)
   1704     if name in modules:
   1705         return modules[name]
-> 1706 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'QuantizedLinear' object has no attribute 'weight'

That's why, I believe, it would make sense to source device and dtype from outside the linear layer to be quantization-agnostic.
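
(For illustration, a rough sketch of what "sourcing device and dtype from outside the linear layer" could look like in _setup_cache, guessing from the commit title "input_layernorm as the beacon of hope" further down — the actual diff may differ; input_layernorm is a plain module that quantizers leave untouched:)

for layer in self.model.layers:
    # sketch: take device/dtype from a module no quantizer replaces
    norm_weight = layer.input_layernorm.weight
    layer.self_attn.past_key_value = cache_cls(
        self.config, max_batch_size, max_cache_len,
        device=norm_weight.device, dtype=norm_weight.dtype,
    )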

@BlackSamorez (Contributor, Author):

And I don't really understand what you're proposing with the code you referenced.

@younesbelkada (Contributor):

@BlackSamorez thanks!
Sorry, I sent the wrong link; it should be:

target_dtype = self.config._pre_quantization_dtype

You can do something like:

if hasattr(self.config, "_pre_quantization_dtype"):
    target_dtype = self.config._pre_quantization_dtype
else:
    target_dtype = layer.self_attn.o_proj.weight.dtype

Does that fix the issue?
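
(Putting the two pieces together — device from a module quantizers don't touch, dtype from the config's recorded pre-quantization dtype with a weight fallback — the _setup_cache loop could then look roughly like this; a sketch, not necessarily the exact merged code:)

for layer in self.model.layers:
    # device from a module that is never replaced by quantizers (sketch)
    device = layer.input_layernorm.weight.device
    # dtype recorded on the config before quantization, if available
    if hasattr(self.config, "_pre_quantization_dtype"):
        dtype = self.config._pre_quantization_dtype
    else:
        dtype = layer.self_attn.o_proj.weight.dtype
    layer.self_attn.past_key_value = cache_cls(
        self.config, max_batch_size, max_cache_len, device=device, dtype=dtype
    )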

@BlackSamorez BlackSamorez changed the title Quantization support for CUDA graph generation. Cleaner Cache dtype and device extraction for CUDA graph generation for quantizers compatibility Feb 22, 2024
@BlackSamorez (Contributor, Author):

What you proposed seems to work fine with both FP16 and AQLM in a notebook test based on @ArthurZucker's test script.
Which other models should I try running it with?

@younesbelkada (Contributor) left a comment:

Amazing work!
Could you add a simple test in the aqlm testing file to cover that use case? 🙏


@BlackSamorez (Contributor, Author) commented Feb 22, 2024:

@younesbelkada CUDA graph generation diverges at some point:
Prefix: Hello my name is
Normal: Hello my name is Katie. I am a 20 year old college student. I am a very outgoing person. I love to have fun and be active. I am very easy going and love to make
CUDA graph: Hello my name is Katie. I am a 20 year old college student. I am a very outgoing person. I love to have fun and be active. I am a very hard worker and I

A stupid solution would be to generate shorter texts, but I'm not sure it's a good idea to have unstable tests.

P.S. As you might have guessed, I added a CUDA graph generation test for AQLM.
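
(For context, the CUDA graph generation pattern being tested follows the traceback earlier in the thread: a static cache plus a compiled single-token decode step. The sketch below is not the actual test added in this PR; decode_one_token, BS, and max_cache_length are placeholder names:)

import torch
from transformers import StaticCache

def decode_one_token(model, cur_token, cache_position):
    # one decoding step against the static cache; torch.compile with
    # mode="reduce-overhead" can capture this step as a CUDA graph
    logits = model(cur_token, cache_position=cache_position, use_cache=True).logits
    return torch.argmax(logits[:, -1], dim=-1, keepdim=True)

compiled_decode = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)

with torch.no_grad():
    model._setup_cache(StaticCache, BS, max_cache_len=max_cache_length)
    # prefill once with the full prompt, then generate token by token
    # with compiled_decode(model, cur_token, cache_position)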

@ArthurZucker (Collaborator) left a comment:

Looks very good to me!
Do you have any benchmark numbers to share?

Comment on lines +189 to +193
@unittest.skipUnless(
is_aqlm_available() and version.parse(importlib.metadata.version("aqlm")) >= version.parse("1.0.3"),
"test requires `aqlm>=1.0.3`",
)
def test_quantized_model_compile(self):
Collaborator:

Loving this test ❤️

@BlackSamorez (Contributor, Author) commented Feb 23, 2024:

@ArthurZucker @BlackSamorez The problem with it is that it's failing :) See this. So, advice needed on what to do here.

Collaborator:

I don't think it is super important that the outputs match for quantized models, no? The distributions are the same, but the kernels / ops are not run in the same order. It's a small difference, but it could explain this.
I would just add a long generation and make sure it still makes sense!

@BlackSamorez (Contributor, Author) commented Feb 23, 2024:

I don't really know how to automatically check whether text makes sense.
Alternatively, I've shortened the generation length from 40 tokens to 32, and it matches perfectly on an RTX 3090, an RTX 2080 Ti, and an A6000. Maybe we could just leave it as is, since the tests above are exact-match anyway.
(The current iteration of the tests passes.)

Collaborator:

fine with me 😉

@younesbelkada (Contributor) left a comment:

Amazing work @BlackSamorez!


@younesbelkada younesbelkada merged commit e3fc90a into huggingface:main Feb 27, 2024
19 checks passed
itazap pushed a commit that referenced this pull request May 14, 2024
…on for quantizers compatibility (#29079)

* input_layernorm as the beacon of hope

* cleaner dtype extraction

* AQLM + CUDA graph test

* is available check

* shorter text test