
FIX GPTQModel Lora Wrapper #2404

Open · wants to merge 5 commits into base: main
Conversation

@Qubitium (Contributor) commented on Feb 28, 2025

PR Changes:

  • Fix: GPTQ linear layers from GPTQModel are not compatible with the PEFT LoRA wrapper.
  • Skip the scaling multiply op when scaling == 1 (I'm not sure the GPU is smart enough to turn this micro-optimization into a no-op on its own, so it is done manually). The LoRAs we are testing for GPTQ do not use scaling: scaling = lora_alpha / r, and since r == lora_alpha it evaluates to 1 (see the sketch after this list).
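
A minimal sketch of why the multiply becomes a no-op here, assuming PEFT's usual scaling definition (scaling = lora_alpha / r) and illustrative values for r and lora_alpha that are not taken from the PR:

import torch
from torch import nn

r, lora_alpha = 16, 16                  # illustrative values where r == lora_alpha
scaling = lora_alpha / r                # PEFT's default scaling -> 1.0 here

lora_A = nn.Linear(32, r, bias=False)   # toy adapter shapes, not from the PR
lora_B = nn.Linear(r, 32, bias=False)
x = torch.randn(4, 32)
result = torch.zeros(4, 32)

if scaling == 1:                        # skip the elementwise multiply entirely
    result = result + lora_B(lora_A(x))
else:
    result = result + lora_B(lora_A(x)) * scaling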

Notes:

  • GPTQLoraLinear copies most of the code and structure from AwqLoraLinear.

TODO:

  • Add CI test for GPTQModel + LoRA

@Qubitium marked this pull request as draft on February 28, 2025 05:34
@Qubitium changed the title from [WIP] FIX GPTQ Lora Wrapper to [WIP] FIX GPTQModel Lora Wrapper on Feb 28, 2025
@Qubitium changed the title from [WIP] FIX GPTQModel Lora Wrapper to FIX GPTQModel Lora Wrapper on Feb 28, 2025
@Qubitium (Contributor, Author) commented on Feb 28, 2025

@BenjaminBossan @SunMarc PR ready for review. My co-worker is writing the PEFT CI test for this, but I want to get the review started early, if possible, before the test is ready. Our GPTQModel tests with PEFT and LoRA are passing.

This PR needs to pair with GPTQModel PR ModelCloud/GPTQModel#1358, which adds a new CI test for LoRA. We are also rounding out the tests on our side and will merge and release v2.0 today or tomorrow.

@SunMarc (Member) left a comment

Thanks! Left a couple of comments.

Comment on lines +492 to +497

    if scaling == 1:  # no scaling
        lora_output = lora_B(lora_A(dropout(sub_batch)))
    else:
        lora_output = lora_B(lora_A(dropout(sub_batch))) * scaling

Member commented:

Not sure I see the benefit of doing that.

@Qubitium (Contributor, Author) commented on Feb 28, 2025

The following test shows that PyTorch can't optimize out the obvious no-op: doing the * 1 math is about 2x slower for the 3 shapes I tested, which roughly simulate the logic. I think the same would be true if we asked the tensors to multiply by 0; it may be faster to zero all the tensors than to multiply by 0, or, as in this case, to treat * 1 as a no-op and skip it.

Benchmark Result on A100:

Benchmarking for tensor shape: (256, 256)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000005 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (512, 512)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (1024, 1024)
Operation A: 0.000014 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
import torch
import timeit

tensor_shapes = [(256, 256), (512, 512), (1024, 1024)]
repeats = 100

def benchmark_operation(operation, x, y):
    # Warm-up
    for _ in range(10):
        _ = operation(x, y)

    timer = timeit.Timer(lambda: operation(x, y))
    time_taken = timer.timeit(number=repeats) / repeats
    return time_taken

scale = 1
def operation_A(x, y):
    return x + y * scale

def operation_B(x, y):
    if scale == 1:
        return x + y
    else:
        return x + y * scale

CUDA = torch.device("cuda:0")

for shape in tensor_shapes:
    print(f"Benchmarking for tensor shape: {shape}")

    x = torch.rand(shape).to(CUDA)
    y = torch.rand(shape).to(CUDA)

    # Benchmark A
    time_A = benchmark_operation(operation_A, x, y)
    print(f"Operation A: {time_A:.6f} seconds per iteration")

    # Benchmark B
    time_B = benchmark_operation(operation_B, x, y)
    print(f"Operation B: {time_B:.6f} seconds per iteration")

    print("-" * 40)

Comment on lines 729 to 732
        if scaling == 1:  # no scaling
            result = result + lora_B(lora_A(dropout(x)))
        else:
            result = result + lora_B(lora_A(dropout(x))) * scaling
Member commented:

Same as above.

@Qubitium (Contributor, Author) commented on Feb 28, 2025

Check above. Ugly code, but worth it, I think.

@@ -19,7 +19,7 @@
from peft.import_utils import is_gptqmodel_available
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear, get_gptqmodel_quant_linear
Member commented:

since you don't use get_gptqmodel_quant_linear anymore, you can remove the function

Comment on lines 128 to 141
class GPTQLoraLinear(torch.nn.Module, LoraLayer):
    def __init__(
        self,
        base_layer,
        adapter_name,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        init_lora_weights: bool = True,
        use_rslora: bool = False,
        use_dora: bool = False,
        lora_bias: bool = False,
        **kwargs,
    ):
Member commented:

What's the difference with QuantLinear?

@Qubitium (Contributor, Author):

You're right. There may be no need for the new GPTQLoraLinear class. Checking.

@Qubitium (Contributor, Author):

Replaced with QuantLinear, but I renamed it to GPTQLoraLinear since that name is much clearer.

@@ -65,8 +70,9 @@ def forward(self, x: torch.Tensor):
             return result

         for active_adapter in self.active_adapters:
-            if active_adapter not in self.lora_A.keys():
+            if not self._adapter_in_lora_keys(active_adapter):
                 continue
@Qubitium (Contributor, Author):

@SunMarc Found another micro-optimization opportunity. The loop in forward was allocating a new keys view from self.lora_A.keys() on every call. I don't see anywhere that updates self.lora_A, so I added an lru_cache to remove this allocation after the first forward pass.
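
A minimal runnable sketch of the cached lookup described above; the class, adapter shapes, and exact placement of _adapter_in_lora_keys are assumptions for illustration, not the PR's implementation:

from functools import lru_cache

from torch import nn


class TinyLoraLayer(nn.Module):
    # Stand-in for the real LoRA layer; only the key-lookup caching is shown.
    def __init__(self):
        super().__init__()
        # stands in for LoraLayer's self.lora_A ModuleDict
        self.lora_A = nn.ModuleDict({"default": nn.Linear(8, 4, bias=False)})

    @lru_cache(maxsize=None)  # assumes adapters are never added or removed after caching
    def _adapter_in_lora_keys(self, adapter_name: str) -> bool:
        # avoids rebuilding the keys view on every forward call
        return adapter_name in self.lora_A.keys()


layer = TinyLoraLayer()
print(layer._adapter_in_lora_keys("default"))  # True, computed once
print(layer._adapter_in_lora_keys("default"))  # True, served from the cache

One caveat: lru_cache on an instance method keeps a reference to the layer, so this sketch assumes adapters are not added or removed after the first forward pass.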

Comment on lines +86 to +91
        # lora_dropout float value is not stored so we need to check for cls
        if isinstance(dropout, torch.nn.Dropout):
            output = lora_B(lora_A(dropout(x)))
        else:
            # dropout == Identity which is no-op if lora_dropout == 0.0
            output = lora_B(lora_A(x))
@Qubitium (Contributor, Author):

@SunMarc Another micro-optimization no-op skip. The lora_dropout float value is not stored, so when lora_dropout == 0.0, update_layer assigns an Identity module, which does nothing. Dropout is only applied when it is actually enabled (lora_dropout > 0.0).
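
A minimal sketch of the skip described above, assuming PEFT's behaviour of assigning nn.Identity when lora_dropout == 0.0; the adapter shapes and values are illustrative only:

import torch
from torch import nn

lora_dropout = 0.0  # illustrative value; with 0.0 the dropout module is nn.Identity
dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0.0 else nn.Identity()

lora_A = nn.Linear(16, 4, bias=False)  # toy adapter shapes, not from the PR
lora_B = nn.Linear(4, 16, bias=False)
x = torch.randn(2, 16)

if isinstance(dropout, nn.Dropout):
    output = lora_B(lora_A(dropout(x)))  # dropout actually enabled
else:
    output = lora_B(lora_A(x))           # Identity would be a no-op, so skip the extra call
print(output.shape)  # torch.Size([2, 16])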
