
FIX GPTQModel Lora Wrapper #2404

Open · wants to merge 5 commits into base: main
Conversation

@Qubitium (Contributor) commented on Feb 28, 2025

PR Changes:

  • Fix: GPTQ linear layers from GPTQModel are not compatible with the PEFT LoRA wrapper.
  • Skip the scaling multiply op when scaling == 1 (I'm not sure the GPU is smart enough to turn this micro-optimization into a no-op on its own, so it is done manually). The LoRAs we are testing for GPTQ do not use scaling: scaling = lora_alpha / r, and since r == lora_alpha it evaluates to 1 (see the sketch after this list).
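
A minimal sketch of why the multiply becomes a no-op here, assuming PEFT's usual scaling definition (scaling = lora_alpha / r) and illustrative values for r and lora_alpha that are not taken from the PR:

import torch
from torch import nn

r, lora_alpha = 16, 16                  # illustrative values where r == lora_alpha
scaling = lora_alpha / r                # PEFT's default scaling -> 1.0 here

lora_A = nn.Linear(32, r, bias=False)   # toy adapter shapes, not from the PR
lora_B = nn.Linear(r, 32, bias=False)
x = torch.randn(4, 32)
result = torch.zeros(4, 32)

if scaling == 1:                        # skip the elementwise multiply entirely
    result = result + lora_B(lora_A(x))
else:
    result = result + lora_B(lora_A(x)) * scaling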

Notes:

  • GPTQLoraLinear copies most of the code and structure from AwqLoraLinear.

TODO:

  • Add CI test for GPTQModel + LoRA

@Qubitium marked this pull request as draft on February 28, 2025 05:34
@Qubitium changed the title from [WIP] FIX GPTQ Lora Wrapper to [WIP] FIX GPTQModel Lora Wrapper on Feb 28, 2025
@Qubitium changed the title from [WIP] FIX GPTQModel Lora Wrapper to FIX GPTQModel Lora Wrapper on Feb 28, 2025
@Qubitium (Contributor, Author) commented on Feb 28, 2025

@BenjaminBossan @SunMarc PR ready for review. My co-worker is writing the PEFT CI test for this, but I want to get the review started early, if possible, before the test is ready. Our GPTQModel tests with PEFT and LoRA are passing.

This PR needs to pair with GPTQModel PR ModelCloud/GPTQModel#1358, which adds a new CI test for LoRA. We are also rounding out the tests on our side and will merge and release v2.0 today or tomorrow.

@SunMarc (Member) left a comment

Thanks! Left a couple of comments.

Comment on lines +492 to +497

    if scaling == 1:  # no scaling
        lora_output = lora_B(lora_A(dropout(sub_batch)))
    else:
        lora_output = lora_B(lora_A(dropout(sub_batch))) * scaling

Member commented:

Not sure I see the benefit of doing that.

@Qubitium (Contributor, Author) commented on Feb 28, 2025

The following test shows that PyTorch can't optimize out the obvious no-op: doing the * 1 math is about 2x slower for the 3 shapes I tested, which roughly simulate the logic. I think the same would be true if we asked the tensors to multiply by 0; it may be faster to zero all the tensors than to multiply by 0, or, as in this case, to treat * 1 as a no-op and skip it.

Benchmark Result on A100:

Benchmarking for tensor shape: (256, 256)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000005 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (512, 512)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (1024, 1024)
Operation A: 0.000014 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
import torch
import timeit

tensor_shapes = [(256, 256), (512, 512), (1024, 1024)]
repeats = 100

def benchmark_operation(operation, x, y):
    # Warm-up
    for _ in range(10):
        _ = operation(x, y)

    timer = timeit.Timer(lambda: operation(x, y))
    time_taken = timer.timeit(number=repeats) / repeats
    return time_taken

scale = 1
def operation_A(x, y):
    return x + y * scale

def operation_B(x, y):
    if scale == 1:
        return x + y
    else:
        return x + y * scale

CUDA = torch.device("cuda:0")

for shape in tensor_shapes:
    print(f"Benchmarking for tensor shape: {shape}")

    x = torch.rand(shape).to(CUDA)
    y = torch.rand(shape).to(CUDA)

    # Benchmark A
    time_A = benchmark_operation(operation_A, x, y)
    print(f"Operation A: {time_A:.6f} seconds per iteration")

    # Benchmark B
    time_B = benchmark_operation(operation_B, x, y)
    print(f"Operation B: {time_B:.6f} seconds per iteration")

    print("-" * 40)

Comment on lines 729 to 732
        if scaling == 1:  # no scaling
            result = result + lora_B(lora_A(dropout(x)))
        else:
            result = result + lora_B(lora_A(dropout(x))) * scaling
Member commented:

Same as above.

@Qubitium (Contributor, Author) commented on Feb 28, 2025

Check above. Ugly code, but worth it, I think.

@@ -19,7 +19,7 @@
from peft.import_utils import is_gptqmodel_available
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear, get_gptqmodel_quant_linear
Member commented:

since you don't use get_gptqmodel_quant_linear anymore, you can remove the function

Comment on lines 128 to 141
class GPTQLoraLinear(torch.nn.Module, LoraLayer):
    def __init__(
        self,
        base_layer,
        adapter_name,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        init_lora_weights: bool = True,
        use_rslora: bool = False,
        use_dora: bool = False,
        lora_bias: bool = False,
        **kwargs,
    ):
Member commented:

What's the difference with QuantLinear?

@Qubitium (Contributor, Author):

You're right. There may be no need for the new GPTQLoraLinear class. Checking.

@Qubitium (Contributor, Author):

Replaced with QuantLinear, but I renamed it to GPTQLoraLinear since that name is much clearer.

@@ -65,8 +70,9 @@ def forward(self, x: torch.Tensor):
             return result

         for active_adapter in self.active_adapters:
-            if active_adapter not in self.lora_A.keys():
+            if not self._adapter_in_lora_keys(active_adapter):
                 continue
@Qubitium (Contributor, Author):

@SunMarc Found another micro-optimization opportunity. The loop in forward was allocating a new keys view from self.lora_A.keys() on every call. I don't see anywhere that updates self.lora_A, so I added an lru_cache to remove this allocation after the first forward pass.
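
A minimal runnable sketch of the cached lookup described above; the class, adapter shapes, and exact placement of _adapter_in_lora_keys are assumptions for illustration, not the PR's implementation:

from functools import lru_cache

from torch import nn


class TinyLoraLayer(nn.Module):
    # Stand-in for the real LoRA layer; only the key-lookup caching is shown.
    def __init__(self):
        super().__init__()
        # stands in for LoraLayer's self.lora_A ModuleDict
        self.lora_A = nn.ModuleDict({"default": nn.Linear(8, 4, bias=False)})

    @lru_cache(maxsize=None)  # assumes adapters are never added or removed after caching
    def _adapter_in_lora_keys(self, adapter_name: str) -> bool:
        # avoids rebuilding the keys view on every forward call
        return adapter_name in self.lora_A.keys()


layer = TinyLoraLayer()
print(layer._adapter_in_lora_keys("default"))  # True, computed once
print(layer._adapter_in_lora_keys("default"))  # True, served from the cache

One caveat: lru_cache on an instance method keeps a reference to the layer, so this sketch assumes adapters are not added or removed after the first forward pass.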

Comment on lines +86 to +91
        # lora_dropout float value is not stored so we need to check for cls
        if isinstance(dropout, torch.nn.Dropout):
            output = lora_B(lora_A(dropout(x)))
        else:
            # dropout == Identity which is no-op if lora_dropout == 0.0
            output = lora_B(lora_A(x))
@Qubitium (Contributor, Author):

@SunMarc Another micro-optimization no-op skip. The lora_dropout float value is not stored, so when lora_dropout == 0.0, update_layer assigns an Identity module, which does nothing. Dropout is only applied when it is actually enabled (lora_dropout > 0.0).
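
A minimal sketch of the skip described above, assuming PEFT's behaviour of assigning nn.Identity when lora_dropout == 0.0; the adapter shapes and values are illustrative only:

import torch
from torch import nn

lora_dropout = 0.0  # illustrative value; with 0.0 the dropout module is nn.Identity
dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0.0 else nn.Identity()

lora_A = nn.Linear(16, 4, bias=False)  # toy adapter shapes, not from the PR
lora_B = nn.Linear(4, 16, bias=False)
x = torch.randn(2, 16)

if isinstance(dropout, nn.Dropout):
    output = lora_B(lora_A(dropout(x)))  # dropout actually enabled
else:
    output = lora_B(lora_A(x))           # Identity would be a no-op, so skip the extra call
print(output.shape)  # torch.Size([2, 16])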
