FIX GPTQModel Lora Wrapper #2404
base: main
Conversation
@BenjaminBossan @SunMarc PR ready for review. My co-worker is writing up the PEFT CI test for this, but I want to get the review started early, if possible, before the test is ready. Our GPTQModel tests with PEFT and LoRA are passing. This PR needs to pair with GPTQModel PR ModelCloud/GPTQModel#1358, which has a new CI test for
Thanks! Left a couple of comments.
if scaling == 1:  # no scaling
    lora_output = lora_B(lora_A(dropout(sub_batch)))
else:
    lora_output = lora_B(lora_A(dropout(sub_batch))) * scaling
Not sure I see the benefit of doing that.
The following test shows that PyTorch can't optimize out the obvious no-op. Doing the * 1 math is about 2x slower for the 3 shapes I tested to roughly simulate the logic. I think the same would be true if we asked the tensors to multiply by 0: it may be faster to zero all tensors than to multiply by 0, or in this case to skip the no-op * 1 entirely.
Benchmark Result on A100:
Benchmarking for tensor shape: (256, 256)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000005 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (512, 512)
Operation A: 0.000012 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
Benchmarking for tensor shape: (1024, 1024)
Operation A: 0.000014 seconds per iteration
Operation B: 0.000006 seconds per iteration
----------------------------------------
import torch
import timeit

tensor_shapes = [(256, 256), (512, 512), (1024, 1024)]
repeats = 100

def benchmark_operation(operation, x, y):
    # Warm-up
    for _ in range(10):
        _ = operation(x, y)
    timer = timeit.Timer(lambda: operation(x, y))
    time_taken = timer.timeit(number=repeats) / repeats
    return time_taken

scale = 1

def operation_A(x, y):
    return x + y * scale

def operation_B(x, y):
    if scale == 1:
        return x + y
    else:
        return x + y * scale

CUDA = torch.device("cuda:0")

for shape in tensor_shapes:
    print(f"Benchmarking for tensor shape: {shape}")
    x = torch.rand(shape).to(CUDA)
    y = torch.rand(shape).to(CUDA)

    # Benchmark A
    time_A = benchmark_operation(operation_A, x, y)
    print(f"Operation A: {time_A:.6f} seconds per iteration")

    # Benchmark B
    time_B = benchmark_operation(operation_B, x, y)
    print(f"Operation B: {time_B:.6f} seconds per iteration")

    print("-" * 40)
src/peft/tuners/lora/layer.py
Outdated
if scaling == 1:  # no scaling
    result = result + lora_B(lora_A(dropout(x)))
else:
    result = result + lora_B(lora_A(dropout(x))) * scaling
same
Check above. Ugly code, but worth it, I think.
@@ -19,7 +19,7 @@
from peft.import_utils import is_gptqmodel_available
from peft.tuners.lora.layer import LoraLayer
from peft.tuners.tuners_utils import BaseTunerLayer
from peft.utils import get_auto_gptq_quant_linear, get_gptqmodel_quant_linear
since you don't use get_gptqmodel_quant_linear anymore, you can remove the function
src/peft/tuners/lora/gptq.py
Outdated
class GPTQLoraLinear(torch.nn.Module, LoraLayer):
    def __init__(
        self,
        base_layer,
        adapter_name,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        init_lora_weights: bool = True,
        use_rslora: bool = False,
        use_dora: bool = False,
        lora_bias: bool = False,
        **kwargs,
    ):
What's the difference with QuantLinear?
You're right. There may be no need for the new GPTQLoraLinear cls. Checking.
Replaced with QuantLinear, but I renamed it to GPTQLoraLinear as the name is much clearer.
@@ -65,8 +70,9 @@ def forward(self, x: torch.Tensor):
            return result

        for active_adapter in self.active_adapters:
-           if active_adapter not in self.lora_A.keys():
+           if not self._adapter_in_lora_keys(active_adapter):
                continue
@SunMarc Found another micro-optimization opportunity. In the loop in forward, it allocates an iterable of keys every single time. I don't see anywhere that updates self.lora_A, so I added an lru_cache to remove this allocation after the first forward pass.
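A minimal sketch of what such a cached membership check could look like (the method name _adapter_in_lora_keys comes from the diff above; the surrounding class here is illustrative, not the PR's actual code):

from functools import lru_cache

import torch.nn as nn

class LoraLayerSketch(nn.Module):
    # Illustrative stand-in for the real LoraLayer, showing only the cached check.
    def __init__(self, adapter_names):
        super().__init__()
        self.lora_A = nn.ModuleDict({name: nn.Linear(16, 8) for name in adapter_names})

    @lru_cache(maxsize=None)
    def _adapter_in_lora_keys(self, adapter_name: str) -> bool:
        # self.lora_A is not modified after init, so the result can be cached
        # per (self, adapter_name) and the keys view is not rebuilt every forward.
        return adapter_name in self.lora_A.keys()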
# lora_dropout float value is not stored so we need to check for cls
if isinstance(dropout, torch.nn.Dropout):
    output = lora_B(lora_A(dropout(x)))
else:
    # dropout == Identity which is no-op if lora_dropout == 0.0
    output = lora_B(lora_A(x))
@SunMarc Another micro-optimization no-op skip. The lora_dropout float value is not stored, so if it is 0.0, update_layer assigns an Identity module, which is a no-op. Only do dropout if it is activated (lora_dropout > 0.0).
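For reference, a sketch of the assignment pattern this relies on (PEFT's update_layer does essentially this; the exact code in PEFT may differ slightly):

import torch.nn as nn

# When lora_dropout == 0.0, a no-op Identity is stored in place of a Dropout
# module, so forward can detect it with isinstance and skip the call entirely.
lora_dropout = 0.0  # example value
dropout_module = nn.Dropout(p=lora_dropout) if lora_dropout > 0.0 else nn.Identity()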
PR Changes:
- Skip the scaling multiply op if scaling == 1 (unsure the GPU is smart enough to no-op this micro optimization, so doing it manually). The LoRAs we are testing for GPTQ do not use scale, so scale = r / lora_alpha where r == lora_alpha (i.e. scale == 1).

Notes:
- GPTQLoraLinear copies most of the code and structure from AwqLoraLinear.

TODO: