In AWQ inference, the quantized weight matrix is dequantized to fp16 and then multiplied by the input matrix `x` in the linear layer. But when I directly replace the fp16 matrix after dequantization with the original weight matrix (llama2-7b-hf), the inference error becomes particularly large. According to the calculation formula, `WX = (DQ(Q(W·s)) · s^-1) X`, where DQ fuses the scale `s^-1`; the dequantized matrix should therefore be equivalent to the original matrix `W`.

Here `weight.T` is the original weight matrix in fp16 (llama2-7b-hf). The perplexity on wikitext2 goes from 5.619 to 1324.6.
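For reference, here is a minimal sketch of the round-trip in the formula above, assuming a symmetric 4-bit round-to-nearest quantizer as a stand-in for AutoAWQ's actual kernels; the scale shapes and names are illustrative. It shows that `DQ(Q(W·s)) · s^-1` only approximates `W`, because rounding inside `Q` discards information:

```python
import torch

def quantize_dequantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Toy symmetric round-to-nearest quantizer, per output channel."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # per-row step size
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale  # DQ(Q(w))

torch.manual_seed(0)
W = torch.randn(256, 256)       # stand-in for an fp16 weight matrix
s = torch.rand(1, 256) + 0.5    # AWQ-style per-input-channel scales

# DQ(Q(W*s)) * s^-1 approximates W, but rounding makes it inexact.
W_roundtrip = quantize_dequantize(W * s) / s
rel_err = (W_roundtrip - W).norm() / W.norm()
print(f"relative round-trip error: {rel_err:.4f}")  # small but non-zero
```

The round-trip error here is the ordinary quantization loss and is small; the puzzle in this issue is why substituting the exact `W` back, which should remove even that error, instead blows up perplexity.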
grysgreat changed the title from "Error when substituting the quantized matrix multiplication operator." to "Question: Error when substituting the quantized matrix multiplication operator." on Dec 4, 2024.
Hi @grysgreat, this seems to be expected. You cannot recover the original fp16 with a transpose since you have lost a bunch of information when you quantize -> dequantize -> transpose.
Thanks for your answer, but I'm still curious which part of the AutoAWQ algorithm prevents replacing the weight in the linear layer directly with the original weight (downloaded from HF). (In AutoGPTQ, the same replacement gives the correct result, i.e., the same accuracy as fp16.)
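To make the comparison concrete, here is a hypothetical sketch of the substitution being described, using the same toy quantizer as above rather than either library's real layers; `ToyQuantLinear` and its fields are invented for illustration. In this isolated setting, swapping the round-tripped weight for the original `W` reproduces the fp16 output exactly, which is the intuition behind the question:

```python
import torch
import torch.nn as nn

def quantize_dequantize(w, n_bits=4):  # same toy quantizer as above
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

class ToyQuantLinear(nn.Module):
    """Toy stand-in for a quantized linear layer: stores one weight
    matrix (round-tripped or original) and applies y = x @ weight.T."""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        self.register_buffer("weight", weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T

torch.manual_seed(0)
W = torch.randn(256, 256)
x = torch.randn(8, 256)
ref = x @ W.T  # fp16 reference output

quant_layer = ToyQuantLinear(quantize_dequantize(W))  # DQ(Q(W))
swapped_layer = ToyQuantLinear(W)                     # substituted original

print(f"{((quant_layer(x) - ref).norm() / ref.norm()).item():.4f}")    # small
print(f"{((swapped_layer(x) - ref).norm() / ref.norm()).item():.4f}")  # 0.0000
```

In this toy setting the substitution is exact, so whatever breaks in the full AWQ pipeline presumably involves state outside the single weight matrix (e.g., where the scale `s^-1` is fused), which is what the question is asking about.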