
Question: Error when substituting the quantized matrix multiplication operator. #670

Open
grysgreat opened this issue Dec 4, 2024 · 2 comments

Comments

@grysgreat

grysgreat commented Dec 4, 2024

In AWQ inference, the quantized weight matrix is dequantized to fp16 and then multiplied by the input matrix x in the linear layer.

But if I directly replace the dequantized fp16 matrix with the original weight matrix (llama2-7b-hf), the inference error becomes particularly large. (According to the calculation formula, W X ≈ (DQ(Q(W·s)) · s^-1) X, where DQ fuses the scale s^-1. In this case, the dequantized matrix should be approximately equivalent to the original matrix W.)

From:

            out = dequantize_gemm(qweight, qzeros, scales, w_bit, group_size)
            out = torch.matmul(x, out)

to:

            out = weight.T
            out = torch.matmul(x, out)

where weight is the original fp16 weight matrix (llama2-7b-hf) and weight.T is its transpose.

The perplexity on wikitext2 goes from 5.619 to 1324.6.
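
For reference, here is a minimal self-contained sketch of the round trip the formula above describes (this is not AutoAWQ's code; quantize_per_group, dequantize_per_group, the group size and the tensor shapes are all made up for illustration), comparing W X against (DQ(Q(W·s)) · s^-1) X:

    import torch

    def quantize_per_group(w, n_bit=4, group_size=128):
        # Asymmetric per-group quantization: returns integer codes, scales, zero points.
        out_feat, in_feat = w.shape
        w = w.reshape(out_feat, in_feat // group_size, group_size)
        w_min = w.amin(dim=-1, keepdim=True)
        w_max = w.amax(dim=-1, keepdim=True)
        q_max = 2 ** n_bit - 1
        scales = (w_max - w_min).clamp(min=1e-5) / q_max
        zeros = (-w_min / scales).round()
        q = (w / scales + zeros).round().clamp(0, q_max)
        return q, scales, zeros

    def dequantize_per_group(q, scales, zeros):
        # DQ: map integer codes back to floating point, group by group.
        w = (q - zeros) * scales
        return w.reshape(w.shape[0], -1)

    torch.manual_seed(0)
    out_feat, in_feat = 256, 512
    W = torch.randn(out_feat, in_feat)
    x = torch.randn(8, in_feat)
    s = torch.rand(in_feat) + 0.5            # AWQ-style per-input-channel scales

    # DQ(Q(W·s)) · s^-1: quantize the scaled weight, dequantize, fold the scale back out.
    q, scales, zeros = quantize_per_group(W * s)
    W_rec = dequantize_per_group(q, scales, zeros) / s

    ref = x @ W.T                            # original weights
    approx = x @ W_rec.T                     # reconstructed (dequantized) weights
    print((ref - approx).abs().max())        # small but nonzero difference

The two outputs are close but not identical, which is the approximation the formula expresses.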

@grysgreat grysgreat changed the title Error when substituting the quantized matrix multiplication operator. Question: Error when substituting the quantized matrix multiplication operator. Dec 4, 2024
@casper-hansen
Owner

Hi @grysgreat, this seems to be expected. You cannot recover the original fp16 with a transpose since you have lost a bunch of information when you quantize -> dequantize -> transpose.
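
A quick way to see that loss (a made-up symmetric 4-bit round trip for illustration, not AutoAWQ's actual kernels):

    import torch

    torch.manual_seed(0)
    w = torch.randn(4096, dtype=torch.float16).float()

    # Illustrative symmetric 4-bit quantize -> dequantize round trip.
    n_levels = 2 ** 4 - 1
    scale = w.abs().max() * 2 / n_levels
    q = (w / scale).round().clamp(-8, 7)     # only 16 representable integer values
    w_rec = q * scale

    print((w - w_rec).abs().mean())          # nonzero: w cannot be recovered exactly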

@grysgreat
Author

Thanks for your answer, but I'm still curious which part of the AutoAWQ algorithm makes it impossible to replace the weight in the linear layer directly with the original weight (downloaded from HF).
(In AutoGPTQ, such a replacement gives the correct result, i.e. the same accuracy as fp16.)
