
Qwen2-72B w4a8 empty output #2392

Open · 2 of 4 tasks

lishicheng1996 opened this issue Oct 30, 2024 · 6 comments

Labels: bug (Something isn't working), Low Precision (Issue about lower bit quantization, including int8, int4, fp8), triaged (Issue has been triaged by maintainers)

lishicheng1996 commented Oct 30, 2024

System Info

GPU: 4090
Tensorrt: 10.3
tensorrt-llm: 0.13.0.dev2024081300

Who can help?

@Tracin Could you please take a look? Thank you very much.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi, I tried Qwen2-72B w4a8 quantization, but got empty output. Here are the steps I followed:

Padding

Following the script in #1833

Quantization Command:

python TensorRT-LLM/examples/quantization/quantize.py --model_dir Qwen2-72B-Instruct-padding/ --qformat w4a8_awq --output_dir w4a8_ckpt

Build Engine

trtllm-build --checkpoint_dir w4a8_ckpt --output_dir w4a8_engine --gemm_plugin auto

Test output

python TensorRT-LLM/examples/run.py --max_output_len=50 --tokenizer_dir ./Qwen2-72B-Instruct-padding/ --engine_dir=w4a8_engine

Expected behavior

Normally generates output, as the fp8 and int4_awq engines do.

actual behavior

Empty output.

additional notes

None

lishicheng1996 added the bug label Oct 30, 2024
hello-11 added the triaged and Low Precision labels Nov 1, 2024

heyuhhh commented Nov 1, 2024

Hi @lishicheng1996, does this only happen in w4a8_awq mode for the Qwen model? Could you please try again without the padding operation?

BTW, I opened the link to the padding script you provided but found nothing.


lishicheng1996 commented Nov 4, 2024

Thanks for your reply! I've updated the link above to point to the issue that contains the padding script.
The reason for padding is that I want to run this model with 2-way TP on 4090s. The internal kernel requires the tensor size to be N x 128. The intermediate_size of Qwen2-72B is 29568, which is 115.5 * 2 (TP) * 128, so I have to pad it to build the engine.
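
Roughly, the padding boils down to something like the following (a minimal sketch, not the exact script from #1833; padded_size and pad_rows_with_zeros are just illustrative names):

import torch

def padded_size(size: int, tp: int, align: int = 128) -> int:
    # Smallest size >= `size` that is divisible by tp * align.
    block = tp * align
    return ((size + block - 1) // block) * block

orig_size = 29568                          # Qwen2-72B intermediate_size
new_size = padded_size(orig_size, tp=2)    # -> 29696
pad_size = new_size - orig_size            # -> 128

def pad_rows_with_zeros(value: torch.Tensor, pad_size: int) -> torch.Tensor:
    # Zero-pad the intermediate dimension; this zero padding is what turns
    # out to be problematic for w4a8_awq (see the discussion below).
    zeros = torch.zeros([pad_size, value.shape[1]], dtype=value.dtype)
    return torch.cat([value, zeros], dim=0)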


heyuhhh commented Nov 5, 2024

I think it's caused by padding the weights with zeros.

Padding with 0 doesn't affect the computation in theory. However, w4a8_awq quantizes the model per group (group_size=128), so some groups end up entirely zero and some of the values computed for quantization become abnormal. Note that these values may also be used for the activations, which makes the activations abnormal, so the network's output is unexpected.
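
(As a toy illustration, assuming a symmetric per-group scale derived from the group's max absolute value; the actual calibration math in the quantizer is more involved:)

import torch

group = torch.zeros(128)             # a padded, all-zero weight group
scale = group.abs().max() / 7.0      # toy symmetric INT4 scale: amax / (2**3 - 1)
print(scale)                         # tensor(0.) -- downstream rescaling can then
                                     # divide by zero and propagate inf/NaN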

You can try to pad the weights like this:
torch.randn([pad_size, shape_list[1]], dtype=value.dtype) * 0.001

There may be better ways, but this is just an example for you :)
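
Concretely, where the padding script builds the pad tensor with zeros, something along these lines instead (a sketch only; the exact tensor layout depends on the script from #1833):

import torch

def pad_rows_with_noise(value: torch.Tensor, pad_size: int) -> torch.Tensor:
    # Pad with tiny random values instead of zeros so that no 128-wide
    # quantization group ends up entirely zero.
    noise = torch.randn([pad_size, value.shape[1]], dtype=value.dtype) * 0.001
    return torch.cat([value, noise], dim=0)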

@lishicheng1996

Thanks for your help, I’ll try it ^_^


calico-niko commented Nov 29, 2024


Hi, will padding with random weights lead to different output compared to not padding at all?
Thanks for your help.

@calico-niko

@lishicheng1996 Hi, did random padding work for you? I'm still getting empty output.

hello-11 assigned heyuhhh and unassigned lishicheng1996 Dec 9, 2024