Qwen2-72B w4a8 empty output #2392
Comments
Hi @lishicheng1996, does this only happen in w4a8_awq mode for the Qwen model? Could you please try again without the padding operation? BTW, I opened the link to the padding scripts you provided but found nothing.
Thanks for your reply! I updated the link above to point to the issue containing the padding scripts.
I think it's caused by padding the weights with zeros. In theory, padding with 0 doesn't affect the computation. However, w4a8_awq quantizes the model group-wise (group_size=128), so some groups end up entirely zero and the quantization parameters computed for them are abnormal. Note that these values may also be applied to the activations, making the activations abnormal, so the network output is unexpected. You can try to pad the weights like this. There may be better ways, but it's just an example for you :)
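The failure mode described above can be demonstrated in a few lines (a minimal NumPy sketch; the `group_scales` helper and the symmetric-absmax int4 scale formula are my assumptions for illustration, not the actual quantizer code):

```python
import numpy as np

GROUP_SIZE = 128

def group_scales(weight, group_size=GROUP_SIZE):
    """Per-group absmax scales for symmetric int4 quantization (illustrative)."""
    groups = weight.reshape(-1, group_size)
    amax = np.abs(groups).max(axis=1)
    return amax / 7.0  # symmetric int4 range: [-7, 7]

# A row padded with zeros: the padded tail forms an all-zero group,
# whose scale collapses to 0, so dividing by it during quantization
# produces inf/NaN values.
row = np.random.randn(256).astype(np.float32)
padded_zeros = np.concatenate([row, np.zeros(128, dtype=np.float32)])
print(group_scales(padded_zeros)[-1])  # 0.0

# Padding with tiny random values keeps every group's scale finite
# while perturbing the result only negligibly.
pad = (np.random.randn(128) * 1e-4).astype(np.float32)
padded_rand = np.concatenate([row, pad])
print(group_scales(padded_rand)[-1])  # small but non-zero
```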
Thanks for your help, I'll try it ^_^
Hi, will padding with random weights lead to different output compared to no padding?
Hi @lishicheng1996, did random padding work for you? I'm still getting empty output.
System Info
GPU: 4090
TensorRT: 10.3
TensorRT-LLM: 0.13.0.dev2024081300
Who can help?
@Tracin Could you please have a look, thank you very much.
Reproduction
Hi, I tried Qwen2-72B w4a8 quantization, but got empty output. I did it with the following steps:
Padding
Following the script here: #1833
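The padding step can be sketched roughly as follows (a minimal NumPy sketch; the `pad_to_multiple` function name, the target multiple of 256, and the use of small random filler instead of zeros are my assumptions for illustration, not the exact script from #1833):

```python
import numpy as np

def pad_to_multiple(weight, axis, multiple, eps=1e-4):
    """Pad `axis` of `weight` up to the next multiple of `multiple`.

    Uses small random values rather than zeros so that group-wise
    quantization never sees an all-zero group.
    """
    size = weight.shape[axis]
    pad = (-size) % multiple
    if pad == 0:
        return weight
    shape = list(weight.shape)
    shape[axis] = pad
    filler = (np.random.randn(*shape) * eps).astype(weight.dtype)
    return np.concatenate([weight, filler], axis=axis)

# Qwen2-72B's intermediate_size (29568) is not a multiple of 256,
# so the MLP weights need padding along that dimension (toy shapes here):
w = np.random.randn(29568, 8).astype(np.float32)
w_padded = pad_to_multiple(w, axis=0, multiple=256)
print(w_padded.shape)  # (29696, 8)
```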
Quantization Command:
python TensorRT-LLM/examples/quantization/quantize.py --model_dir Qwen2-72B-Instruct-padding/ --qformat w4a8_awq --output_dir w4a8_ckpt
Build Engine
trtllm-build --checkpoint_dir w4a8_ckpt --output_dir w4a8_engine --gemm_plugin auto
Test output
python TensorRT-LLM/examples/run.py --max_output_len=50 --tokenizer_dir ./Qwen2-72B-Instruct-padding/ --engine_dir=w4a8_engine
Expected behavior
Generates output normally, as in the fp8 or int4_awq cases.
Actual behavior
Empty outputs
additional notes
None