
Qwen2-72B w4a8 empty output #2392

Open · 2 of 4 tasks

lishicheng1996 opened this issue Oct 30, 2024 · 6 comments

Labels: bug (Something isn't working), Low Precision (Issue about lower bit quantization, including int8, int4, fp8), triaged (Issue has been triaged by maintainers)

lishicheng1996 commented Oct 30, 2024

System Info

GPU: 4090
Tensorrt: 10.3
tensorrt-llm: 0.13.0.dev2024081300

Who can help?

@Tracin Could you please take a look? Thank you very much.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi, I tried Qwen2-72B w4a8 quantization, but got empty output. Here are the steps I followed:

Padding

Following the script in #1833

Quantization Command:

python TensorRT-LLM/examples/quantization/quantize.py --model_dir Qwen2-72B-Instruct-padding/ --qformat w4a8_awq --output_dir w4a8_ckpt

Build Engine

trtllm-build --checkpoint_dir w4a8_ckpt --output_dir w4a8_engine --gemm_plugin auto

Test output

python TensorRT-LLM/examples/run.py --max_output_len=50 --tokenizer_dir ./Qwen2-72B-Instruct-padding/ --engine_dir=w4a8_engine

Expected behavior

Normally generates output, as the fp8 and int4_awq engines do.

actual behavior

Empty output.

additional notes

None

lishicheng1996 added the bug label Oct 30, 2024
hello-11 added the triaged and Low Precision labels Nov 1, 2024

heyuhhh commented Nov 1, 2024

Hi @lishicheng1996, does this only happen in w4a8_awq mode for the Qwen model? Could you please try again without the padding operation?

BTW, I opened the link to the padding script you provided but found nothing.


lishicheng1996 commented Nov 4, 2024

Thanks for your reply! I've updated the link above to point to the issue that contains the padding script.
The reason for padding is that I want to run this model with 2-way TP on 4090s. The internal kernel requires the tensor size to be N x 128. The intermediate_size of Qwen2-72B is 29568, which is 115.5 * 2 (TP) * 128, so I have to pad it to build the engine.
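
Roughly, the padding boils down to something like the following (a minimal sketch, not the exact script from #1833; padded_size and pad_rows_with_zeros are just illustrative names):

import torch

def padded_size(size: int, tp: int, align: int = 128) -> int:
    # Smallest size >= `size` that is divisible by tp * align.
    block = tp * align
    return ((size + block - 1) // block) * block

orig_size = 29568                          # Qwen2-72B intermediate_size
new_size = padded_size(orig_size, tp=2)    # -> 29696
pad_size = new_size - orig_size            # -> 128

def pad_rows_with_zeros(value: torch.Tensor, pad_size: int) -> torch.Tensor:
    # Zero-pad the intermediate dimension; this zero padding is what turns
    # out to be problematic for w4a8_awq (see the discussion below).
    zeros = torch.zeros([pad_size, value.shape[1]], dtype=value.dtype)
    return torch.cat([value, zeros], dim=0)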


heyuhhh commented Nov 5, 2024

I think it's caused by padding the weights with zeros.

Padding with 0 doesn't affect the computation in theory. However, w4a8_awq quantizes the model per group (group_size=128), so some groups end up entirely zero and some of the values computed for quantization become abnormal. Note that these values may also be used for the activations, which makes the activations abnormal, so the network's output is unexpected.
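
(As a toy illustration, assuming a symmetric per-group scale derived from the group's max absolute value; the actual calibration math in the quantizer is more involved:)

import torch

group = torch.zeros(128)             # a padded, all-zero weight group
scale = group.abs().max() / 7.0      # toy symmetric INT4 scale: amax / (2**3 - 1)
print(scale)                         # tensor(0.) -- downstream rescaling can then
                                     # divide by zero and propagate inf/NaN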

You can try to pad the weights like this:
torch.randn([pad_size, shape_list[1]], dtype=value.dtype) * 0.001

There may be better ways, but this is just an example for you :)
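
Concretely, where the padding script builds the pad tensor with zeros, something along these lines instead (a sketch only; the exact tensor layout depends on the script from #1833):

import torch

def pad_rows_with_noise(value: torch.Tensor, pad_size: int) -> torch.Tensor:
    # Pad with tiny random values instead of zeros so that no 128-wide
    # quantization group ends up entirely zero.
    noise = torch.randn([pad_size, value.shape[1]], dtype=value.dtype) * 0.001
    return torch.cat([value, noise], dim=0)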

@lishicheng1996

Thanks for your help, I’ll try it ^_^


calico-niko commented Nov 29, 2024


Hi, will padding with random weights lead to different output compared to not padding at all?
Thanks for your help.

@calico-niko

@lishicheng1996 Hi, did random padding work for you? I'm still getting empty output.

hello-11 assigned heyuhhh and unassigned lishicheng1996 Dec 9, 2024