qserve is slower than awq int4 for llama2-7b on H100 #2509

anaivebird opened this issue Nov 28, 2024 · 2 comments

anaivebird commented Nov 28, 2024

System Info

  • GPU: NVIDIA H100 80G
  • TensorRT-LLM branch: main
  • TensorRT-LLM commit: 535c9cc

Performance results

QServe results:

| Metric | Batch size 64 | Batch size 128 |
| --- | --- | --- |
| Successful Request | 359 | 208 |
| Request_Gen_Token_Len | 1024 | 1024 |
| Avg_Input_Token_Len | 1737.53 | 1802.95 |
| Avg_Gen_Token_Len | 1000.3 | 994.21 |
| Elapse_Time (s) | 226.188 | 135.085 |
| Time_to_First_Token_AVG (s) | 9.957 | 36.664 |
| Time_to_First_Token_P99 (s) | 30.965 | 62.527 |
| Time_per_Output_Token_AVG (s) | 0.029 | 0.028 |
| Time_per_Output_Token_P99 (s) | 0.03 | 0.045 |
| Latency_P90 (s) | 57.549 | 88.988 |
| Latency_P95 (s) | 58.187 | 90.888 |
| Latency_P99 (s) | 61.007 | 92.339 |
| Latency_AVG (s) | 34.043 | 33.051 |
| Token QPS (token/s) | 1587.65 | 1530.85 |
| Service QPS (req/s) | 1.59 | 1.54 |
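For reference, the reported throughput figures follow directly from the per-run metrics above. The short Python sketch below (not part of the original report, values copied from the batch-size-64 QServe row) reproduces them:

```python
# Sketch: reproduce the reported throughput figures for the QServe batch-size-64 run.
successful_requests = 359
avg_gen_token_len = 1000.3   # average generated tokens per request
elapse_time_s = 226.188      # total wall-clock time of the run

token_qps = successful_requests * avg_gen_token_len / elapse_time_s
service_qps = successful_requests / elapse_time_s

print(f"Token QPS   ~ {token_qps:.2f} token/s")   # ~1587.7, matching the reported 1587.65
print(f"Service QPS ~ {service_qps:.2f} req/s")   # ~1.59, matching the reported 1.59
```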

AWQ INT4 results:

| Metric | Batch size 64 | Batch size 128 |
| --- | --- | --- |
| Successful Request | 369 | 177 |
| Request_Gen_Token_Len | 1024 | 1024 |
| Avg_Input_Token_Len | 1726.56 | 1804.7 |
| Avg_Gen_Token_Len | 952.3 | 931.08 |
| Elapse_Time (s) | 212.125 | 105.276 |
| Time_to_First_Token_AVG (s) | 8.244 | 30.793 |
| Time_to_First_Token_P99 (s) | 29.357 | 59.689 |
| Time_per_Output_Token_AVG (s) | 0.029 | 0.028 |
| Time_per_Output_Token_P99 (s) | 0.062 | 0.072 |
| Latency_P90 (s) | 53.352 | 72.126 |
| Latency_P95 (s) | 55.721 | 86.212 |
| Latency_P99 (s) | 58.419 | 88.854 |
| Latency_AVG (s) | 31.806 | 24.425 |
| Token QPS (token/s) | 1656.56 | 1565.43 |
| Service QPS (req/s) | 1.74 | 1.68 |
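Comparing the two tables, the QServe engine delivers a few percent lower token throughput than the AWQ INT4 engine at both batch sizes. A quick calculation with the Token QPS values copied from above (my own summary, not from the original report):

```python
# Sketch: relative token throughput of QServe vs. AWQ INT4, values taken from the tables above.
token_qps = {
    64:  {"qserve": 1587.65, "awq": 1656.56},
    128: {"qserve": 1530.85, "awq": 1565.43},
}

for bs, r in token_qps.items():
    gap = (r["awq"] - r["qserve"]) / r["awq"] * 100
    print(f"batch size {bs:3d}: QServe is {gap:.1f}% below AWQ")  # ~4.2% at 64, ~2.2% at 128
```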

Build commands:

```bash
# qserve engine build

git clone https://github.com/mit-han-lab/deepcompressor
cd deepcompressor
git checkout lmquant-v0.0.0-deprecated
export PATH="/root/miniconda3/bin:$PATH"
source activate base
conda env create -f environment.yml -n lmquant
conda activate lmquant
poetry install

cd /root/deepcompressor/projects/llm
nohup python -m lmquant.llm.run \
    configs/llm.yaml configs/qoq/g128.yaml \
    --model-name llama2-7b --model-path /root/llama2-7b \
    --smooth-xw-alpha 0 --smooth-xw-beta 1 \
    --smooth-yx-alpha 0.5 --smooth-yx-beta 0 \
    --save-model &

cd /app/tensorrt_llm/examples/llama
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir /root/llama2-7b \
                             --output_dir /root/trtllm-llama2-7b \
                             --dtype float16 \
                             --quant_ckpt_path /root/quant-llama2-7b \
                             --use_qserve \
                             --per_group \
                             --tp_size 1

trtllm-build --checkpoint_dir /root/trtllm-llama2-7b \
             --output_dir /root/engine-llama2-7b \
             --gemm_plugin auto
```


```bash
# awq int4 engine build

convert_script=../llama/convert_checkpoint.py
quantize_script=../quantization/quantize.py
model_dir=/root/llama2-7b
output_dir=/root/awq-llama2-7b
tp=1

python3 ${quantize_script} --model_dir ${model_dir} \
                           --dtype float16 \
                           --qformat int4_awq \
                           --awq_block_size 128 \
                           --output_dir ${output_dir}/llama-checkpoint-awq-int4-${tp}gpu/ \
                           --calib_size 128 \
                           --batch_size 1 \
                           --calib_max_seq_length 2048

trtllm-build --checkpoint_dir ${output_dir}/llama-checkpoint-awq-int4-${tp}gpu/ \
             --output_dir ${output_dir}/llama-trt-engine-awq-int4-${tp}gpu/ \
             --gemm_plugin float16 \
             --use_paged_context_fmha enable \
             --max_num_tokens 13120 \
             --max_seq_len 4096 \
             --max_batch_size 128
```
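To rule out a configuration mix-up between the two builds, it can help to look at the quantization settings that trtllm-build records in each engine directory's config.json. The sketch below is illustrative only; the engine paths come from the commands above, but the exact JSON layout ("pretrained_config" -> "quantization") is an assumption that may differ across TensorRT-LLM versions.

```python
# Sketch: print the quantization settings recorded in each engine's config.json.
# Assumption: the "pretrained_config" -> "quantization" layout may vary by TensorRT-LLM version.
import json
from pathlib import Path

engine_dirs = [
    "/root/engine-llama2-7b",                              # QServe engine
    "/root/awq-llama2-7b/llama-trt-engine-awq-int4-1gpu",  # AWQ INT4 engine
]

for d in engine_dirs:
    cfg = json.loads((Path(d) / "config.json").read_text())
    quant = cfg.get("pretrained_config", {}).get("quantization", cfg.get("quantization", {}))
    print(d, "->", json.dumps(quant, indent=2))
```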

anaivebird commented Nov 29, 2024

Both per-channel and per-group QServe are slower than AWQ (the numbers below are Token QPS in token/s):

| Batch size | QServe per-group | QServe per-channel | AWQ |
| --- | --- | --- | --- |
| 4 | no test | 514.54 | 602.91 |
| 64 | 1587.65 | 1675.41 | 1656.56 |
| 128 | 1530.85 | 1660.44 | 1565.43 |

bobboli commented Dec 2, 2024

Hi,
Currently the QServe kernels do not fully utilize the hardware features of the Hopper architecture. You could try Ampere or Ada cards if available.
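For anyone reproducing this, a quick way to confirm which architecture a card belongs to is to check its CUDA compute capability, e.g. with PyTorch as sketched below: 8.0/8.6 is Ampere, 8.9 is Ada, and 9.0 is Hopper (the H100 used here).

```python
# Sketch: check whether the current GPU is Ampere, Ada, or Hopper via its compute capability.
import torch

major, minor = torch.cuda.get_device_capability(0)
arch = {(8, 0): "Ampere", (8, 6): "Ampere", (8, 9): "Ada", (9, 0): "Hopper"}.get((major, minor), "other")
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor} ({arch})")
```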

Labels: Performance, triaged