Optimize MLA/GQA/MQA Triton decoding #1138

Merged: 5 commits into sgl-project:main, Aug 19, 2024

Conversation

ispobock
Collaborator

Motivation

Optimize memory access for MLA/GQA/MQA decoding.

Modification

One block handles BLOCK_H q heads that share the same k/v head. Inspired by InternLM/lmdeploy#1649.
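For readers unfamiliar with the trick, below is a minimal Triton-style sketch of the idea, not the PR's actual kernel: an illustration under simplifying assumptions (single sequence, contiguous non-paged KV cache, no sequence splitting, fp16 inputs, head dim a power of 2 and >= 16). One program covers BLOCK_H query heads that share a k/v head, so each k/v tile is loaded from global memory once per block instead of once per query head.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _gqa_decode_kernel(
    Q, K, V, O,                # Q/O: [H_q, D], K/V: [S, H_kv, D]
    seq_len,
    stride_kt, stride_kh,      # K/V strides over (token, kv head)
    GROUP: tl.constexpr,       # q heads per kv head (H_q // H_kv)
    BLOCK_H: tl.constexpr,     # q heads handled by one program (>= 16 for tl.dot)
    BLOCK_N: tl.constexpr,     # k/v tokens processed per iteration
    D: tl.constexpr,           # head dim (power of 2, >= 16)
    SCALE: tl.constexpr,
):
    kv_head = tl.program_id(0)
    h = kv_head * GROUP + tl.arange(0, BLOCK_H)   # q heads of this kv group
    h_mask = tl.arange(0, BLOCK_H) < GROUP
    d = tl.arange(0, D)

    q = tl.load(Q + h[:, None] * D + d[None, :],
                mask=h_mask[:, None], other=0.0)          # [BLOCK_H, D]

    m = tl.full([BLOCK_H], float("-inf"), tl.float32)     # running softmax max
    l = tl.zeros([BLOCK_H], tl.float32)                   # running softmax sum
    acc = tl.zeros([BLOCK_H, D], tl.float32)

    for start in range(0, seq_len, BLOCK_N):
        n = start + tl.arange(0, BLOCK_N)
        n_mask = n < seq_len
        # The k/v tile of the shared kv head is loaded once here and reused by
        # all BLOCK_H q heads below -- this is the memory-access saving.
        k = tl.load(K + n[:, None] * stride_kt + kv_head * stride_kh + d[None, :],
                    mask=n_mask[:, None], other=0.0)      # [BLOCK_N, D]
        v = tl.load(V + n[:, None] * stride_kt + kv_head * stride_kh + d[None, :],
                    mask=n_mask[:, None], other=0.0)

        s = tl.dot(q, tl.trans(k)) * SCALE                # [BLOCK_H, BLOCK_N]
        s = tl.where(n_mask[None, :], s, float("-inf"))

        m_new = tl.maximum(m, tl.max(s, 1))               # online softmax update
        p = tl.exp(s - m_new[:, None])
        alpha = tl.exp(m - m_new)
        l = l * alpha + tl.sum(p, 1)
        acc = acc * alpha[:, None] + tl.dot(p.to(v.dtype), v)
        m = m_new

    out = acc / l[:, None]
    tl.store(O + h[:, None] * D + d[None, :],
             out.to(O.dtype.element_ty), mask=h_mask[:, None])


def gqa_decode(q, k, v):
    """q: [H_q, D], k/v: [S, H_kv, D] fp16 CUDA tensors. Illustrative launch only."""
    H_q, D = q.shape
    S, H_kv, _ = k.shape
    o = torch.empty_like(q)
    group = H_q // H_kv
    _gqa_decode_kernel[(H_kv,)](
        q, k, v, o, S,
        k.stride(0), k.stride(1),
        GROUP=group,
        BLOCK_H=max(16, triton.next_power_of_2(group)),
        BLOCK_N=32,
        D=D,
        SCALE=D ** -0.5,
    )
    return o
```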

@zhyncs zhyncs self-assigned this Aug 17, 2024
@ispobock
Collaborator Author

Tested on A100-80G:
DeepSeek-V2-Lite

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    128.0
Successful requests:                     5000
Benchmark duration (s):                  238.01
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1088588
Request throughput (req/s):              21.01
Input token throughput (tok/s):          4990.76
Output token throughput (tok/s):         4579.34
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   82822.78
Median E2E Latency (ms):                 79653.86
---------------Time to First Token----------------
Mean TTFT (ms):                          7167.67
Median TTFT (ms):                        4229.26
P99 TTFT (ms):                           21327.09
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1073.28
Median TPOT (ms):                        473.77
P99 TPOT (ms):                           7907.65
---------------Inter-token Latency----------------
Mean ITL (ms):                           409.14
Median ITL (ms):                         165.46
P99 ITL (ms):                            1814.59
==================================================

subject: abstract_algebra, #q:100, acc: 0.270
subject: anatomy, #q:135, acc: 0.504
subject: astronomy, #q:152, acc: 0.572
subject: business_ethics, #q:100, acc: 0.600
subject: clinical_knowledge, #q:265, acc: 0.642
subject: college_biology, #q:144, acc: 0.653
subject: college_chemistry, #q:100, acc: 0.410
subject: college_computer_science, #q:100, acc: 0.440
subject: college_mathematics, #q:100, acc: 0.380
subject: college_medicine, #q:173, acc: 0.601
Total latency: 33.251
Average accuracy: 0.535

Llama-3-8B

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    128.0
Successful requests:                     1000
Benchmark duration (s):                  49.48
Total input tokens:                      213987
Total generated tokens:                  199779
Total generated tokens (retokenized):    198032
Request throughput (req/s):              20.21
Input token throughput (tok/s):          4324.67
Output token throughput (tok/s):         4037.53
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20671.67
Median E2E Latency (ms):                 19467.73
---------------Time to First Token----------------
Mean TTFT (ms):                          3234.54
Median TTFT (ms):                        1188.96
P99 TTFT (ms):                           14154.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          186.30
Median TPOT (ms):                        90.10
P99 TPOT (ms):                           1976.89
---------------Inter-token Latency----------------
Mean ITL (ms):                           91.19
Median ITL (ms):                         61.85
P99 ITL (ms):                            308.93
==================================================

subject: abstract_algebra, #q:100, acc: 0.330
subject: anatomy, #q:135, acc: 0.696
subject: astronomy, #q:152, acc: 0.684
subject: business_ethics, #q:100, acc: 0.630
subject: clinical_knowledge, #q:265, acc: 0.751
subject: college_biology, #q:144, acc: 0.771
subject: college_chemistry, #q:100, acc: 0.460
subject: college_computer_science, #q:100, acc: 0.520
subject: college_mathematics, #q:100, acc: 0.340
subject: college_medicine, #q:173, acc: 0.636
Total latency: 41.592
Average accuracy: 0.618

Reproduce:

python3 -m sglang.launch_server --model-path DeepSeek-V2-Lite --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer DeepSeek-V2-Lite --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10

python3 -m sglang.launch_server --model-path Meta-Llama-3-8B --port 30000 --trust-remote-code --disable-radix-cache --disable-flashinfer --tp=1
python3 -m sglang.bench_serving --backend sglang --tokenizer Meta-Llama-3-8B --dataset-path /workdir/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 128
python3 benchmark/mmlu/bench_sglang.py --nsub 10

@zhyncs
Member

zhyncs commented Aug 17, 2024

Nice work! TL;DR: the reuse moves from L2 cache to the thread block. Is that right? @ispobock

@ispobock
Collaborator Author

> Nice work! TL;DR: the reuse moves from L2 cache to the thread block. Is that right? @ispobock

@zhyncs The previous version relied on reuse from the L2 cache. In this version, the shared k/v head is loaded once and reused from SMEM by all q heads in the block.
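To make the contrast concrete, a hypothetical picture of the two launch layouts (values and names are made up for DeepSeek-V2-Lite-like shapes, not taken from the PR's code):

```python
# Made-up shapes for illustration only.
batch_size, num_q_heads, num_kv_heads, num_kv_splits = 64, 16, 1, 8

# Before: one program per q head -> every program streams the same k/v tile
# from global memory, so reuse only happens if the tile is still in L2.
grid_before = (batch_size, num_q_heads, num_kv_splits)   # 8192 programs

# After: one program per shared k/v head covering BLOCK_H q heads -> the k/v
# tile is loaded once and reused from registers/SMEM by all q heads in the block.
grid_after = (batch_size, num_kv_heads, num_kv_splits)   # 512 programs
```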

@zhyncs
Member

zhyncs commented Aug 17, 2024

> Tested on A100-80G: DeepSeek-V2-Lite / Llama-3-8B
> (benchmark results and reproduce commands quoted from the comment above)

ref #905 (comment)

After a brief look, the throughput has roughly doubled compared to the previous MLA version, great work! cc @merrymercy @Ying1123 @hnyls2002

@zhyncs zhyncs requested a review from yzh119 August 17, 2024 16:24
Member

@zhyncs zhyncs left a comment


overall LGTM @ispobock

Currently all CIs have passed, including the code path Llama 3 takes when FlashInfer is disabled. The benchmark and eval results of this PR also meet expectations. Verification of DeepSeek V2 on A100 TP8 and H100 TP8 can be done later, and we can keep analyzing whether there is room for further optimization with nsys and ncu. Following yesterday's brief discussion, this is mainly a quick implementation by @ispobock; thanks also to @grimoire for the reference implementation in InternLM/lmdeploy#1649 and to @lzhangzz for the discussion comments.

@MARD1NO and @yzh119, if you are interested, you are welcome to help review and give optimization suggestions. Thanks.

cc @merrymercy @Ying1123 @hnyls2002

@zhyncs
Member

zhyncs commented Aug 17, 2024

@Xu-Chen @lxww302 I noticed that you have used SGLang's DeepSeek V2 TP8 MLA implementation before. Could you help verify the performance of the new version, for example on devices you have such as A100 TP8, A800 TP8, or H100 TP8? Thanks very much!

git clone -b decode_gqa_opt https://github.com/ispobock/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 --port 30000 --trust-remote-code --disable-radix-cache --enable-mla --tp=8

python3 -m sglang.bench_serving --backend sglang

@zhyncs zhyncs mentioned this pull request Aug 17, 2024
@81549361

81549361 commented Aug 17, 2024

> @Xu-Chen @lxww302 I noticed that you have used SGLang's DeepSeek V2 TP8 MLA implementation before. Could you help verify the performance of the new version? (setup and benchmark commands quoted from the comment above)

I have 8x H100s; I ran your command:

Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     1000      
Benchmark duration (s):                  182.31    
Total input tokens:                      236142    
Total generated tokens:                  215614    
Total generated tokens (retokenized):    215037    
Request throughput (req/s):              5.49      
Input token throughput (tok/s):          1295.28   
Output token throughput (tok/s):         1182.68   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   75887.79  
Median E2E Latency (ms):                 77685.35  
---------------Time to First Token----------------
Mean TTFT (ms):                          43446.36  
Median TTFT (ms):                        39279.88  
P99 TTFT (ms):                           104146.94 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          181.96    
Median TPOT (ms):                        161.47    
P99 TPOT (ms):                           653.64    
---------------Inter-token Latency----------------
Mean ITL (ms):                           152.74    
Median ITL (ms):                         99.26     
P99 ITL (ms):                            465.58    
==================================================

@zhyncs
Member

zhyncs commented Aug 17, 2024

Thanks! Is it H100 SXM or NVL? @81549361

@zhyncs
Member

zhyncs commented Aug 17, 2024

Could you collect the env info with python3 -m sglang.check_env? @81549361

@vhain
Contributor

vhain commented Aug 17, 2024

Not sure if this is helpful, but I ran llmperf on both the main branch and this PR's branch. Overall this PR seems to make things much faster:

llmperf command used
python token_benchmark_ray.py \
  --model "${MODEL}" \
  --mean-input-tokens 1500 \
  --stddev-input-tokens 150 \
  --mean-output-tokens 245 \
  --stddev-output-tokens 20 \
  --max-num-completed-requests "64" \
  --timeout 7200 \
  --num-concurrent-requests "8" \
  --llm-api openai \
  --additional-sampling-params '{}'
main branch
{
    "version": "2023-08-31",
    "mean_input_tokens": 1500,
    "stddev_input_tokens": 150,
    "mean_output_tokens": 245,
    "stddev_output_tokens": 20,
    "num_concurrent_requests": 8,
    "results_inter_token_latency_s_quantiles_p25": 0.03990331099470551,
    "results_inter_token_latency_s_quantiles_p50": 0.057948063652443406,
    "results_inter_token_latency_s_quantiles_p75": 0.08040066503004678,
    "results_inter_token_latency_s_quantiles_p90": 0.08383243498141633,
    "results_inter_token_latency_s_quantiles_p95": 0.08516111126646178,
    "results_inter_token_latency_s_quantiles_p99": 0.10164050496592587,
    "results_inter_token_latency_s_mean": 0.06027883582796916,
    "results_inter_token_latency_s_min": 0.03675615620323733,
    "results_inter_token_latency_s_max": 0.1020314351556132,
    "results_inter_token_latency_s_stddev": 0.0211621866217624,
    "results_ttft_s_quantiles_p25": 0.4133454477414489,
    "results_ttft_s_quantiles_p50": 1.016814228380099,
    "results_ttft_s_quantiles_p75": 11.284791270736605,
    "results_ttft_s_quantiles_p90": 11.749069100199268,
    "results_ttft_s_quantiles_p95": 11.803535583987832,
    "results_ttft_s_quantiles_p99": 11.955875016311182,
    "results_ttft_s_mean": 5.338054827436281,
    "results_ttft_s_min": 0.2691499590873718,
    "results_ttft_s_max": 12.148427874781191,
    "results_ttft_s_stddev": 5.495650480946165,
    "results_end_to_end_latency_s_quantiles_p25": 11.498506030999124,
    "results_end_to_end_latency_s_quantiles_p50": 15.51382327103056,
    "results_end_to_end_latency_s_quantiles_p75": 22.9230548851192,
    "results_end_to_end_latency_s_quantiles_p90": 23.657817971240732,
    "results_end_to_end_latency_s_quantiles_p95": 23.97725157707464,
    "results_end_to_end_latency_s_quantiles_p99": 24.61372328522615,
    "results_end_to_end_latency_s_mean": 16.84320118615142,
    "results_end_to_end_latency_s_min": 3.5896931253373623,
    "results_end_to_end_latency_s_max": 25.067169249989092,
    "results_end_to_end_latency_s_stddev": 6.076063540076458,
    "results_request_output_throughput_token_per_s_quantiles_p25": 12.432897921487776,
    "results_request_output_throughput_token_per_s_quantiles_p50": 17.950591526918625,
    "results_request_output_throughput_token_per_s_quantiles_p75": 25.023589881617227,
    "results_request_output_throughput_token_per_s_quantiles_p90": 25.61754857375858,
    "results_request_output_throughput_token_per_s_quantiles_p95": 26.080372795146523,
    "results_request_output_throughput_token_per_s_quantiles_p99": 27.12744569799552,
    "results_request_output_throughput_token_per_s_mean": 18.7890127702506,
    "results_request_output_throughput_token_per_s_min": 9.773737854436295,
    "results_request_output_throughput_token_per_s_max": 27.204481327432568,
    "results_request_output_throughput_token_per_s_stddev": 6.462698432888159,
    "results_number_input_tokens_quantiles_p25": 1419.75,
    "results_number_input_tokens_quantiles_p50": 1513.5,
    "results_number_input_tokens_quantiles_p75": 1585.25,
    "results_number_input_tokens_quantiles_p90": 1726.1000000000001,
    "results_number_input_tokens_quantiles_p95": 1812.2499999999998,
    "results_number_input_tokens_quantiles_p99": 1942.5299999999997,
    "results_number_input_tokens_mean": 1515.53125,
    "results_number_input_tokens_min": "1125",
    "results_number_input_tokens_max": "1986",
    "results_number_input_tokens_stddev": 157.1251617922921,
    "results_number_output_tokens_quantiles_p25": 271.25,
    "results_number_output_tokens_quantiles_p50": 287.0,
    "results_number_output_tokens_quantiles_p75": 304.5,
    "results_number_output_tokens_quantiles_p90": 318.0,
    "results_number_output_tokens_quantiles_p95": 326.4,
    "results_number_output_tokens_quantiles_p99": 340.37,
    "results_number_output_tokens_mean": 280.546875,
    "results_number_output_tokens_min": "78",
    "results_number_output_tokens_max": "341",
    "results_number_output_tokens_stddev": 43.62427229119711,
    "results_num_requests_started": 64,
    "results_error_rate": 0.0,
    "results_number_errors": 0,
    "results_error_code_frequency": "{}",
    "results_mean_output_throughput_token_per_s": 122.91809365087381,
    "results_num_completed_requests": 64,
    "results_num_completed_requests_per_min": 26.288247263678944,
    "timestamp": 1723922364
}
incoming branch
{
    "version": "2023-08-31",
    "mean_input_tokens": 1500,
    "stddev_input_tokens": 150,
    "mean_output_tokens": 245,
    "stddev_output_tokens": 20,
    "num_concurrent_requests": 8,
    "results_inter_token_latency_s_quantiles_p25": 0.04048058146969138,
    "results_inter_token_latency_s_quantiles_p50": 0.04134249718749723,
    "results_inter_token_latency_s_quantiles_p75": 0.042773683461634744,
    "results_inter_token_latency_s_quantiles_p90": 0.04477736409998821,
    "results_inter_token_latency_s_quantiles_p95": 0.04621570852103804,
    "results_inter_token_latency_s_quantiles_p99": 0.04943066709057319,
    "results_inter_token_latency_s_mean": 0.04202164194913325,
    "results_inter_token_latency_s_min": 0.03828613981456747,
    "results_inter_token_latency_s_max": 0.05096760665209523,
    "results_inter_token_latency_s_stddev": 0.0023344492257422154,
    "results_ttft_s_quantiles_p25": 0.3779949996387586,
    "results_ttft_s_quantiles_p50": 0.403224729700014,
    "results_ttft_s_quantiles_p75": 0.44007199979387224,
    "results_ttft_s_quantiles_p90": 0.4766438877210021,
    "results_ttft_s_quantiles_p95": 0.4872294148663059,
    "results_ttft_s_quantiles_p99": 0.49447528753429654,
    "results_ttft_s_mean": 0.4035295032663271,
    "results_ttft_s_min": 0.2787872082553804,
    "results_ttft_s_max": 0.49528229096904397,
    "results_ttft_s_stddev": 0.05853017613187361,
    "results_end_to_end_latency_s_quantiles_p25": 10.952284958562814,
    "results_end_to_end_latency_s_quantiles_p50": 11.724067542003468,
    "results_end_to_end_latency_s_quantiles_p75": 12.392438833485357,
    "results_end_to_end_latency_s_quantiles_p90": 12.949160708626732,
    "results_end_to_end_latency_s_quantiles_p95": 13.369823349895887,
    "results_end_to_end_latency_s_quantiles_p99": 13.602660472076385,
    "results_end_to_end_latency_s_mean": 11.063488117179077,
    "results_end_to_end_latency_s_min": 2.310943207703531,
    "results_end_to_end_latency_s_max": 13.658869832754135,
    "results_end_to_end_latency_s_stddev": 2.5735290879206163,
    "results_request_output_throughput_token_per_s_quantiles_p25": 23.376963498120137,
    "results_request_output_throughput_token_per_s_quantiles_p50": 24.13135072660546,
    "results_request_output_throughput_token_per_s_quantiles_p75": 24.70095651189223,
    "results_request_output_throughput_token_per_s_quantiles_p90": 25.105406335351436,
    "results_request_output_throughput_token_per_s_quantiles_p95": 25.318698051259776,
    "results_request_output_throughput_token_per_s_quantiles_p99": 26.00064578019821,
    "results_request_output_throughput_token_per_s_mean": 23.819321580789712,
    "results_request_output_throughput_token_per_s_min": 19.61920693264775,
    "results_request_output_throughput_token_per_s_max": 26.11816971864744,
    "results_request_output_throughput_token_per_s_stddev": 1.3040854008387603,
    "results_number_input_tokens_quantiles_p25": 1419.75,
    "results_number_input_tokens_quantiles_p50": 1513.5,
    "results_number_input_tokens_quantiles_p75": 1585.25,
    "results_number_input_tokens_quantiles_p90": 1726.1000000000001,
    "results_number_input_tokens_quantiles_p95": 1812.2499999999998,
    "results_number_input_tokens_quantiles_p99": 1942.5299999999997,
    "results_number_input_tokens_mean": 1515.53125,
    "results_number_input_tokens_min": "1125",
    "results_number_input_tokens_max": "1986",
    "results_number_input_tokens_stddev": 157.1251617922921,
    "results_number_output_tokens_quantiles_p25": 265.75,
    "results_number_output_tokens_quantiles_p50": 285.0,
    "results_number_output_tokens_quantiles_p75": 296.25,
    "results_number_output_tokens_quantiles_p90": 317.0,
    "results_number_output_tokens_quantiles_p95": 322.0,
    "results_number_output_tokens_quantiles_p99": 338.84999999999997,
    "results_number_output_tokens_mean": 265.484375,
    "results_number_output_tokens_min": "47",
    "results_number_output_tokens_max": "342",
    "results_number_output_tokens_stddev": 66.06466101119273,
    "results_num_requests_started": 64,
    "results_error_rate": 0.0,
    "results_number_errors": 0,
    "results_error_code_frequency": "{}",
    "results_mean_output_throughput_token_per_s": 162.73324599263228,
    "results_num_completed_requests": 64,
    "results_num_completed_requests_per_min": 36.77803923322394,
    "timestamp": 1723922279
}

@81549361

python3 -m sglang.check_env

Python: 3.12.3 | packaged by Anaconda, Inc. | (main, May  6 2024, 19:46:43) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 535.183.01
PyTorch: 2.4.0+cu121
flashinfer: 0.1.5+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.3
fastapi: 0.112.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 23.2
PIL: 10.3.0
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.0.3
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.8
anthropic: 0.34.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-159   0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-159   0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-159   0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-159   0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    0-159   0               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    0-159   0               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    0-159   0               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      0-159   0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576

@81549361

> Not sure if this is helpful, but I ran llmperf on both the main branch and this PR's branch. Overall this PR seems to make things much faster: (llmperf command and results quoted above)

What is your startup command?
I don't see any noticeable improvement on Llama 3 8B FP8.

@vhain
Contributor

vhain commented Aug 17, 2024

@81549361 The startup command I used for both branches is the same:

python3 -m sglang.launch_server \
  --model-path "${MODEL}" \
  --host 127.0.0.1 \
  --port 8080 \
  --context-length "4096" \
  --max-prefill-tokens "16384" \
  --mem-fraction-static "0.85" \
  --schedule-conservativeness "0.05" \
  --tp-size "2" \
  --dp-size "1" \
  --log-level-http warning

@ispobock
Collaborator Author

> I don't see any noticeable improvement on Llama 3 8B FP8.

@81549361 Did you add --disable-flashinfer for both branches on Llama 3?

@Xu-Chen
Contributor

Xu-Chen commented Aug 18, 2024

Awesome! Will test DeepSeek-V2-Chat on 8*A800 next week.

Tested on A800-80G: DeepSeek-V2-Lite

Main branch ( DeepSeek-V2-Lite-Chat on 1 * A800-80G )
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  90.75
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    214087
Request throughput (req/s):              11.02
Input token throughput (tok/s):          2602.12
Output token throughput (tok/s):         2375.92
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   39248.93
Median E2E Latency (ms):                 34872.34
---------------Time to First Token----------------
Mean TTFT (ms):                          10523.55
Median TTFT (ms):                        10943.01
P99 TTFT (ms):                           15801.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          233.94
Median TPOT (ms):                        151.23
P99 TPOT (ms):                           1772.93
---------------Inter-token Latency----------------
Mean ITL (ms):                           140.10
Median ITL (ms):                         117.96
P99 ITL (ms):                            385.24
==================================================
This PR ( DeepSeek-V2-Lite-Chat on 1 * A800-80G )
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  59.89
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    214102
Request throughput (req/s):              16.70
Input token throughput (tok/s):          3942.60
Output token throughput (tok/s):         3599.87
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   25766.33
Median E2E Latency (ms):                 23320.27
---------------Time to First Token----------------
Mean TTFT (ms):                          9147.00
Median TTFT (ms):                        9517.37
P99 TTFT (ms):                           14099.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          161.66
Median TPOT (ms):                        72.40
P99 TPOT (ms):                           1690.97
---------------Inter-token Latency----------------
Mean ITL (ms):                           82.39
Median ITL (ms):                         56.74
P99 ITL (ms):                            247.01
==================================================

@halexan

halexan commented Aug 18, 2024

Tested DeepSeek-V2-Chat-0628 on 8*A800

serve

python3 -m sglang.launch_server \
    --model-path /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
    --served-model-name deepseek-chat \
    --tp 8 \
    --enable-mla \
    --disable-radix-cache \
    --mem-fraction-static 0.87 \
    --schedule-conservativeness 0.1 \
    --chunked-prefill-size 32768 \
    --max-prefill-tokens 163840 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 50521

test

python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name sharegpt \
    --dataset-path /data/model-cache/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
    --model /data/model-cache/deepseek-ai/DeepSeek-V2-Chat-0628 \
    --port 50521

result

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     1000
Benchmark duration (s):                  604.96
Total input tokens:                      236142
Total generated tokens:                  215614
Total generated tokens (retokenized):    214714
Request throughput (req/s):              1.65
Input token throughput (tok/s):          390.34
Output token throughput (tok/s):         356.41
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   374607.65
Median E2E Latency (ms):                 392302.17
---------------Time to First Token----------------
Mean TTFT (ms):                          184913.93
Median TTFT (ms):                        150008.79
P99 TTFT (ms):                           424698.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1651.19
Median TPOT (ms):                        1100.21
P99 TPOT (ms):                           10328.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           890.30
Median ITL (ms):                         582.39
P99 ITL (ms):                            3893.44
==================================================

Should I use the base model, or are my params not correct?

@zhyncs
Member

zhyncs commented Aug 18, 2024

@halexan You don’t need to set these:

--mem-fraction-static 0.87 \
    --schedule-conservativeness 0.1 \
    --chunked-prefill-size 32768 \
    --max-prefill-tokens 163840 \

@Xu-Chen
Contributor

Xu-Chen commented Aug 18, 2024

Tested DeepSeek-V2-Chat-0628 on 8*A800
server

/opt/conda/bin/python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2-Chat-0628 --tp 8 --trust-remote-code --enable-mla --disable-radix-cache

test

/opt/conda/bin/python -m sglang.bench_serving --backend sglang --num-prompts 3000
This PR ( DeepSeek-V2-Chat-0628 on 8 * A800-80G )
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  498.49
Total input tokens:                      714456
Total generated tokens:                  656556
Total generated tokens (retokenized):    653778
Request throughput (req/s):              6.02
Input token throughput (tok/s):          1433.23
Output token throughput (tok/s):         1317.08
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   204276.17
Median E2E Latency (ms):                 205499.99
---------------Time to First Token----------------
Mean TTFT (ms):                          165516.98
Median TTFT (ms):                        164192.44
P99 TTFT (ms):                           353364.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          187.10
Median TPOT (ms):                        186.48
P99 TPOT (ms):                           398.82
---------------Inter-token Latency----------------
Mean ITL (ms):                           180.25
Median ITL (ms):                         108.96
P99 ITL (ms):                            567.61
==================================================

@halexan

halexan commented Aug 18, 2024

@Xu-Chen

Does your 8*A800 have NVLink?

@Xu-Chen
Contributor

Xu-Chen commented Aug 19, 2024

> Does your 8*A800 have NVLink?

Yes

@zhyncs
Member

zhyncs commented Aug 19, 2024

H100 SXM TP8 with DeepSeek V2

current PR

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  581.84
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1086980
Request throughput (req/s):              8.59
Input token throughput (tok/s):          2041.57
Output token throughput (tok/s):         1873.27
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   266797.24
Median E2E Latency (ms):                 272582.37
---------------Time to First Token----------------
Mean TTFT (ms):                          239227.95
Median TTFT (ms):                        248810.27
P99 TTFT (ms):                           488867.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          132.70
Median TPOT (ms):                        129.55
P99 TPOT (ms):                           281.64
---------------Inter-token Latency----------------
Mean ITL (ms):                           129.46
Median ITL (ms):                         78.23
P99 ITL (ms):                            453.92
==================================================

Compared to the main branch, it has improved by about 35%.

main branch

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  777.04
Total input tokens:                      1187865
Total generated tokens:                  1089941
Total generated tokens (retokenized):    1087011
Request throughput (req/s):              6.43
Input token throughput (tok/s):          1528.70
Output token throughput (tok/s):         1402.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   358316.01
Median E2E Latency (ms):                 365857.50
---------------Time to First Token----------------
Mean TTFT (ms):                          320752.33
Median TTFT (ms):                        323528.82
P99 TTFT (ms):                           670386.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          176.45
Median TPOT (ms):                        176.47
P99 TPOT (ms):                           272.25
---------------Inter-token Latency----------------
Mean ITL (ms):                           175.99
Median ITL (ms):                         128.65
P99 ITL (ms):                            517.99
==================================================

I plan to merge this PR first; compatibility support for FP8 will be completed in another PR. @ispobock @merrymercy @Ying1123 @hnyls2002

@zhyncs
Member

zhyncs commented Aug 19, 2024

To further improve performance, both W8A8 (FP8) and FP8 KV Cache are necessary and should be supported for DeepSeek V2.

@Xu-Chen
Contributor

Xu-Chen commented Aug 19, 2024

Furthermore, we should pay attention to the MLA implementation in FlashInfer (flashinfer-ai/flashinfer#237).

@zhyncs
Member

zhyncs commented Aug 19, 2024

> Furthermore, we should pay attention to the MLA implementation in FlashInfer (flashinfer-ai/flashinfer#237).

@jon-chuang When do you expect to complete the support for MLA in FlashInfer? Could you share a rough timeline? Thanks.

@zhyncs zhyncs merged commit df19125 into sgl-project:main Aug 19, 2024
5 checks passed
@microwish

@ispobock - do you mind telling a bit more about how you spotted this issue or this optimization?
Did you see the potential issue when profiling something?
Or were you directly inspired by InternLM/lmdeploy#1649?

@ispobock
Collaborator Author

@microwish Yeah, we did the profiling first and found that the decoding kernel took most of the time. Then we checked the kernel with ncu and got some directions for optimizing the memory access.
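For anyone wanting to reproduce that kind of analysis, a typical workflow looks roughly like the following; the script name and the kernel-name filter are placeholders, not the exact commands used for this PR:

nsys profile -o decode_trace --trace=cuda,nvtx python3 your_decode_benchmark.py
ncu --set full --target-processes all --kernel-name regex:decode --launch-count 3 python3 your_decode_benchmark.py

nsys shows which kernel dominates the timeline, and ncu then reports memory throughput and cache hit rates for that kernel.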
