
[Bug] Decode Throughput Inconsistency Between bench_serving and Engine Logs #3050

Open
leepoly opened this issue Jan 22, 2025 · 1 comment
Labels: help wanted (Extra attention is needed)
leepoly commented Jan 22, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi, I encountered an inconsistency in decode throughput reporting. When benchmarking with the bench_serving script, the decode throughput implied by the reported TPOT is much lower than the decode throughput logged by the engine. The gap is significant for small models or high-concurrency settings.

Reproduction

Start the server:

python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-0.5B \
  --trust-remote-code \
  --tp 1 \
  --load-format dummy \
  --port 30000 --host 127.0.0.1

Benchmark (input length 2048, output length 256, concurrency 16):

python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1.0 \
  --random-input-len 2048 \
  --random-output-len 256 \
  --num-prompts 16 \
  --max-concurrency 16 \
  --host 127.0.0.1 \
  --port 30000

Observed Results:

  • The bench_serving script reports a median TPOT of 4.45 ms, which implies a per-request decode rate of 1 / 4.45 ms ≈ 224 tokens/second, or 224 × 16 = 3584 tokens/second across the 16 concurrent requests (see the sketch below).
  • However, the engine logs show a decode throughput of 7026 tokens/second.
[Screenshots: bench_serving summary and engine decode-throughput log]

The gap between these metrics is significant and raises concerns about potential discrepancies in throughput measurement.
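
For reference, the arithmetic behind the bench_serving-implied figure can be reproduced with a short calculation. This is a minimal sketch using the constants from this run; the conversion is my own, not bench_serving internals:

# Convert the median TPOT (time per output token) reported by bench_serving
# into an implied aggregate decode throughput, assuming all 16 requests
# are decoding concurrently. Constants are taken from the run above.
median_tpot_s = 4.45e-3   # median TPOT in seconds
concurrency = 16          # matches --max-concurrency

per_request_rate = 1.0 / median_tpot_s              # ~224.7 tokens/s per request
implied_throughput = per_request_rate * concurrency

print(f"per-request decode rate: {per_request_rate:.1f} tokens/s")
print(f"implied aggregate throughput: {implied_throughput:.0f} tokens/s")
# -> ~3595 tokens/s, versus the ~7026 tokens/s reported in the engine logs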

Please let me know if you need additional details or logs to assist in troubleshooting.

Environment

Python: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
GPU: NVIDIA H800
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.85
CUDA Driver Version: 535.129.03
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post5
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.1
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.7
anthropic: 0.43.0
decord: 0.6.0

zhaochenyang20 self-assigned this Jan 23, 2025
zhaochenyang20 added the help wanted (Extra attention is needed) label Jan 23, 2025
zhaochenyang20 (Collaborator) commented:

cc @zhyncs, who is probably on this part 🤔
