
[Bug] Decode Throughput Inconsistency Between bench_serving and Engine Logs #3050

Open
leepoly opened this issue Jan 22, 2025 · 1 comment
Labels: help wanted (Extra attention is needed)
leepoly commented Jan 22, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hi, I encountered an inconsistency in decode throughput reporting. When benchmarking with the bench_serving script, the decode throughput implied by the reported TPOT is much lower than the decode throughput logged by the engine. The gap is significant for small models or high-concurrency settings.

Reproduction

Start the server:

python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-0.5B \
  --trust-remote-code \
  --tp 1 \
  --load-format dummy \
  --port 30000 --host 127.0.0.1

Benchmark (input length 2048, output length 256, concurrency 16):

python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1.0 \
  --random-input-len 2048 \
  --random-output-len 256 \
  --num-prompts 16 \
  --max-concurrency 16 \
  --host 127.0.0.1 \
  --port 30000

Observed Results:

  • The bench_serving script reports a median TPOT of 4.45 ms, which implies a per-request decode rate of 1 / 4.45 ms ≈ 224 tokens/second, or 224 × 16 = 3584 tokens/second across the 16 concurrent requests (see the sketch below).
  • However, the engine logs show a decode throughput of 7026 tokens/second.
[Screenshots: bench_serving summary and engine decode-throughput log]

The gap between these metrics is significant and raises concerns about potential discrepancies in throughput measurement.
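
For reference, the arithmetic behind the bench_serving-implied figure can be reproduced with a short calculation. This is a minimal sketch using the constants from this run; the conversion is my own, not bench_serving internals:

# Convert the median TPOT (time per output token) reported by bench_serving
# into an implied aggregate decode throughput, assuming all 16 requests
# are decoding concurrently. Constants are taken from the run above.
median_tpot_s = 4.45e-3   # median TPOT in seconds
concurrency = 16          # matches --max-concurrency

per_request_rate = 1.0 / median_tpot_s              # ~224.7 tokens/s per request
implied_throughput = per_request_rate * concurrency

print(f"per-request decode rate: {per_request_rate:.1f} tokens/s")
print(f"implied aggregate throughput: {implied_throughput:.0f} tokens/s")
# -> ~3595 tokens/s, versus the ~7026 tokens/s reported in the engine logs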

Please let me know if you need additional details or logs to assist in troubleshooting.

Environment

Python: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
GPU: NVIDIA H800
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.85
CUDA Driver Version: 535.129.03
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post5
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.1
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.7
anthropic: 0.43.0
decord: 0.6.0

zhaochenyang20 self-assigned this Jan 23, 2025
zhaochenyang20 added the help wanted (Extra attention is needed) label Jan 23, 2025
zhaochenyang20 (Collaborator) commented:

cc @zhyncs, who is probably on this part 🤔
