Checklist
1. I have searched related issues but could not find the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submit lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve it, reducing the likelihood of receiving feedback.
5. Please use English; otherwise, the issue will be closed.
Describe the bug
Hi, I encountered an inconsistency in decode throughput reporting. When benchmarking with the bench_serving script, the decode throughput implied by the reported TPOT is much lower than the decode throughput logged by the engine. The gap is especially large for small models and high-concurrency settings.
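For context when comparing the two numbers, here is a minimal Python sketch of how a per-request TPOT converts into an aggregate decode rate; the values are hypothetical placeholders, not the measured results. TPOT is reported per request, while the engine's decode throughput is summed over the running batch, so a like-for-like comparison has to account for the concurrency.

```python
# Sketch: relating per-request TPOT to aggregate decode throughput.
# All numbers below are hypothetical placeholders, not measured results.

mean_tpot_s = 0.020   # hypothetical mean TPOT from bench_serving: 20 ms per output token
concurrency = 16      # concurrent requests during the benchmark

# The decode rate seen by a single request is the inverse of its TPOT.
per_request_tps = 1.0 / mean_tpot_s                    # 50 tokens/s per request

# The engine's decode throughput is aggregated over the whole running batch,
# so a like-for-like comparison multiplies by the concurrency.
implied_aggregate_tps = concurrency * per_request_tps  # 800 tokens/s

print(f"per-request decode rate:      {per_request_tps:.1f} token/s")
print(f"implied aggregate decode tps: {implied_aggregate_tps:.1f} token/s")
```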
Reproduction
1. Start the server.
2. Run the benchmark (sequence length 2048, concurrency 16); see the command sketch below.
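The exact commands are not reproduced above, so the following is only a sketch of the kind of invocation meant by the two steps; the model path, port, and the specific bench_serving flags (a random dataset with 2048-token input/output lengths) are assumptions:

```shell
# 1. Start the server (model path and port are placeholders).
python -m sglang.launch_server --model-path <model> --port 30000

# 2. Benchmark at concurrency 16 with 2048-token sequences (flag values assumed).
python -m sglang.bench_serving --backend sglang --port 30000 \
  --dataset-name random --random-input-len 2048 --random-output-len 2048 \
  --max-concurrency 16 --num-prompts 64
```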
Observed Results:
The gap between the engine-logged decode throughput and the throughput implied by the reported TPOT is significant, which raises concerns about a discrepancy in how decode throughput is measured.
Please let me know if you need additional details or logs to assist in troubleshooting.
Environment
Python: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
GPU: NVIDIA H800
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.85
CUDA Driver Version: 535.129.03
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post5
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.1
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.7
anthropic: 0.43.0
decord: 0.6.0