I believe the report provides performance results for a single request, which is not conceptually the same as results under concurrency.
With concurrent requests, the maximum throughput (tokens per second) of vLLM should remain roughly stable regardless of the concurrency level. However, the per-request latency (e.g., time to first token) varies significantly depending on the vLLM configuration and the data distribution, as in the sketch below.
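A minimal concurrency benchmark against a vLLM OpenAI-compatible server can illustrate the difference. This sketch is not from the original report: the endpoint URL, model name, prompts, concurrency level, and the one-token-per-chunk counting shortcut are all assumptions.

```python
# Minimal concurrency benchmark sketch. Assumes a vLLM OpenAI-compatible
# server is already running locally, e.g.:
#   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(prompt: str) -> tuple[float, int]:
    """Send one streaming chat request; return (time to first token, completion tokens)."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    stream = await client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start
            n_tokens += 1  # rough approximation: one token per streamed chunk
    return ttft or 0.0, n_tokens


async def main(concurrency: int = 16) -> None:
    prompts = [f"Explain topic #{i} in two sentences." for i in range(concurrency)]
    start = time.perf_counter()
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    total_tokens = sum(n for _, n in results)
    print(f"concurrency={concurrency}")
    print(f"mean TTFT: {sum(t for t, _ in results) / len(results):.3f} s")
    print(f"aggregate throughput: {total_tokens / elapsed:.1f} tokens/s")


if __name__ == "__main__":
    asyncio.run(main())
```

Running it at several concurrency levels (e.g., 1, 8, 16, 64) should show aggregate throughput staying fairly flat once the server is saturated, while mean TTFT grows with concurrency.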
Model Series
Qwen2.5
What are the models used?
Qwen2.5-7B-Instruct
What is the scenario where the problem happened?
Qwen2.5-7B-Instruct performance
Is this badcase known and can it be solved using available techniques?
Information about environment
NVIDIA A100 80GB
CUDA 12.1
vLLM 0.6.3
PyTorch 2.4.0
Flash Attention 2.6.3
Transformers 4.46.0
Description
Steps to reproduce
This happens to Qwen2.5-xB-Instruct-xxx and xxx.
The badcase can be reproduced with the following steps:
The following example input & output can be used:
Expected results
The results are expected to be ...
Attempts to fix
I have tried several ways to fix this, including:
Anything else helpful for investigation
I find that this problem also happens to ...