
[Badcase]: The performance evaluation report only covers 1-concurrency on an A100; what performance can be reached with 8 and 16 concurrent requests? #1155

Open
AuSong opened this issue Jan 7, 2025 · 1 comment



AuSong commented Jan 7, 2025

Model Series

Qwen2.5

What are the models used?

Qwen2.5-7B-Instruct

What is the scenario where the problem happened?

Qwen2.5-7B-Instruct performance

Is this badcase known and can it be solved using available techniques?

  • I have followed the GitHub README.
  • I have checked the Qwen documentation and cannot find a solution there.
  • I have checked the documentation of the related framework and cannot find useful information.
  • I have searched the issues and there is not a similar one.

Information about environment

NVIDIA A100 80GB

CUDA 12.1

vLLM 0.6.3

PyTorch 2.4.0

Flash Attention 2.6.3

Transformers 4.46.0

Description

Steps to reproduce

This happens to Qwen2.5-xB-Instruct-xxx and xxx.
The badcase can be reproduced with the following steps:

  1. ...
  2. ...

The following example input & output can be used:

system: ...
user: ...
...

Expected results

The results are expected to be ...

Attempts to fix

I have tried several ways to fix this, including:

  1. adjusting the sampling parameters, but ...
  2. prompt engineering, but ...

Anything else helpful for investigation

I find that this problem also happens to ...

Collaborator

jklj077 commented Jan 13, 2025

I believe the report provides performance results for a single request, which is conceptually different from results under concurrent requests.

With concurrent requests, the maximum throughput (tokens per second) of vLLM should remain roughly stable regardless of the concurrency level. However, the per-request latency (e.g., time to first token) varies significantly depending on the vLLM configuration and the data distribution.

We recommend running your own benchmark with the official vLLM script at https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py, which can simulate maximum concurrency, request rate, burstiness, etc., with a sample dataset.
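For reference, here is a minimal client-side sketch (not the official script above) of how one might measure time-to-first-token and aggregate output throughput at a fixed concurrency against a vLLM OpenAI-compatible endpoint. The base URL, served model name, prompt, token counts, and request counts below are assumptions for illustration only; the token count per request is approximated by counting streamed chunks.

```python
# Hypothetical sketch: measure TTFT and aggregate throughput at a fixed
# concurrency level against a vLLM OpenAI-compatible server.
# Assumptions: a vLLM server is already serving Qwen2.5-7B-Instruct on
# localhost:8000; prompts and counts are placeholders.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

BASE_URL = "http://localhost:8000/v1"   # assumed vLLM endpoint
MODEL = "Qwen/Qwen2.5-7B-Instruct"      # assumed served model name
CONCURRENCY = 8                          # try 8, 16, ...
NUM_REQUESTS = 64
PROMPT = "Write a short introduction to large language models."

client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")


async def one_request(sem: asyncio.Semaphore):
    """Send one streaming request; return (TTFT seconds, approx. output tokens)."""
    async with sem:
        start = time.perf_counter()
        ttft = None
        chunks = 0
        stream = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=256,
            stream=True,
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if ttft is None:
                    ttft = time.perf_counter() - start  # time to first token
                chunks += 1  # roughly one token per streamed chunk (approximation)
        return ttft, chunks


async def main():
    sem = asyncio.Semaphore(CONCURRENCY)  # cap in-flight requests
    bench_start = time.perf_counter()
    results = await asyncio.gather(*(one_request(sem) for _ in range(NUM_REQUESTS)))
    elapsed = time.perf_counter() - bench_start

    total_tokens = sum(c for _, c in results)
    ttfts = [t for t, _ in results if t is not None]
    print(f"concurrency={CONCURRENCY} requests={NUM_REQUESTS}")
    print(f"approx output throughput: {total_tokens / elapsed:.1f} tok/s")
    print(f"mean TTFT: {sum(ttfts) / len(ttfts):.3f} s")


if __name__ == "__main__":
    asyncio.run(main())
```

Running this at concurrency 8 and then 16, you would expect the aggregate tokens/s to saturate around a similar ceiling while the mean TTFT grows, which is the distinction between throughput and latency made above.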
