[ChatQnA] Remove enforce-eager to enable HPU graphs for better vLLM perf #1210
Conversation
Signed-off-by: Wang, Kai Lawrence <[email protected]>
Could you please also help to update the GenAIComps settings? https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/vllm
The test matrix did not include "PT_HPU_LAZY_MODE=0, enforce-eager=1" results? According to https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html, eager mode works best when there are lots of (parallel) requests (and therefore larger batches), i.e. when performance matters most. Was that tested too?
With the latest SW stack version, eager mode still has a perf gap compared with both lazy mode and TorchDynamo mode; see https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html#execution-modes.
The test covers both smaller and larger numbers of concurrent requests for each set of input/output seq lengths, and the performance ratio is the geomean across the different numbers of requests and seq lengths. I think the sentence you quote only compares smaller batches with larger batches, both with HPU Graphs disabled. Increasing the number of requests served at a time tends to increase throughput, while smaller request counts give relatively better latency. Regarding the maximum batch size, we use
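For reference, here is a minimal sketch (placeholder model and port, not the exact arguments used in this PR) of how these execution modes map to launch flags, following the execution-modes table in the linked guide:

```bash
# Illustrative only: execution-mode selection for vLLM on Gaudi.
MODEL=meta-llama/Llama-2-7b-chat-hf   # placeholder 7B model

# HPU Graphs (lazy mode, no --enforce-eager): the default this PR enables.
PT_HPU_LAZY_MODE=1 python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL" --port 8000

# PyTorch lazy mode: lazy backend with graph capture disabled.
PT_HPU_LAZY_MODE=1 python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL" --port 8000 --enforce-eager

# PyTorch eager mode: the "PT_HPU_LAZY_MODE=0, enforce-eager" combination asked about above.
PT_HPU_LAZY_MODE=0 python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL" --port 8000 --enforce-eager

# torch.compile (TorchDynamo) mode.
PT_HPU_LAZY_MODE=0 python -m vllm.entrypoints.openai.api_server \
  --model "$MODEL" --port 8000
```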
@wangkl2 Thanks! I'll update those args for my vLLM enabling PR in GenAIInfra: opea-project/GenAIInfra#610
Signed-off-by: Wang, Kai Lawrence <[email protected]>
[ChatQnA] Remove enforce-eager to enable HPU graphs for better vLLM perf (opea-project#1210) Signed-off-by: Wang, Kai Lawrence <[email protected]> Signed-off-by: Chingis Yundunov <[email protected]>
Description
Remove the `--enforce-eager` flag for the `vllm-gaudi` service to enable HPU graphs optimization by default. This improves both OOB latency and OOB throughput on Gaudi SW 1.18.
Referenced benchmarking result ratios of `llmserve` on a 7B LLM on Gaudi2 before and after this change:
Note: all other parameters are kept consistent, and the geomean is calculated on the perf results normalized to the original setting, measured across input/output seq lengths of 128/128, 128/1024, 1024/128, and 1024/1024.
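As an illustration of the change (a sketch only; the actual compose file in this PR may use different arguments and environment variables), the `vllm-gaudi` server launch goes from enforced eager execution to the default HPU Graphs path simply by dropping the flag:

```bash
# Before (hypothetical invocation): HPU Graphs disabled via --enforce-eager.
python -m vllm.entrypoints.openai.api_server \
  --model "$LLM_MODEL_ID" --tensor-parallel-size 1 --host 0.0.0.0 --port 80 \
  --enforce-eager

# After: --enforce-eager removed, so the Gaudi backend can capture and replay
# HPU Graphs, improving OOB latency and throughput.
python -m vllm.entrypoints.openai.api_server \
  --model "$LLM_MODEL_ID" --tensor-parallel-size 1 --host 0.0.0.0 --port 80
```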
Issues
n/a
Type of change
Dependencies
n/a
Tests
Benchmark with GenAIEval.
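The reported numbers come from GenAIEval. As a rough sanity check only (not the GenAIEval methodology; the endpoint, model name, and prompt below are placeholders), the latency/throughput difference can also be eyeballed by timing a batch of concurrent requests against the OpenAI-compatible endpoint before and after the change:

```bash
# Rough sanity check: fire N concurrent completion requests and time the batch.
N=32
ENDPOINT=http://localhost:8000/v1/completions   # placeholder endpoint
MODEL=meta-llama/Llama-2-7b-chat-hf             # placeholder model

time (
  for i in $(seq "$N"); do
    curl -s "$ENDPOINT" \
      -H "Content-Type: application/json" \
      -d "{\"model\": \"$MODEL\", \"prompt\": \"What is HPU graph capture?\", \"max_tokens\": 128}" \
      > /dev/null &
  done
  wait
)
```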