
[ChatQnA] Remove enforce-eager to enable HPU graphs for better vLLM perf #1210

Merged: 4 commits into opea-project:main on Dec 10, 2024

Conversation

@wangkl2 (Collaborator) commented on Nov 28, 2024

Description

Remove the --enforce-eager flag from the vllm-gaudi service to enable HPU graphs optimization by default. This improves both OOB latency and OOB throughput on Gaudi SW 1.18.
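
For illustration, the change amounts to dropping one flag from the serving command of the vllm-gaudi service. Below is a minimal sketch of the before/after launch command; the model name, port, and other arguments are placeholders, not the exact ChatQnA compose settings:

```bash
# Before: --enforce-eager keeps vLLM in eager execution, so no HPU graphs are captured
# PT_HPU_LAZY_MODE=1 python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Meta-Llama-3-8B-Instruct --port 8007 \
#     --max-num-seqs 256 --enforce-eager

# After: drop --enforce-eager so vLLM can capture and replay HPU graphs
PT_HPU_LAZY_MODE=1 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct --port 8007 \
    --max-num-seqs 256
```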

Referenced benchmarking result ratios of llmserve on a 7B LLM on Gaudi2, before and after this change:
Note: all other parameters are kept consistent. Each geomean is computed over the perf results normalized to the original setting, measured across input/output seq lengths of 128/128, 128/1024, 1024/128, and 1024/1024 (a small sketch of the geomean calculation follows the table).

| Setting | Execution Mode | Geomean of Normalized Avg TTFT | Geomean of Normalized Avg TPOT | Geomean of Normalized Avg Total Latency | Geomean of Normalized Output Tokens/s |
|---|---|---|---|---|---|
| PT_HPU_LAZY_MODE=1, enforce-eager=1, max-num-seqs=256 (the original) | Lazy Mode | 1.00 | 1.00 | 1.00 | 1.00 |
| PT_HPU_LAZY_MODE=1, enforce-eager=0, max-num-seqs=256 | Lazy Mode, with HPU graphs | 1.06 | 0.28 | 0.33 | 3.03 |
| N/A | Perf Improvement with HPU Graphs | 1/1.06 = 0.94X | 1/0.28 = 3.57X | 1/0.33 = 3.03X | 3.03/1 = 3.03X |
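
As a minimal sketch of how the geomean above is formed: each metric is first normalized to the original setting per input/output seq-len combination, then the per-config ratios are combined with a geometric mean. The four ratios below are made-up placeholders, not measured values:

```bash
# geomean(r1..r4) = (r1 * r2 * r3 * r4) ^ (1/4)
echo "0.30 0.27 0.29 0.26" | awk '{ p = 1; for (i = 1; i <= NF; i++) p *= $i; print p ^ (1 / NF) }'
```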

Issues

n/a

Type of change

  • Others (enhancement, documentation, validation, etc.)

Dependencies

n/a

Tests

Benchmarked with GenAIEval.

@wangkl2 requested review from XinyaoWa and lvliang-intel, and removed the review request for lvliang-intel, on Nov 28, 2024 at 07:13
@XinyaoWa (Collaborator) commented:
Could you please also help update the GenAIComps settings? https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/vllm

@eero-t (Contributor) commented on Dec 2, 2024

> Remove the --enforce-eager flag from the vllm-gaudi service to enable HPU graphs optimization by default. This improves both OOB latency and OOB throughput on Gaudi SW 1.18.

Test matrix did not include "PT_HPU_LAZY_MODE=0, enforce-eager=1" results?

According to: https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html

> When there’s large amount of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible.
> ...
> With HPU Graphs disabled, you are trading latency and throughput at lower batches for potentially higher throughput on higher batches. You can do that by adding --enforce-eager flag to server

=> Eager mode works best when there are lots of (parallel) requests (and therefore larger batches) i.e. when performance matters most. Was that tested too?

@wangkl2 (Collaborator, Author) commented on Dec 9, 2024

> Remove the --enforce-eager flag from the vllm-gaudi service to enable HPU graphs optimization by default. This improves both OOB latency and OOB throughput on Gaudi SW 1.18.

> Test matrix did not include "PT_HPU_LAZY_MODE=0, enforce-eager=1" results?

For the latest SW stack version, eager mode still has a performance gap compared with either lazy mode or TorchDynamo mode. Referring to https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html#execution-modes, PT_HPU_LAZY_MODE=0 is highly experimental and only used for functionality testing.

According to: https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html

> When there’s large amount of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible.
> ...
> With HPU Graphs disabled, you are trading latency and throughput at lower batches for potentially higher throughput on higher batches. You can do that by adding --enforce-eager flag to server

> => Eager mode works best when there are lots of (parallel) requests (and therefore larger batches) i.e. when performance matters most. Was that tested too?

The tests cover both smaller and larger numbers of concurrent requests for each input/output seq-len combination, and the reported performance ratio is the geomean over the different numbers of requests and seq lens.

And I think the sentence you quote is only comparing lower batches against higher batches, both with HPU Graphs disabled. Increasing the number of requests served at a time tends to increase throughput, while smaller batches retain relatively better latency.

Regarding the maximum batch size, we use the --max-num-seqs flag to control how many sequences the scheduler processes at once. I have tested that, with either lazy mode alone or lazy mode plus HPU graphs enabled, the current default --max-num-seqs=256 in OPEA is optimal for a 7B model under general generation configs. Enlarging it to even larger values such as 512 (in lazy mode or with HPU graphs enabled) only provides very limited perf improvement for some specific input/output seq lens.
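
As a hedged sketch of how one could re-run that comparison for a particular deployment (model name and port are placeholders again): vary the scheduler batch size via --max-num-seqs, and add --enforce-eager back if the eager-mode trade-off from the quoted docs is preferred for that workload:

```bash
# Try a larger scheduler batch; add --enforce-eager to compare the eager-mode trade-off
PT_HPU_LAZY_MODE=1 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct --port 8007 \
    --max-num-seqs 512 --enforce-eager
```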

@eero-t (Contributor) commented on Dec 9, 2024

@wangkl2 Thanks!

=> I'll update those args for my vLLM enabling PR in "GenAIInfra": opea-project/GenAIInfra#610

@lvliang-intel merged commit 4c01e14 into opea-project:main on Dec 10, 2024
22 checks passed
chyundunovDatamonsters pushed a commit to chyundunovDatamonsters/OPEA-GenAIExamples that referenced this pull request Dec 20, 2024