[Feature Request]: Optimize PagedAttention operation on aarch64 HW #26422
Comments
@dmitry-gorokhov, can we add
Hi @dmitry-gorokhov @rkazants @wenjiew, is anyone working on this issue? If not, can I take it up?
It's yours now, have fun :)
Hi, currently the paged attention executor logic depends on brgemm, which only supports x86-64. To extend support to ARM64 (aarch64), which library should we use? The prospects are: Arm Compute Library (ACL), OpenBLAS, Eigen, TFLite.
Hi @samkitshah1262. oneDNN has a brgemm block implemented for aarch64. The problem is that it supports only SVE512/SVE256.
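For anyone checking which path a given aarch64 machine can take, here is a minimal sketch (assuming a Linux aarch64 toolchain where `HWCAP_SVE` and `PR_SVE_GET_VL` are defined; this is not OpenVINO code) of detecting SVE support and its vector length:

```cpp
#include <sys/auxv.h>
#include <sys/prctl.h>
#include <asm/hwcap.h>
#include <cstdio>

int main() {
    // HWCAP_SVE tells us whether the CPU exposes SVE at all.
    if (!(getauxval(AT_HWCAP) & HWCAP_SVE)) {
        std::printf("No SVE: a NEON-only fallback would be needed here.\n");
        return 0;
    }
    // PR_SVE_GET_VL reports the current vector length; the low 16 bits
    // hold the length in bytes (32 -> SVE256, 64 -> SVE512).
    long ret = prctl(PR_SVE_GET_VL);
    if (ret < 0) {
        std::printf("PR_SVE_GET_VL not supported by this kernel.\n");
        return 1;
    }
    long vl_bytes = ret & 0xffff;
    std::printf("SVE vector length: %ld bits\n", vl_bytes * 8);
    return 0;
}
```

On machines reporting SVE128 (or no SVE), the oneDNN brgemm path described above would not apply, which is why the thread turns to NEON/ACL alternatives.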
@dmitry-gorokhov The implementation of PA on ARM using NEON and ACL is done, except for the MHAHelper class, as the ACL GEMM kernel does not support strides (yet). Could you please review the current changes and provide direction? Thanks!
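For context, a minimal sketch of configuring one attention sub-matmul with ACL's standalone `NEGEMM` API (shapes are illustrative, taken from the analysis later in this thread; this is not the plugin's actual integration). The thread above reports that this kernel cannot yet consume the strided sub-blocks MHAHelper needs:

```cpp
#include <arm_compute/core/TensorInfo.h>
#include <arm_compute/core/TensorShape.h>
#include <arm_compute/core/Types.h>
#include <arm_compute/runtime/NEON/functions/NEGEMM.h>
#include <arm_compute/runtime/Tensor.h>

using namespace arm_compute;

int main() {
    Tensor a, b, dst;
    // q block: 32x128, k block: 128x32 -> 32x32 output, all f16.
    // ACL TensorShape lists the innermost (width) dimension first.
    a.allocator()->init(TensorInfo(TensorShape(128U, 32U), 1, DataType::F16));
    b.allocator()->init(TensorInfo(TensorShape(32U, 128U), 1, DataType::F16));
    dst.allocator()->init(TensorInfo(TensorShape(32U, 32U), 1, DataType::F16));

    NEGEMM gemm;
    gemm.configure(&a, &b, nullptr, &dst, 1.0f, 0.0f);  // dst = 1.0 * a * b

    a.allocator()->allocate();
    b.allocator()->allocate();
    dst.allocator()->allocate();
    gemm.run();
    return 0;
}
```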
Hi @dmitry-gorokhov,
I found that PagedAttention is still falling back to f32 precision. Could you please point me to some documentation on this? Thank you.
@ashwins990 Have you modified this condition: `src/plugins/intel_cpu/src/nodes/paged_attn.cpp`, lines 221 to 223 at commit bd04c5a?
You need to do something like this: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/nodes/scaled_attn.cpp#L1855-L1856
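A hypothetical sketch of the kind of change being suggested (illustrative only; `platform_supports_f16` is a made-up stand-in for the plugin's real capability check, and this is not the actual plugin source):

```cpp
#include <openvino/core/type/element_type.hpp>

// Hypothetical stand-in for the plugin's real hardware-capability check.
static bool platform_supports_f16() { return true; }

// Sketch: choose the runtime precision for the PagedAttention kernels so
// that f16 inputs are not unconditionally downgraded to f32.
ov::element::Type choose_pa_precision(ov::element::Type input_prec) {
    if (input_prec == ov::element::f16 && platform_supports_f16())
        return ov::element::f16;  // keep f16 on capable hardware
    return ov::element::f32;      // conservative fallback
}
```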
This development is related to Feature Request #26422.

## Benchmarking Results

Machine: Graviton 3, 64 cores

#### vLLM serving benchmark on ShareGPT dataset

| Requests / sec | avg TTFT (sec) | avg TPOT (sec) | Output / Total Throughput (tokens/sec) |
| --- | --- | --- | --- |
| 0.2 | 1.153 | 0.186 | 38.73 / 77.79 |
| 0.5 | 2.083 | 0.482 | 94.10 / 187.92 |

#### vLLM Throughput benchmark on ShareGPT dataset

[throughput chart omitted]

## vLLM with OpenVINO backend

Clone the [vLLM repo](https://github.com/vllm-project/vllm). Set the inference precision to f32 before model compilation by setting `Execution Mode` to `ACCURACY`:

```python
# file_path : vllm/model_executor/model_loader/openvino.py
import openvino.properties.hint as hints

ov_core.set_property(
    "CPU",
    {hints.execution_mode: hints.ExecutionMode.ACCURACY},
)
ov_compiled = ov_core.compile_model(pt_model.model, ov_device)
self.ov_request = ov_compiled.create_infer_request()
```

Note: if the inference precision is not set to f32, execution takes the f16 path. On aarch64 this can lead to a segmentation fault due to the optional Alibi parameter present in the transformed graph; optional variables are graph nodes with empty shapes.

After the above change, follow this [link](https://docs.vllm.ai/en/v0.6.3.post1/getting_started/openvino-installation.html) and install vLLM from source.
Is there a way to perform f16 GEMM on ARM for small matrices? ACL NEGEMM has very significant overhead for small matrices. Here is the performance analysis comparison.

Experiment 1: keeping prompt tokens below 32 and generation tokens at 256. [results charts omitted]

Experiment 2: server benchmark on the ShareGPT dataset (prompt and generation tokens in approximately a 1:1 ratio). [results charts omitted]
Analysis: the two GEMM operations performed are q[32 × 128] × k[128 × 32] and w[32 × 32] × v[32 × 128].
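One possible direction, sketched below under stated assumptions (ARMv8.2-A compiled with `+fp16`, row-major data, N divisible by 8): a hand-rolled NEON f16 microkernel for exactly these small shapes, avoiding per-call library overhead. This is a sketch, not the plugin's kernel:

```cpp
// Build with: -march=armv8.2-a+fp16
#include <arm_neon.h>
#include <cstddef>

// C[M x N] = A[M x K] * B[K x N], all row-major f16.
// Suited to the small shapes above, e.g. M=32, K=128, N=32.
void small_gemm_f16(const float16_t* A, const float16_t* B, float16_t* C,
                    size_t M, size_t N, size_t K) {
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; n += 8) {      // 8 f16 lanes per vector
            float16x8_t acc = vdupq_n_f16(0);
            for (size_t k = 0; k < K; ++k) {
                // Broadcast A[m][k], multiply-accumulate a row chunk of B.
                float16x8_t b = vld1q_f16(B + k * N + n);
                acc = vfmaq_n_f16(acc, b, A[m * K + k]);
            }
            vst1q_f16(C + m * N + n, acc);
        }
    }
}
```

Whether this beats NEGEMM would need measuring on the target machine; the point is only that for 32-ish dimensions the fixed setup cost of a library call can dominate.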
@ashwins990 I would recommend creating a PR with the current ACL NEGEMM-based implementation, so we can help with the review and investigate possible solutions.
Request Description
The PagedAttention operation is already implemented within the CPU plugin in C++ and optimized for x64 using avx2/avx512 intrinsics.
The request is to optimize the PA operation for aarch64 using the NEON/SVE extensions.
See the SDPA optimization using NEON for reference; a small illustration of the kind of port follows.
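To illustrate the kind of port being requested, a minimal NEON sketch (illustrative, not the plugin's actual kernel) of an f32 dot product such as the q·k step, analogous to the existing avx2/avx512 intrinsic paths on x64:

```cpp
#include <arm_neon.h>
#include <cstddef>

float dot_f32_neon(const float* a, const float* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)  // 4 f32 lanes per NEON vector
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    float sum = vaddvq_f32(acc);  // horizontal add across lanes
    for (; i < n; ++i)            // scalar tail
        sum += a[i] * b[i];
    return sum;
}
```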
How to build OV on ARM: https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md
Feature Use Case
The PagedAttention operation implements the attention algorithm required for workloads like continuous batching and speculative decoding. PagedAttention is used as the basic attention block in the vLLM OpenVINO backend and under the OpenVINO GenAI API (for some use cases). The PA operation can consume significant execution resources (especially for long contexts), so its optimization is crucial for LLM-based workloads overall.