
[Feature Request]: Optimize PagedAttention operation on aarch64 HW #26422

Open
dmitry-gorokhov opened this issue Sep 4, 2024 · 10 comments
Labels
category: CPU OpenVINO CPU plugin enhancement New feature or request feature New feature request good first issue Good for newcomers platform: arm OpenVINO on ARM / ARM64

Comments

@dmitry-gorokhov
Contributor

dmitry-gorokhov commented Sep 4, 2024

Request Description

The PagedAttention operation is already implemented within the CPU plugin in C++ and optimized for x64 using AVX2/AVX-512 intrinsics.
The request is to optimize the PA operation for aarch64 using the NEON/SVE extensions.

Please refer to the SDPA optimization using NEON for reference.
How to build OpenVINO on ARM: https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md
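
For orientation, the kind of low-level kernel this request targets looks roughly like the NEON sketch below. This is purely illustrative; the real kernels live in the CPU plugin sources, and the function name here is made up.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Illustrative NEON f32 dot product, the basic building block of the
// query*key and weight*value products inside attention kernels.
static float dot_f32_neon(const float* a, const float* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float sum = vaddvq_f32(acc);  // horizontal add (AArch64)
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```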

Feature Use Case

The PagedAttention operation implements the attention algorithm required for workloads such as continuous batching and speculative decoding. PagedAttention is used as the basic attention block in the vLLM OpenVINO backend and under the OpenVINO GenAI API (for some use cases). The PA operation can take significant resources to execute (especially for long contexts), so its optimization is crucial for LLM-based workloads overall.

Issue submission checklist

  • The feature request or improvement must be related to OpenVINO
@dmitry-gorokhov dmitry-gorokhov added enhancement New feature or request feature New feature request platform: arm OpenVINO on ARM / ARM64 labels Sep 4, 2024
@rkazants
Member

rkazants commented Sep 5, 2024

@dmitry-gorokhov, can we add the good-first-issue label here?

@wenjiew wenjiew added the good first issue Good for newcomers label Sep 6, 2024
@github-project-automation github-project-automation bot moved this to Contributors Needed in Good first issues Sep 6, 2024
@samkitshah1262

Hi @dmitry-gorokhov @rkazants @wenjiew, is anyone working on this issue? If not, can I take it up?

@mlukasze mlukasze moved this from Contributors Needed to Assigned in Good first issues Sep 30, 2024
@mlukasze
Contributor

It's yours now, have fun :)

@dmitry-gorokhov dmitry-gorokhov added the category: CPU OpenVINO CPU plugin label Sep 30, 2024
@samkitshah1262

Hi, currently the PagedAttention executor logic depends on BRGEMM (which only supports x86-64), so to extend support to ARM64 (aarch64), which library should we use? The prospects are: Arm Compute Library (ACL), OpenBLAS, Eigen, and TFLite.

@dmitry-gorokhov
Contributor Author

Hi @samkitshah1262. oneDNN has a BRGEMM block implemented for aarch64. The problem is that it supports only SVE512/SVE256.
So in order to cover NEON-powered devices (like Apple silicon) we need to use another backend. I would propose taking a look at the ACL NEGEMM kernel. It was already successfully applied for the same purpose inside the SDPA op: #25183
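
For anyone picking this up, a minimal standalone sketch of driving ACL's NEGEMM is shown below. The shapes are illustrative only, and the real integration would go through the plugin's executor infrastructure, similar to the SDPA change linked above.

```cpp
#include <arm_compute/core/TensorInfo.h>
#include <arm_compute/core/TensorShape.h>
#include <arm_compute/runtime/NEON/functions/NEGEMM.h>
#include <arm_compute/runtime/Tensor.h>

int main() {
    using namespace arm_compute;

    // Illustrative shapes: one query block [32 x 128] times K^T [128 x 32].
    // ACL TensorShape is (width, height), i.e. (cols, rows).
    Tensor a, b, dst;
    a.allocator()->init(TensorInfo(TensorShape(128U, 32U), 1, DataType::F32));
    b.allocator()->init(TensorInfo(TensorShape(32U, 128U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(32U, 32U), 1, DataType::F32));

    // Configure once, run many times: dst = alpha * a * b (no bias tensor here).
    NEGEMM gemm;
    gemm.configure(&a, &b, nullptr, &dst, 1.0f, 0.0f);

    a.allocator()->allocate();
    b.allocator()->allocate();
    dst.allocator()->allocate();

    // ... fill a and b with query/key data, then:
    gemm.run();
    return 0;
}
```

The key design point, as discussed later in this thread, is to configure the function once and reuse it, since the per-call setup cost dominates for small matrices.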

@samkitshah1262

@dmitry-gorokhov The implementation of PA on ARM using NEON and ACL is done, except for the MHAHelper class, as the ACL GEMM kernel does not support strides (yet). Could you please review the current changes and provide direction? Thanks!

@ashwins990
Contributor

Hi @dmitry-gorokhov,
I am working towards enabling f16 precision for PagedAttention on ARM.
To enable the f16 PagedAttention node, I made the changes below:

  • setting hints::inference_precision to f16
  • forcing f16 infer_precision in intel_cpu/src/graph.cpp
  • made changes to transformation_pipeline.cpp

I found that PagedAttention still falls back to f32 precision. Could you please point me to some documentation on this? Thank you.

@dmitry-gorokhov
Contributor Author

dmitry-gorokhov commented Jan 3, 2025

@ashwins990 Have you modified this condition?

```cpp
} else if (rtPrecision == ov::element::f16 && ov::with_cpu_x86_avx512_core_fp16()) {
    rtPrecision = ov::element::f16;
} else {
```

It explicitly checks for the AVX-512 ISA.
You need to do something like this: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/nodes/scaled_attn.cpp#L1855-L1856
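
For illustration only, the relaxed condition could look something like the sketch below, assuming a hardware-support helper along the lines of the one used in scaled_attn.cpp; treat the linked lines as the authoritative reference.

```cpp
// Hypothetical sketch: accept f16 whenever the CPU reports hardware support for it
// (which covers fp16 on aarch64), instead of hard-coding the avx512_core_fp16 check.
} else if (rtPrecision == ov::element::f16 && ov::intel_cpu::hasHardwareSupport(ov::element::f16)) {
    rtPrecision = ov::element::f16;
} else {
```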

github-merge-queue bot pushed a commit that referenced this issue Jan 10, 2025
This development is related to Feature Request #26422.

## Benchmarking Results 
Machine: Graviton 3 (64 cores)
#### vLLM serving benchmark on ShareGPT dataset

| Requests / sec | avg TTFT (sec) | avg TPOT (sec) | Output / Total Throughput (tokens/sec) |
| --- | --- | --- | --- |
| 0.2 | 1.153 | 0.186 | 38.73 / 77.79 |
| 0.5 | 2.083 | 0.482 | 94.10 / 187.92 |

#### vLLM throughput benchmark on ShareGPT dataset

![plot_aarch64_sve](https://github.com/user-attachments/assets/54c49326-2500-4744-9b4b-72c61f1db2af)

## vLLM with openvino backend

Clone the [vLLM repo](https://github.com/vllm-project/vllm).
Set the inference precision to f32 before model compilation by setting
`execution_mode` to `ACCURACY`:
```python
# file_path: vllm/model_executor/model_loader/openvino.py
import openvino.properties.hint as hints

ov_core.set_property(
    "CPU",
    {hints.execution_mode: hints.ExecutionMode.ACCURACY},
)
ov_compiled = ov_core.compile_model(pt_model.model, ov_device)
self.ov_request = ov_compiled.create_infer_request()
```
Note: If we don't set the inference precision to f32, the f16 precision path is taken. On aarch64 this can lead to a segmentation fault due to the presence of an optional variable (the alibi parameter) in the transformed graph. Optional variables are graph nodes with empty shapes.
 
After the above change, follow this
[link](https://docs.vllm.ai/en/v0.6.3.post1/getting_started/openvino-installation.html)
and install vLLM from source.
MirceaDan99 pushed a commit to MirceaDan99/openvino that referenced this issue Jan 22, 2025
@ashwins990
Contributor

ashwins990 commented Feb 12, 2025

Hi @dmitry-gorokhov

Is there a way to perform f16 GEMM on ARM for small matrices? ACL's NEGEMM has very significant overhead for small matrices.

Here is the performance comparison.
Model: LLAMA2-7B on Graviton 3E

Experiment 1: prompt tokens kept under 32, generation tokens 256.

We get a 70% improvement in total throughput over the f32 implementation.

Experiment 2: server benchmark on the ShareGPT dataset (prompt and generation token counts at an approximately 1:1 ratio).

| Variant | Total throughput | Throughput vs f32 | TTFT vs f32 | TPOT vs f32 |
| --- | --- | --- | --- | --- |
| f32 with BRGEMM | 327 | reference | reference | reference |
| f16 with ACL | 232 | -29% | -650% | -354% |
| f16 with custom GEMM | 318 | -3% | -8% | -2% |

Analysis

The two GEMM operations performed are q[32 x 128] × k[128 x 32] and w[32 x 32] × v[32 x 128].
The ACL NEGEMM used for these multiplications carries an initialization penalty (especially for small matrices) when invoked repeatedly.
Therefore a very basic matrix multiplication is able to produce better results than NEGEMM in this case.
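
For context, the "very basic matrix multiplication" meant here could be as simple as the NEON fp16 sketch below. It is purely illustrative (row-major operands, N divisible by 8, ARMv8.2-A target with the fp16 vector extension assumed); the actual kernel in the PR may differ.

```cpp
#include <arm_neon.h>
#include <cstddef>

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
// C[M x N] = A[M x K] * B[K x N], all row-major fp16, N assumed to be a multiple of 8.
// For the shapes discussed here (e.g. 32x128 * 128x32) the whole loop nest stays in cache,
// so a plain broadcast-and-FMA kernel avoids any per-call setup cost.
static void gemm_f16_neon(const __fp16* A, const __fp16* B, __fp16* C,
                          size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; j += 8) {
            float16x8_t acc = vdupq_n_f16(0);
            for (size_t k = 0; k < K; ++k) {
                const float16x8_t b = vld1q_f16(B + k * N + j);
                acc = vfmaq_f16(acc, b, vdupq_n_f16(A[i * K + k]));
            }
            vst1q_f16(C + i * N + j, acc);
        }
    }
}
#endif
```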

@dmitry-gorokhov
Contributor Author

@ashwins990 I would recommend creating a PR with the current ACL NEGEMM-based implementation so we can help with the review and with investigating possible solutions.
AFAIK, NEGEMM is the recommended ACL primitive for matrix multiplication, so the key is to avoid the initialization overhead on each iteration (if possible).
An alternative solution might be the KleidiAI microkernel library, whose integration we are finalizing now.
