
[Feature Request]: Optimize PagedAttention operation on aarch64 HW #26422

Open
dmitry-gorokhov opened this issue Sep 4, 2024 · 10 comments
Labels
category: CPU OpenVINO CPU plugin enhancement New feature or request feature New feature request good first issue Good for newcomers platform: arm OpenVINO on ARM / ARM64

Comments

@dmitry-gorokhov
Contributor

dmitry-gorokhov commented Sep 4, 2024

Request Description

The PagedAttention operation is already implemented within the CPU plugin in C++ and optimized for x64 using AVX2/AVX-512 intrinsics.
The request is to optimize the PA operation for aarch64 using the NEON/SVE extensions.

Please refer to the SDPA optimization using NEON for reference.
How to build OpenVINO on ARM: https://github.com/openvinotoolkit/openvino/blob/master/docs/dev/build.md
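
For orientation, the kind of low-level kernel this request targets looks roughly like the NEON sketch below. This is purely illustrative; the real kernels live in the CPU plugin sources, and the function name here is made up.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Illustrative NEON f32 dot product, the basic building block of the
// query*key and weight*value products inside attention kernels.
static float dot_f32_neon(const float* a, const float* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float sum = vaddvq_f32(acc);  // horizontal add (AArch64)
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```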

Feature Use Case

The PagedAttention operation implements the attention algorithm required for workloads such as continuous batching and speculative decoding. PagedAttention is used as the basic attention block in the vLLM OpenVINO backend and under the OpenVINO GenAI API (for some use cases). The PA operation can take significant resources to execute (especially for long contexts), so its optimization is crucial for LLM-based workloads overall.

Issue submission checklist

  • The feature request or improvement must be related to OpenVINO
@dmitry-gorokhov dmitry-gorokhov added enhancement New feature or request feature New feature request platform: arm OpenVINO on ARM / ARM64 labels Sep 4, 2024
@rkazants
Member

rkazants commented Sep 5, 2024

@dmitry-gorokhov, can we add the good-first-issue label here?

@wenjiew wenjiew added the good first issue Good for newcomers label Sep 6, 2024
@github-project-automation github-project-automation bot moved this to Contributors Needed in Good first issues Sep 6, 2024
@samkitshah1262

Hi @dmitry-gorokhov @rkazants @wenjiew, is anyone working on this issue? If not, can I take it up?

@mlukasze mlukasze moved this from Contributors Needed to Assigned in Good first issues Sep 30, 2024
@mlukasze
Contributor

It's yours now, have fun :)

@dmitry-gorokhov dmitry-gorokhov added the category: CPU OpenVINO CPU plugin label Sep 30, 2024
@samkitshah1262

Hi, currently the PagedAttention executor logic depends on BRGEMM (which only supports x86-64), so to extend support to ARM64 (aarch64), which library should we use? The prospects are: Arm Compute Library (ACL), OpenBLAS, Eigen, and TFLite.

@dmitry-gorokhov
Contributor Author

Hi @samkitshah1262. oneDNN has a BRGEMM block implemented for aarch64. The problem is that it supports only SVE512/SVE256.
So in order to cover NEON-powered devices (like Apple silicon) we need to use another backend. I would propose taking a look at the ACL NEGEMM kernel. It was already successfully applied for the same purpose inside the SDPA op: #25183
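
For anyone picking this up, a minimal standalone sketch of driving ACL's NEGEMM is shown below. The shapes are illustrative only, and the real integration would go through the plugin's executor infrastructure, similar to the SDPA change linked above.

```cpp
#include <arm_compute/core/TensorInfo.h>
#include <arm_compute/core/TensorShape.h>
#include <arm_compute/runtime/NEON/functions/NEGEMM.h>
#include <arm_compute/runtime/Tensor.h>

int main() {
    using namespace arm_compute;

    // Illustrative shapes: one query block [32 x 128] times K^T [128 x 32].
    // ACL TensorShape is (width, height), i.e. (cols, rows).
    Tensor a, b, dst;
    a.allocator()->init(TensorInfo(TensorShape(128U, 32U), 1, DataType::F32));
    b.allocator()->init(TensorInfo(TensorShape(32U, 128U), 1, DataType::F32));
    dst.allocator()->init(TensorInfo(TensorShape(32U, 32U), 1, DataType::F32));

    // Configure once, run many times: dst = alpha * a * b (no bias tensor here).
    NEGEMM gemm;
    gemm.configure(&a, &b, nullptr, &dst, 1.0f, 0.0f);

    a.allocator()->allocate();
    b.allocator()->allocate();
    dst.allocator()->allocate();

    // ... fill a and b with query/key data, then:
    gemm.run();
    return 0;
}
```

The key design point, as discussed later in this thread, is to configure the function once and reuse it, since the per-call setup cost dominates for small matrices.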

@samkitshah1262

@dmitry-gorokhov The implementation of PA on ARM using NEON and ACL is done, except for the MHAHelper class, as the ACL GEMM kernel does not support strides (yet). Could you please review the current changes and provide direction? Thanks!

@ashwins990
Contributor

Hi @dmitry-gorokhov,
I am working towards enabling f16 precision for PagedAttention on ARM.
To enable the f16 PagedAttention node, I made the changes below:

  • setting hints::inference_precision to f16
  • forcing f16 infer_precision in intel_cpu/src/graph.cpp
  • made changes to transformation_pipeline.cpp

I found that PagedAttention still falls back to f32 precision. Could you please point me to some documentation on this? Thank you.

@dmitry-gorokhov
Contributor Author

dmitry-gorokhov commented Jan 3, 2025

@ashwins990 Have you modified this condition?

```cpp
} else if (rtPrecision == ov::element::f16 && ov::with_cpu_x86_avx512_core_fp16()) {
    rtPrecision = ov::element::f16;
} else {
```

It explicitly checks for the AVX-512 ISA.
You need to do something like this: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_cpu/src/nodes/scaled_attn.cpp#L1855-L1856
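
For illustration only, the relaxed condition could look something like the sketch below, assuming a hardware-support helper along the lines of the one used in scaled_attn.cpp; treat the linked lines as the authoritative reference.

```cpp
// Hypothetical sketch: accept f16 whenever the CPU reports hardware support for it
// (which covers fp16 on aarch64), instead of hard-coding the avx512_core_fp16 check.
} else if (rtPrecision == ov::element::f16 && ov::intel_cpu::hasHardwareSupport(ov::element::f16)) {
    rtPrecision = ov::element::f16;
} else {
```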

github-merge-queue bot pushed a commit that referenced this issue Jan 10, 2025
This development is related to Feature Request #26422.

## Benchmarking Results 
Machine: Graviton 3 (64 cores)
#### vLLM serving benchmark on ShareGPT dataset

| Requests / sec | avg TTFT (sec) | avg TPOT (sec) | Output / Total Throughput (tokens/sec) |
| --- | --- | --- | --- |
| 0.2 | 1.153 | 0.186 | 38.73 / 77.79 |
| 0.5 | 2.083 | 0.482 | 94.10 / 187.92 |

#### vLLM throughput benchmark on ShareGPT dataset

![plot_aarch64_sve](https://github.com/user-attachments/assets/54c49326-2500-4744-9b4b-72c61f1db2af)

## vLLM with openvino backend

Clone the [vLLM repo](https://github.com/vllm-project/vllm).
Set the inference precision to f32 before model compilation by setting
`execution_mode` to `ACCURACY`:
```python
# file_path: vllm/model_executor/model_loader/openvino.py
import openvino.properties.hint as hints

ov_core.set_property(
    "CPU",
    {hints.execution_mode: hints.ExecutionMode.ACCURACY},
)
ov_compiled = ov_core.compile_model(pt_model.model, ov_device)
self.ov_request = ov_compiled.create_infer_request()
```
Note: If we don't set the inference precision to f32, the f16 precision path is taken. On aarch64 this can lead to a segmentation fault due to the presence of an optional variable (the alibi parameter) in the transformed graph. Optional variables are graph nodes with empty shapes.
 
After the above change, follow this
[link](https://docs.vllm.ai/en/v0.6.3.post1/getting_started/openvino-installation.html)
and install vLLM from source.
MirceaDan99 pushed a commit to MirceaDan99/openvino that referenced this issue Jan 22, 2025
@ashwins990
Contributor

ashwins990 commented Feb 12, 2025

Hi @dmitry-gorokhov

Is there a way to perform f16 GEMM on ARM for small matrices? ACL's NEGEMM has very significant overhead for small matrices.

Here is the performance comparison.
Model: LLAMA2-7B on Graviton 3E

Experiment 1: prompt tokens kept under 32, generation tokens 256.

We get a 70% improvement in total throughput over the f32 implementation.

Experiment 2: server benchmark on the ShareGPT dataset (prompt and generation token counts at an approximately 1:1 ratio).

| Variant | Total throughput | Throughput vs f32 | TTFT vs f32 | TPOT vs f32 |
| --- | --- | --- | --- | --- |
| f32 with BRGEMM | 327 | reference | reference | reference |
| f16 with ACL | 232 | -29% | -650% | -354% |
| f16 with custom GEMM | 318 | -3% | -8% | -2% |

Analysis

The two GEMM operations performed are q[32 x 128] × k[128 x 32] and w[32 x 32] × v[32 x 128].
The ACL NEGEMM used for these multiplications carries an initialization penalty (especially for small matrices) when invoked repeatedly.
Therefore a very basic matrix multiplication is able to produce better results than NEGEMM in this case.
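
For context, the "very basic matrix multiplication" meant here could be as simple as the NEON fp16 sketch below. It is purely illustrative (row-major operands, N divisible by 8, ARMv8.2-A target with the fp16 vector extension assumed); the actual kernel in the PR may differ.

```cpp
#include <arm_neon.h>
#include <cstddef>

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
// C[M x N] = A[M x K] * B[K x N], all row-major fp16, N assumed to be a multiple of 8.
// For the shapes discussed here (e.g. 32x128 * 128x32) the whole loop nest stays in cache,
// so a plain broadcast-and-FMA kernel avoids any per-call setup cost.
static void gemm_f16_neon(const __fp16* A, const __fp16* B, __fp16* C,
                          size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; j += 8) {
            float16x8_t acc = vdupq_n_f16(0);
            for (size_t k = 0; k < K; ++k) {
                const float16x8_t b = vld1q_f16(B + k * N + j);
                acc = vfmaq_f16(acc, b, vdupq_n_f16(A[i * K + k]));
            }
            vst1q_f16(C + i * N + j, acc);
        }
    }
}
#endif
```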

@dmitry-gorokhov
Contributor Author

@ashwins990 I would recommend creating a PR with the current ACL NEGEMM-based implementation so we can help with the review and with investigating possible solutions.
AFAIK, NEGEMM is the recommended ACL primitive for matrix multiplication, so the key is to avoid the initialization overhead on each iteration (if possible).
An alternative solution might be the KleidiAI microkernel library, whose integration we are finalizing now.
