
optimize gqa cpu #20598

Merged
14 commits merged into main on May 8, 2024
Conversation

@yufenglee (Member) commented May 7, 2024

Description

Optimize the GQA implementation on CPU. The main optimizations are:

  1. compute attention over the real total sequence length instead of the maximum sequence length when past/present share the same buffer
  2. remove the explicit attention mask
  3. remove the transpose after the attention × value computation

This improves the phi3 model https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py with max sequence length 2k/4k from 10 to 20 tokens per second.
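The three optimizations above can be sketched in NumPy. This is a hypothetical illustration, not the actual ONNX Runtime kernel: the function name, shapes, and layout are assumptions chosen to make the three ideas visible (slicing the shared past/present cache to the real total length, enforcing causality by limiting each row's softmax to its valid prefix instead of materializing a mask, and writing the output directly in the final layout so no transpose follows the attention × value product).

```python
import numpy as np

def gqa_attention(q, k_cache, v_cache, total_seq_len):
    """Hypothetical sketch of the optimizations listed above.

    q:        (num_q_heads, q_len, head_dim)        new query tokens
    k_cache:  (num_kv_heads, max_seq_len, head_dim)  past/present share this buffer
    v_cache:  (num_kv_heads, max_seq_len, head_dim)
    total_seq_len: valid tokens in the cache (<= max_seq_len)
    """
    num_q_heads, q_len, head_dim = q.shape
    num_kv_heads = k_cache.shape[0]
    group = num_q_heads // num_kv_heads      # q heads sharing one kv head
    scale = 1.0 / np.sqrt(head_dim)
    past_len = total_seq_len - q_len

    # Optimization 3: write output directly in (q_len, num_q_heads, head_dim)
    # layout so no transpose is needed after attention x value.
    out = np.zeros((q_len, num_q_heads, head_dim), dtype=q.dtype)

    for h in range(num_q_heads):
        kv = h // group
        # Optimization 1: slice the cache to the real total sequence length
        # instead of attending over max_seq_len.
        k = k_cache[kv, :total_seq_len]      # (total_seq_len, head_dim)
        v = v_cache[kv, :total_seq_len]
        scores = (q[h] @ k.T) * scale        # (q_len, total_seq_len)
        for i in range(q_len):
            # Optimization 2: no explicit mask tensor; causality comes from
            # restricting row i's softmax to the tokens it may attend to.
            valid = past_len + i + 1
            row = scores[i, :valid]
            row = np.exp(row - row.max())
            out[i, h] = (row / row.sum()) @ v[:valid]
    return out
```

Restricting the softmax to the valid prefix is numerically equivalent to adding a -inf mask over the padded region, but it skips both the mask memory traffic and the wasted FLOPs on padding, which matters when max sequence length (2k/4k here) far exceeds the real total length.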

Motivation and Context

@yufenglee yufenglee marked this pull request as ready for review May 7, 2024 21:47
@sophies927 sophies927 added the triage:approved Approved for cherrypicks for release label May 7, 2024
@yufenglee yufenglee merged commit 156d521 into main May 8, 2024
90 of 94 checks passed
@yufenglee yufenglee deleted the yufeng/gqa_cpu_opt branch May 8, 2024 17:42
@yihonglyu yihonglyu added the cherry-picked Cherry-picked for a cherrypicks branch label May 9, 2024
yihonglyu pushed a commit that referenced this pull request May 9, 2024
@yihonglyu yihonglyu added the rel-merged Cherrypicks merged into release label May 10, 2024
hanbitmyths pushed a commit that referenced this pull request May 18, 2024
### Description
This PR adds support for creating models with GroupQueryAttention (GQA)
that run on CPU.

### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA
for CUDA only. With the recently added support for [GQA on
CPU](#20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](#20598).
poweiw pushed a commit to poweiw/onnxruntime that referenced this pull request Jun 25, 2024
Labels
- cherry-picked (Cherry-picked for a cherrypicks branch)
- rel-merged (Cherrypicks merged into release)
- release:1.18.0
- triage:approved (Approved for cherrypicks for release)
4 participants