
optimize gqa cpu #20598

Merged
14 commits merged into main on May 8, 2024
Conversation

@yufenglee (Member) commented May 7, 2024

Description

Optimize the GQA implementation on CPU. The main optimizations are:

  1. compute attention over the real total sequence length instead of the maximum sequence length when past/present share the same buffer
  2. remove the explicit attention mask
  3. remove the transpose after the attention × value computation

This improves the phi3 model https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py with max sequence length 2k/4k from 10 to 20 tokens per second.
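The three optimizations above can be sketched in NumPy. This is a hypothetical illustration, not the actual ONNX Runtime kernel: the function name, shapes, and layout are assumptions chosen to make the three ideas visible (slicing the shared past/present cache to the real total length, enforcing causality by limiting each row's softmax to its valid prefix instead of materializing a mask, and writing the output directly in the final layout so no transpose follows the attention × value product).

```python
import numpy as np

def gqa_attention(q, k_cache, v_cache, total_seq_len):
    """Hypothetical sketch of the optimizations listed above.

    q:        (num_q_heads, q_len, head_dim)        new query tokens
    k_cache:  (num_kv_heads, max_seq_len, head_dim)  past/present share this buffer
    v_cache:  (num_kv_heads, max_seq_len, head_dim)
    total_seq_len: valid tokens in the cache (<= max_seq_len)
    """
    num_q_heads, q_len, head_dim = q.shape
    num_kv_heads = k_cache.shape[0]
    group = num_q_heads // num_kv_heads      # q heads sharing one kv head
    scale = 1.0 / np.sqrt(head_dim)
    past_len = total_seq_len - q_len

    # Optimization 3: write output directly in (q_len, num_q_heads, head_dim)
    # layout so no transpose is needed after attention x value.
    out = np.zeros((q_len, num_q_heads, head_dim), dtype=q.dtype)

    for h in range(num_q_heads):
        kv = h // group
        # Optimization 1: slice the cache to the real total sequence length
        # instead of attending over max_seq_len.
        k = k_cache[kv, :total_seq_len]      # (total_seq_len, head_dim)
        v = v_cache[kv, :total_seq_len]
        scores = (q[h] @ k.T) * scale        # (q_len, total_seq_len)
        for i in range(q_len):
            # Optimization 2: no explicit mask tensor; causality comes from
            # restricting row i's softmax to the tokens it may attend to.
            valid = past_len + i + 1
            row = scores[i, :valid]
            row = np.exp(row - row.max())
            out[i, h] = (row / row.sum()) @ v[:valid]
    return out
```

Restricting the softmax to the valid prefix is numerically equivalent to adding a -inf mask over the padded region, but it skips both the mask memory traffic and the wasted FLOPs on padding, which matters when max sequence length (2k/4k here) far exceeds the real total length.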

Motivation and Context

@yufenglee yufenglee marked this pull request as ready for review May 7, 2024 21:47
@sophies927 sophies927 added the triage:approved Approved for cherrypicks for release label May 7, 2024
@yufenglee yufenglee merged commit 156d521 into main May 8, 2024
90 of 94 checks passed
@yufenglee yufenglee deleted the yufeng/gqa_cpu_opt branch May 8, 2024 17:42
@yihonglyu yihonglyu added the cherry-picked Cherry-picked for a cherrypicks branch label May 9, 2024
yihonglyu pushed a commit that referenced this pull request May 9, 2024
@yihonglyu yihonglyu added the rel-merged Cherrypicks merged into release label May 10, 2024
hanbitmyths pushed a commit that referenced this pull request May 18, 2024
### Description
This PR adds support for creating models with GroupQueryAttention (GQA)
that run on CPU.

### Motivation and Context
Previously, the LLaMA scripts supported creating models that have GQA
for CUDA only. With the recently added support for [GQA on
CPU](#20299), models where
`num_attention_heads != num_key_value_heads` can now use the GQA op and
[run much faster on
CPU](#20598).
poweiw pushed a commit to poweiw/onnxruntime that referenced this pull request Jun 25, 2024
Labels
- cherry-picked (Cherry-picked for a cherrypicks branch)
- rel-merged (Cherrypicks merged into release)
- release:1.18.0
- triage:approved (Approved for cherrypicks for release)
4 participants