MoE Gemm perf tuning #20541

Merged
merged 8 commits into main from wangye/moe_gemm_tuning
May 6, 2024
Conversation

wangyems
Contributor

@wangyems wangyems commented May 2, 2024

Description

This PR adds profiling and tuning of MoE Gemm kernels during the very first run and stores the best configuration for reuse in subsequent runs. The Gemm id (the int64_t key into the config map) is determined by the number of rows (total_rows), gemm_n and gemm_k for each type.
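For readers who want a picture of the profile-once, reuse-afterwards flow, here is a minimal sketch. It is not the code in this PR: `GemmConfig`, `GemmConfigCache` and `GetOrProfile` are hypothetical names, the candidate list is supplied by the caller, and host-side `std::chrono` timing stands in for whatever GPU timing the real kernels use.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <limits>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a kernel tile/stage configuration.
struct GemmConfig {
  int tile_id;
};

// Hypothetical cache: maps a Gemm id to the fastest config seen so far.
class GemmConfigCache {
 public:
  // Returns the cached config for `key`. On the first call for a key,
  // times every candidate via `run` and stores the fastest one.
  // Assumption: `run` blocks until the kernel has finished.
  GemmConfig GetOrProfile(int64_t key,
                          const std::vector<GemmConfig>& candidates,
                          const std::function<void(const GemmConfig&)>& run) {
    auto it = best_.find(key);
    if (it != best_.end()) return it->second;  // later runs: no profiling

    assert(!candidates.empty());
    GemmConfig best = candidates.front();
    double best_seconds = std::numeric_limits<double>::max();
    for (const GemmConfig& candidate : candidates) {
      const auto start = std::chrono::steady_clock::now();
      run(candidate);
      const std::chrono::duration<double> elapsed =
          std::chrono::steady_clock::now() - start;
      if (elapsed.count() < best_seconds) {
        best_seconds = elapsed.count();
        best = candidate;
      }
    }
    best_.emplace(key, best);
    ++profiled_;
    return best;
  }

  // Number of distinct keys that have been profiled.
  int profiled_count() const { return profiled_; }

 private:
  std::unordered_map<int64_t, GemmConfig> best_;
  int profiled_ = 0;
};
```

The first call for a given key pays the cost of running every candidate; every later call with the same key is a single map lookup.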

The upper 32 bits hold total_rows, the next 16 bits hold gemm_n, and the lowest 16 bits hold gemm_k:
int64_t key = total_rows;
key = key << 16 | gemm_n;
key = key << 16 | gemm_k;
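The packing above can be wrapped in a small helper with a matching decoder, which makes the bit layout easy to check. This is a sketch only: `GemmShape`, `PackGemmKey` and `UnpackGemmKey` are hypothetical names, and the range checks are an assumption about what the layout requires (gemm_n and gemm_k must each fit in 16 bits, or they would overwrite neighbouring fields).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three values that identify a Gemm.
struct GemmShape {
  int64_t total_rows;
  int64_t gemm_n;
  int64_t gemm_k;
};

// Packs the shape as [total_rows : 32 bits][gemm_n : 16 bits][gemm_k : 16 bits].
inline int64_t PackGemmKey(const GemmShape& shape) {
  // Assumed preconditions: each field must fit in its slot.
  assert(shape.total_rows >= 0 && shape.total_rows < (int64_t{1} << 31));
  assert(shape.gemm_n >= 0 && shape.gemm_n <= 0xFFFF);
  assert(shape.gemm_k >= 0 && shape.gemm_k <= 0xFFFF);

  int64_t key = shape.total_rows;
  key = (key << 16) | shape.gemm_n;  // same operations as in the description
  key = (key << 16) | shape.gemm_k;
  return key;
}

// Inverse of PackGemmKey, useful for logging or debugging a stored key.
inline GemmShape UnpackGemmKey(int64_t key) {
  GemmShape shape;
  shape.gemm_k = key & 0xFFFF;
  shape.gemm_n = (key >> 16) & 0xFFFF;
  shape.total_rows = key >> 32;
  return shape;
}
```

Swapping gemm_n and gemm_k yields a different key, so the two Gemms of an MoE layer get separate cache entries even when total_rows is the same.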

Mixtral-fp16 on 2 A100 GPUs with tp=2, batch size = 1, seq_len = 1k:

|  | Prompt | Token |
| :--- | :---: | ---: |
| before | 138ms | 16.4ms |
| after | 100ms | 13.9ms |

Motivation and Context

@wangyems wangyems requested a review from tianleiwu May 2, 2024 02:24
@wangyems wangyems merged commit ae6195b into main May 6, 2024
94 of 95 checks passed
@wangyems wangyems deleted the wangye/moe_gemm_tuning branch May 6, 2024 21:40
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
poweiw pushed a commit to poweiw/onnxruntime that referenced this pull request Jun 25, 2024