Is your feature request related to a problem? Please describe.
Grouped GEMM using CUTLASS is ~30% slower than a for-loop of cuBLAS GEMMs on SM90 (H100). Implementations of grouped GEMM using both CUTLASS and cuBLAS can be found here: https://github.com/tgale96/grouped_gemm/blob/main/csrc/grouped_gemm.cu
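For reference, a minimal sketch of the "for-loop of cuBLAS GEMMs" baseline described above (not the linked repo's code; the function name, shapes, and bf16/fp32 types are illustrative assumptions):

```cpp
#include <vector>
#include <cublas_v2.h>
#include <cuda_bf16.h>

// Runs one cuBLAS GEMM per expert: C[i] = A[i] * B[i].
// Inputs are bf16 with fp32 accumulation; a, b, c are device pointers and
// m/n/k give each expert's problem size.
void grouped_gemm_cublas_loop(cublasHandle_t handle,
                              const std::vector<const __nv_bfloat16*>& a,
                              const std::vector<const __nv_bfloat16*>& b,
                              const std::vector<__nv_bfloat16*>& c,
                              const std::vector<int>& m,
                              const std::vector<int>& n,
                              const std::vector<int>& k) {
  const float alpha = 1.0f, beta = 0.0f;
  for (size_t i = 0; i < a.size(); ++i) {
    // cuBLAS is column-major; leading dimensions assume densely packed,
    // non-transposed operands (A is m x k, B is k x n, C is m x n).
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m[i], n[i], k[i],
                 &alpha,
                 a[i], CUDA_R_16BF, m[i],
                 b[i], CUDA_R_16BF, k[i],
                 &beta,
                 c[i], CUDA_R_16BF, m[i],
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
  }
}
```

Each GEMM here is launched separately, so the comparison in the report is between one fused CUTLASS grouped kernel (built for SM80) and many individual cuBLAS calls that are already tuned for SM90.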
Describe the solution you'd like
Consider adding SM90 support to the grouped GEMM kernel built on CUTLASS; it currently targets SM80. Grouped GEMM is important for training MoE models.