
[QST] Is this the complete set of valid parameters for performing fp16 matrix multiplication using tensor cores? #1304

Closed
zwshan opened this issue Jan 16, 2024 · 14 comments
Labels: question

Comments

zwshan commented Jan 16, 2024

What is your question?
This page lists many parameters; may I ask whether the ones listed there are already all of the valid parameters?

thakkarV (Collaborator) commented Jan 16, 2024

No, they are not the full set of parameters the API supports. Generally speaking, the set of all valid template parameters supported by any kernel is so large, due to combinatorial explosion, that no amount of testing can cover it, and the CUTLASS library cannot generate every valid configuration.

zwshan (Author) commented Jan 17, 2024

> No, they are not the full set of parameters the API supports. Generally speaking, the set of all valid template parameters supported by any kernel is so large, due to combinatorial explosion, that no amount of testing can cover it, and the CUTLASS library cannot generate every valid configuration.

I have noticed that when performing matrix multiplication on an A100, the computation for MNK = (1024, 150, 256) and MNK = (1024, 1, 256) is significantly slower than cuBLAS. I have tried all the parameters listed on the following website, but I still can't match or exceed cuBLAS performance. What should I do now?

zwshan (Author) commented Jan 17, 2024

To clarify: by MNK I mean the matrix multiplication of an (M, K) matrix with an (N, K) matrix, i.e. the second operand is used transposed (C = A * B^T).
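
A note on how this maps onto CUTLASS (a minimal sketch; the variable names are illustrative and not from the post): a row-major (N, K) buffer is bit-for-bit identical to a column-major (K, N) matrix, so it can be passed directly as the column-major B operand of an ordinary (M, N, K) GEMM.

  #include "cutlass/gemm_coord.h"
  #include "cutlass/layout/matrix.h"

  // C(M, N) = A(M, K) * B^T, where B is stored as an (N, K) row-major buffer.
  int M = 1024, N = 150, K = 256;
  cutlass::gemm::GemmCoord problem_size(M, N, K);  // CUTLASS problem sizes are always (M, N, K)

  cutlass::layout::RowMajor    layout_A(K);  // A: (M, K) row-major, leading dimension K
  cutlass::layout::ColumnMajor layout_B(K);  // B storage viewed as (K, N) column-major, leading dimension K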

zwshan (Author) commented Jan 17, 2024

Can you help me, please? @hwu36

mnicely (Collaborator) commented Jan 17, 2024

@zwshan There is no expectation that CUTLASS should match or exceed cuBLAS performance. The intent of CUTLASS is to give developers an additional tool alongside cuBLAS for exploring functionality and requirements not currently supported by our libraries.

thakkarV (Collaborator) commented:

Also, you should not be using a GEMM kernel for a GEMV problem. We have GEMV and batched GEMV implementations that are better suited to your problem shapes.
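
A minimal sketch of what such a GEMV instantiation could look like, assuming the cutlass::gemm::kernel::Gemv / cutlass::gemm::device::Gemv pair that ships in the CUTLASS headers; the exact template parameter list should be checked against include/cutlass/gemm/kernel/gemv.h and the GEMV unit tests, so treat the parameter order and the element-type choices below as assumptions rather than a confirmed signature:

  #include "cutlass/gemm/kernel/gemv.h"
  #include "cutlass/gemm/device/gemv.h"
  #include "cutlass/epilogue/thread/linear_combination.h"

  // GEMV computes y = alpha * A * x + beta * y for an (M, K) matrix A.
  using GemvKernel = cutlass::gemm::kernel::Gemv<
      float,                          // ElementA
      cutlass::layout::RowMajor,      // LayoutA
      float,                          // ElementB  (the input vector x)
      float,                          // ElementC  (the output vector y)
      float,                          // ElementAccumulator
      cutlass::epilogue::thread::LinearCombination<
          float, 1, float, float>     // epilogue: alpha/beta scaling, scalar access
  >;

  using DeviceGemv = cutlass::gemm::device::Gemv<GemvKernel>;

For the N = 1 shape mentioned above, a GEMV is memory-bandwidth bound, so Tensor Cores are not the limiting factor; the point is simply to avoid paying for a 2-D threadblock tile when the output has a single column.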

hwu36 (Collaborator) commented Jan 17, 2024

You could use Nsight or nvprof to get the kernel name used by cuBLAS. The kernel name contains the tile sizes used; then we can fine-tune CUTLASS starting from the same tile sizes cuBLAS uses.
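
(For reference, not from the thread: a command such as nsys profile --stats=true ./your_app from Nsight Systems, or ncu ./your_app from Nsight Compute, lists the launched cuBLAS kernels, and their names usually encode the tile shape, e.g. a "128x64"-style substring; the exact naming scheme varies across cuBLAS versions. Note that nvprof does not support Ampere GPUs such as the A100, so the Nsight tools are the practical option there.)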

zwshan (Author) commented Jan 18, 2024

Thank you all! I will try it now!

zwshan (Author) commented Jan 25, 2024

Could you please tell me how to use the GEMV kernel on an SM80 (A100) device?

zwshan (Author) commented Jan 25, 2024

@thakkarV @hwu36

> You could use Nsight or nvprof to get the kernel name used by cuBLAS. The kernel name contains the tile sizes used; then we can fine-tune CUTLASS starting from the same tile sizes cuBLAS uses.

I profiled it and found that cuBLAS uses a GEMV kernel.

zwshan (Author) commented Jan 25, 2024

I want to use the GEMV kernel in the following way:

  using ElementOutput = float;
  using ElementAccumulator = float;
  using ElementComputeEpilogue = ElementAccumulator;
  using RowMajor = cutlass::layout::RowMajor;
  using ColumnMajor = cutlass::layout::ColumnMajor;

  using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
      ElementOutput,                                    // <- data type of output matrix
      128 / cutlass::sizeof_bits<ElementOutput>::value, // <- number of elements per vectorized memory access; for float output this is 4. It also becomes the vector width of the math instructions in the epilogue
      ElementAccumulator,                               // <- data type of accumulator
      ElementComputeEpilogue>;                          // <- data type of alpha/beta in the linear combination

  using CutlassGemm1 = cutlass::gemm::device::Gemm<
      cutlass::tfloat32_t,                  // Data type of A matrix
      RowMajor,                             // Layout of A matrix
      cutlass::tfloat32_t,                  // Data type of B matrix
      ColumnMajor,                          // Layout of B matrix
      ElementOutput,                        // Data type of C matrix
      ColumnMajor,                          // Layout of C matrix (LayoutC = layout::ColumnMajor)
      ElementAccumulator,                   // Data type of the accumulator
      cutlass::arch::OpClassTensorOp,       // Tag indicating Tensor Cores
      cutlass::arch::Sm80,                  // Tag indicating target GPU compute architecture
      cutlass::gemm::GemmShape<64, 64, 32>, // Threadblock tile shape
      cutlass::gemm::GemmShape<32, 32, 32>, // Warp tile shape
      cutlass::gemm::GemmShape<16, 8, 8>,   // Tensor Core instruction shape (TF32)
      EpilogueOp,                           // Epilogue defined above
      cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
      6                                     // Number of pipeline stages
  >;

  CutlassGemm1 gemm_operator;
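
For completeness, a launch sketch for the instantiation above, following the pattern of CUTLASS's basic_gemm example; the device pointers and the problem shape are placeholders standing in for the real buffers:

  // Placeholder device buffers; in a real program these come from cudaMalloc
  // plus host-to-device copies.
  cutlass::tfloat32_t const *d_A = nullptr;  // (M, K) row-major,    lda = K
  cutlass::tfloat32_t const *d_B = nullptr;  // (K, N) column-major, ldb = K
  float *d_C = nullptr;                      // (M, N) column-major, ldc = M

  int M = 1024, N = 150, K = 256;
  float alpha = 1.0f, beta = 0.0f;

  CutlassGemm1::Arguments args(
      {M, N, K},       // problem size
      {d_A, K},        // TensorRef to A: pointer + leading dimension
      {d_B, K},        // TensorRef to B
      {d_C, M},        // TensorRef to C (read when beta != 0)
      {d_C, M},        // TensorRef to D (the output)
      {alpha, beta});  // linear combination: D = alpha * A * B + beta * C

  cutlass::Status status = gemm_operator(args);  // a workspace pointer and CUDA stream can also be passed

Note that for the (1024, 1, 256) case the 64-wide N dimension of this threadblock tile is almost entirely wasted, which is why the GEMV path suggested earlier in the thread is a better fit for that shape.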

hwu36 (Collaborator) commented Jan 25, 2024

mnicely (Collaborator) commented Feb 22, 2024

@zwshan has your issue been resolved?

zwshan (Author) commented Feb 22, 2024 via email

mnicely closed this as completed Feb 22, 2024