
[QST] Is this the complete set of valid parameters for performing fp16 matrix multiplication using tensor cores? #1304

Closed
zwshan opened this issue Jan 16, 2024 · 14 comments
Labels: question

Comments

zwshan commented Jan 16, 2024

What is your question?
This page lists many parameters; may I ask whether the ones listed there are already all of the valid parameters?

thakkarV (Collaborator) commented Jan 16, 2024

No, they are not the full set of parameters the API supports. Generally speaking, the set of all valid template parameters supported by any kernel is so large, due to combinatorial explosion, that no amount of testing can cover it, and the CUTLASS library cannot generate every valid configuration.

zwshan (Author) commented Jan 17, 2024

> No, they are not the full set of parameters the API supports. Generally speaking, the set of all valid template parameters supported by any kernel is so large, due to combinatorial explosion, that no amount of testing can cover it, and the CUTLASS library cannot generate every valid configuration.

I have noticed that when performing matrix multiplication on an A100, the computation for MNK = (1024, 150, 256) and MNK = (1024, 1, 256) is significantly slower than cuBLAS. I have tried all the parameters listed on the following website, but I still can't match or exceed cuBLAS performance. What should I do now?

zwshan (Author) commented Jan 17, 2024

To clarify: by MNK I mean the matrix multiplication of an (M, K) matrix with an (N, K) matrix, i.e. the second operand is used transposed (C = A * B^T).
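
A note on how this maps onto CUTLASS (a minimal sketch; the variable names are illustrative and not from the post): a row-major (N, K) buffer is bit-for-bit identical to a column-major (K, N) matrix, so it can be passed directly as the column-major B operand of an ordinary (M, N, K) GEMM.

  #include "cutlass/gemm_coord.h"
  #include "cutlass/layout/matrix.h"

  // C(M, N) = A(M, K) * B^T, where B is stored as an (N, K) row-major buffer.
  int M = 1024, N = 150, K = 256;
  cutlass::gemm::GemmCoord problem_size(M, N, K);  // CUTLASS problem sizes are always (M, N, K)

  cutlass::layout::RowMajor    layout_A(K);  // A: (M, K) row-major, leading dimension K
  cutlass::layout::ColumnMajor layout_B(K);  // B storage viewed as (K, N) column-major, leading dimension K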

zwshan (Author) commented Jan 17, 2024

Can you help me, please? @hwu36

mnicely (Collaborator) commented Jan 17, 2024

@zwshan There is no expectation that CUTLASS should match or exceed cuBLAS performance. The intent of CUTLASS is to give developers an additional tool alongside cuBLAS for exploring functionality and requirements not currently supported by our libraries.

thakkarV (Collaborator) commented:

Also, you should not be using a GEMM kernel for a GEMV problem. We have GEMV and batched GEMV implementations that are better suited to your problem shapes.
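
A minimal sketch of what such a GEMV instantiation could look like, assuming the cutlass::gemm::kernel::Gemv / cutlass::gemm::device::Gemv pair that ships in the CUTLASS headers; the exact template parameter list should be checked against include/cutlass/gemm/kernel/gemv.h and the GEMV unit tests, so treat the parameter order and the element-type choices below as assumptions rather than a confirmed signature:

  #include "cutlass/gemm/kernel/gemv.h"
  #include "cutlass/gemm/device/gemv.h"
  #include "cutlass/epilogue/thread/linear_combination.h"

  // GEMV computes y = alpha * A * x + beta * y for an (M, K) matrix A.
  using GemvKernel = cutlass::gemm::kernel::Gemv<
      float,                          // ElementA
      cutlass::layout::RowMajor,      // LayoutA
      float,                          // ElementB  (the input vector x)
      float,                          // ElementC  (the output vector y)
      float,                          // ElementAccumulator
      cutlass::epilogue::thread::LinearCombination<
          float, 1, float, float>     // epilogue: alpha/beta scaling, scalar access
  >;

  using DeviceGemv = cutlass::gemm::device::Gemv<GemvKernel>;

For the N = 1 shape mentioned above, a GEMV is memory-bandwidth bound, so Tensor Cores are not the limiting factor; the point is simply to avoid paying for a 2-D threadblock tile when the output has a single column.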

hwu36 (Collaborator) commented Jan 17, 2024

You could use Nsight or nvprof to get the kernel name used by cuBLAS. The kernel name contains the tile sizes used; then we can fine-tune CUTLASS starting from the same tile sizes cuBLAS uses.
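
(For reference, not from the thread: a command such as nsys profile --stats=true ./your_app from Nsight Systems, or ncu ./your_app from Nsight Compute, lists the launched cuBLAS kernels, and their names usually encode the tile shape, e.g. a "128x64"-style substring; the exact naming scheme varies across cuBLAS versions. Note that nvprof does not support Ampere GPUs such as the A100, so the Nsight tools are the practical option there.)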

zwshan (Author) commented Jan 18, 2024

Thank you all! I will try it now!

zwshan (Author) commented Jan 25, 2024

Could you please tell me how to use the GEMV kernel on an SM80 (A100) device?

zwshan (Author) commented Jan 25, 2024

@thakkarV @hwu36

> You could use Nsight or nvprof to get the kernel name used by cuBLAS. The kernel name contains the tile sizes used; then we can fine-tune CUTLASS starting from the same tile sizes cuBLAS uses.

I profiled it and found that cuBLAS uses a GEMV kernel.

zwshan (Author) commented Jan 25, 2024

I want to use the GEMV kernel in the following way:

  using ElementOutput = float;
  using ElementAccumulator = float;
  using ElementComputeEpilogue = ElementAccumulator;
  using RowMajor = cutlass::layout::RowMajor;
  using ColumnMajor = cutlass::layout::ColumnMajor;

  using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
      ElementOutput,                                    // <- data type of output matrix
      128 / cutlass::sizeof_bits<ElementOutput>::value, // <- number of elements per vectorized memory access; for float output this is 4. It also becomes the vector width of the math instructions in the epilogue
      ElementAccumulator,                               // <- data type of accumulator
      ElementComputeEpilogue>;                          // <- data type of alpha/beta in the linear combination

  using CutlassGemm1 = cutlass::gemm::device::Gemm<
      cutlass::tfloat32_t,                  // Data type of A matrix
      RowMajor,                             // Layout of A matrix
      cutlass::tfloat32_t,                  // Data type of B matrix
      ColumnMajor,                          // Layout of B matrix
      ElementOutput,                        // Data type of C matrix
      ColumnMajor,                          // Layout of C matrix (LayoutC = layout::ColumnMajor)
      ElementAccumulator,                   // Data type of the accumulator
      cutlass::arch::OpClassTensorOp,       // Tag indicating Tensor Cores
      cutlass::arch::Sm80,                  // Tag indicating target GPU compute architecture
      cutlass::gemm::GemmShape<64, 64, 32>, // Threadblock tile shape
      cutlass::gemm::GemmShape<32, 32, 32>, // Warp tile shape
      cutlass::gemm::GemmShape<16, 8, 8>,   // Tensor Core instruction shape (TF32)
      EpilogueOp,                           // Epilogue defined above
      cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
      6                                     // Number of pipeline stages
  >;

  CutlassGemm1 gemm_operator;
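
For completeness, a launch sketch for the instantiation above, following the pattern of CUTLASS's basic_gemm example; the device pointers and the problem shape are placeholders standing in for the real buffers:

  // Placeholder device buffers; in a real program these come from cudaMalloc
  // plus host-to-device copies.
  cutlass::tfloat32_t const *d_A = nullptr;  // (M, K) row-major,    lda = K
  cutlass::tfloat32_t const *d_B = nullptr;  // (K, N) column-major, ldb = K
  float *d_C = nullptr;                      // (M, N) column-major, ldc = M

  int M = 1024, N = 150, K = 256;
  float alpha = 1.0f, beta = 0.0f;

  CutlassGemm1::Arguments args(
      {M, N, K},       // problem size
      {d_A, K},        // TensorRef to A: pointer + leading dimension
      {d_B, K},        // TensorRef to B
      {d_C, M},        // TensorRef to C (read when beta != 0)
      {d_C, M},        // TensorRef to D (the output)
      {alpha, beta});  // linear combination: D = alpha * A * B + beta * C

  cutlass::Status status = gemm_operator(args);  // a workspace pointer and CUDA stream can also be passed

Note that for the (1024, 1, 256) case the 64-wide N dimension of this threadblock tile is almost entirely wasted, which is why the GEMV path suggested earlier in the thread is a better fit for that shape.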

hwu36 (Collaborator) commented Jan 25, 2024

mnicely (Collaborator) commented Feb 22, 2024

@zwshan has your issue been resolved?

zwshan (Author) commented Feb 22, 2024 via email

mnicely closed this as completed Feb 22, 2024