
[QST] How do I do a matrix multiply inside a Cuda kernel? #1291

Closed
mrakgr opened this issue Jan 4, 2024 · 28 comments
Labels: question

mrakgr commented Jan 4, 2024

What is your question?

I want to implement a game + ML library directly on the GPU, without any of the inefficiency of going back and forth to the host. But as far as I can tell, most of the Cutlass examples call from the host. I thought that the kernel API would allow me to call from the device, but it doesn't seem like that is the case.

The one example that I found calls from the host.

I spent some time studying the examples, but they are all extremely complex, which is why I'm asking here. How do I do the equivalent of a BLAS GEMM call inside the kernel?

thakkarV (Collaborator) commented Jan 4, 2024

CUTLASS provides utilities to implement linear algebra at all layers of the GPU hierarchy. It is certainly possible to "fuse" a GEMM into some existing kernel; however, it requires more work than using a GEMM kernel off the shelf. Can you describe your problem in a bit more detail? What is your problem shape, and do you know it at compile time? What are the data types? What are the layouts of the input tensors, and do you have control over them? What memory space do these input tensors live in? What architecture are you targeting? What are your perf goals? The more details you provide, the more we can help you.

mrakgr (Author) commented Jan 4, 2024

[image: diagram of the proposed game + NN architecture on the GPU]

I made this diagram to make it easier to understand. I want Cutlass in order to implement the matrix multiplies needed for the neural net model. What I had in mind is to launch a grid with a bunch of blocks at the start of the program. A minority of them would be dedicated to running the game itself. The rest of them would go past the game loop into the NN model loop.

They would wait at a device wide barrier. They would be joined by the game threads when they request an action and the neural net model would execute. At the end of it, the game threads would return from the function invocation with the results, resuming the game, and the NN threads would go back to waiting on the barrier.

With this kind of architecture, it should be possible to dedicate the entirety of the resources of the GPU to the neural net model. I think this is the way RL research should be done in the future.
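
To make the shape of this concrete, here is a rough sketch of the control flow I have in mind, using a cooperative launch and a grid-wide barrier (the block split, the loop bodies, and the termination check are all placeholders, not working code):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int NUM_GAME_BLOCKS = 8;  // hypothetical split of the grid

// Must be launched via cudaLaunchCooperativeKernel so that grid.sync() is legal.
__global__ void game_and_model(/* game state, NN weights, activations, ... */) {
  cg::grid_group grid = cg::this_grid();
  const bool is_game_block = blockIdx.x < NUM_GAME_BLOCKS;

  while (true) {  // a real version would check a termination flag in global memory
    if (is_game_block) {
      // ... advance the game until an action is needed,
      //     then write the observations into global memory ...
    }
    grid.sync();  // device-wide barrier: observations are now visible to every block
    if (!is_game_block) {
      // ... run the NN forward pass (this is where the matrix multiplies go) ...
    }
    grid.sync();  // actions are visible; the game blocks resume with the results
  }
}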

mrakgr (Author) commented Jan 4, 2024

What is your problem shape, and do you know it at compile time? What are the data types? What are the layouts of the input tensors, and do you have control over them? What memory space do these input tensors live in?

They will all be statically known: the batch size as well as the input and output dimensions. The data types would be floats, also statically known; I'll have to experiment to see which ones work best. As for memory space, they should be in global memory.

What architecture are you targeting?

An RTX 4060 for now. I got it just so I could program this.

What are your perf goals? The more details you provide, the more we can help you.

Hopefully something far better than doing the game in Python and using PyTorch for the NN model. That was as slow as a snail. I don't know exactly how many hands per second I'll get until I try this, but the PyTorch version that I did in 2021 was on the order of tens of thousands of them per second, which was pretty horrible. This was with various kinds of models, so I think I was being bottlenecked by the bandwidth of transferring data between the CPU and the GPU.

mrakgr (Author) commented Jan 4, 2024

What are the layouts of input tensors and do you have control over them?

I forgot to address this. Basically, everything is known and controllable at compile time.

mnicely (Collaborator) commented Jan 5, 2024

If you need GEMM in the CUDA kernel, you may want to investigate using cuBLASDx. It should be available through early access very soon. Here is a link to the Math Libraries Early Access page. It sounds like your use case meets all the requirements. Take a look at the cuFFTDx documentation to get an idea of the API.

mrakgr (Author) commented Jan 5, 2024

I heard about that a month ago, but I put it out of my mind because it wasn't out yet. I'll check it out again. Thanks!

thakkarV (Collaborator) commented Jan 8, 2024

OK, so you just want to call an entire GEMM kernel, but from within a CUDA kernel that you already have. I think you can just stamp out a cutlass::gemm::kernel::GemmUniversal and then call its operator() in your code.

mrakgr (Author) commented Jan 8, 2024

An example would make that a lot easier. The Cutlass library right now is not for the faint of heart.

thakkarV (Collaborator) commented Jan 8, 2024

I unfortunately do not have the time to write a full example, but I can try to stub something out for you here. The kernel-layer GEMM is simply the __global__ entry functor, but to have a single launch interface we mark it as device-only and then invoke it with a basically empty global entry point. This can be found here: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/device_kernel.h#L64

What I am suggesting is that you hijack this function and use everything else off the shelf. You can run whatever code you want alongside the GEMM if you write something similar:

/// Generic CUTLASS kernel template.
template <typename Operator>
__global__
void Kernel(typename Operator::Params params) {
  // Dynamic shared memory base pointer
  extern __shared__ int SharedStorageBase[];
  // Declare pointer to dynamic shared memory.
  typename Operator::SharedStorage *shared_storage =
      reinterpret_cast<typename Operator::SharedStorage *>(SharedStorageBase);

  Operator op;
  while(game_loop) {
    // do whatever before
    op(params, *shared_storage);
    // do whatever after
  }
}

mrakgr (Author) commented Jan 8, 2024

Thanks. I'll give it a try. Let me first get the Spiral tensors out of the way. I am not doing this in C++, so there is more work needed to get the fundamentals in place.

mrakgr (Author) commented Jan 9, 2024

  extern __shared__ int SharedStorageBase[];
  // Declare pointer to dynamic shared memory.
  typename Operator::SharedStorage *shared_storage =
      reinterpret_cast<typename Operator::SharedStorage *>(SharedStorageBase);

I just want to ask ahead of time, before I try doing it: it feels like there is information missing in this fragment. How will the GEMM kernel know what the size of the shared storage is?

hwu36 (Collaborator) commented Jan 9, 2024

We use dynamic shared memory allocation. Check this blog: https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/

Dynamic shared memory allocation is required when the allocated shared memory is big (IIRC the bar is 48 KB).

mrakgr (Author) commented Jan 9, 2024

I know what dynamic shared memory is. The only problem is that, as written, the kernel won't know what the size of the shared array is.

...Never mind my question. Now that I think about it, the size of the dynamically allocated shared memory is probably a part of params.

hwu36 (Collaborator) commented Jan 9, 2024

When you launch the kernel, you need to put the shared memory size inside <<<...>>>.
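
For the stub above, that would look roughly like this (a sketch; the grid and block dims are whatever your configuration needs):

// The third launch parameter is the dynamic shared memory size in bytes.
size_t smem_bytes = sizeof(typename Operator::SharedStorage);
Kernel<Operator><<<grid_dims, block_dims, smem_bytes>>>(params);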

mrakgr (Author) commented Jan 19, 2024

Sorry for the long delay. I managed to implement the tensor functionality in Spiral, which will be necessary for passing the multidimensional arrays from the Python backend to the CUDA one, and now I am finally ready to start work on the ML library. Finding a matrix multiply that can work on the device is now my main priority. I registered for Early Access to the CUDA Math libraries two weeks ago, but didn't get an answer, so I am now looking at the Cutlass library again.

As expected, I am having difficulty getting started.

The basic device API example that Cutlass provides makes a lot of sense and resembles a cuBLAS call, which I am familiar with.
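
I.e. something along these lines, going from memory of the basic GEMM example (so the details may be slightly off):

#include "cutlass/gemm/device/gemm.h"

// Single-precision GEMM with column-major operands, called from the host.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C/D

Gemm gemm_op;
cutlass::Status status = gemm_op({
    {M, N, K},        // problem size
    {A, lda},         // TensorRef to A
    {B, ldb},         // TensorRef to B
    {C, ldc},         // TensorRef to C (source)
    {C, ldc},         // TensorRef to D (destination, aliased with C here)
    {alpha, beta}});  // epilogue scalars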

But the kernel API is a lot more complex.

template <
  typename Mma_,                  ///! Threadblock-scoped matrix multiply-accumulate
  typename Epilogue_,             ///! Epilogue
  typename ThreadblockSwizzle_    ///! Threadblock swizzling function
>
class GemmUniversal<
  Mma_,
  Epilogue_,
  ThreadblockSwizzle_,
  void,
  // 3.x kernels use the first template argument to define the ProblemShape
  // We use this invariant to SFINAE dispatch against either the 2.x API or the 3.x API
  cute::enable_if_t<not (cute::is_tuple<Mma_>::value || IsCutlass3ArrayKernel<Mma_>::value)>
> {

As was suggested, I need to first figure out how to stamp one of these out. What should I be passing in for Mma_, Epilogue_, ThreadblockSwizzle_? Also, I am confused about what the template arguments to the right of the class name in the above fragment are doing, the stuff between the declaration and the body, I mean: GemmUniversal<Mma_, Epilogue_, ThreadblockSwizzle_, void, cute::enable_if_t<not (cute::is_tuple<Mma_>::value || IsCutlass3ArrayKernel<Mma_>::value)>>. I didn't know this was valid C++, and I can't find anything about it on the class template reference page.
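
To check that I'm parsing the syntax right, I think it reduces to this kind of pattern (a toy reduction of my own, not the actual CUTLASS code):

#include <type_traits>

// Primary template: the last two parameters default to void; the final one is a SFINAE slot.
template <typename Mma, typename Epilogue, typename Swizzle,
          typename Extra = void, typename Enable = void>
class GemmLike;

// Partial specialization: the argument list after `GemmLike<` pins Extra to void and
// constrains Enable with enable_if, so this body is selected only when the condition
// holds for the deduced Mma/Epilogue/Swizzle.
template <typename Mma, typename Epilogue, typename Swizzle>
class GemmLike<Mma, Epilogue, Swizzle, void,
               std::enable_if_t<!std::is_pointer<Mma>::value>> {
  // implementation for this case goes here
};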

Secondly...

    Arguments(
      GemmUniversalMode mode,
      GemmCoord problem_size,
      int batch_count,
      typename EpilogueOutputOp::Params epilogue,
      void const * ptr_A,
      void const * ptr_B,
      void const * ptr_C,
      void * ptr_D,
      int64_t batch_stride_A,
      int64_t batch_stride_B,
      int64_t batch_stride_C,
      int64_t batch_stride_D,
      typename LayoutA::Stride stride_a,
      typename LayoutB::Stride stride_b,
      typename LayoutC::Stride stride_c,
      typename LayoutC::Stride stride_d,
      int const *ptr_gather_A_indices = nullptr,
      int const *ptr_gather_B_indices = nullptr,
      int const *ptr_scatter_D_indices = nullptr)

I need a reference for how to construct the Arguments struct. There are indices, batch strides, regular strides, mode, problem size, batch count...

    /// Constructor
    Params(
      Arguments const &args,  /// GEMM application arguments
      int device_sms,         /// Number of SMs on the device
      int sm_occupancy)       /// Kernel SM occupancy (in thread blocks)
    :

...In order to make this struct.
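
For reference, my best guess at filling that Arguments constructor in for a plain, non-batched, row-major f32 GEMM would be something like the following (the epilogue form and all the values are my own assumptions, not verified against the library):

typename GemmKernel::Arguments args(
    cutlass::gemm::GemmUniversalMode::kGemm,  // mode: ordinary GEMM
    {M, N, K},                                // problem_size
    1,                                        // batch_count
    {alpha, beta},                            // EpilogueOutputOp::Params, assuming LinearCombination
    ptr_A, ptr_B, ptr_C, ptr_D,               // device pointers to the operands
    int64_t(M) * K,                           // batch_stride_A (unused when batch_count == 1)
    int64_t(K) * N,                           // batch_stride_B
    int64_t(M) * N,                           // batch_stride_C
    int64_t(M) * N,                           // batch_stride_D
    K, N, N, N);                              // leading dimensions, assuming row-major A/B/C/D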

mrakgr (Author) commented Jan 19, 2024

One other thing. Looking at the Arguments class, it does not seem that the size of the extern shared memory is being passed in as an argument. Shouldn't that be done somewhere? Otherwise, how will the GEMM operator know how much shared storage we are allowing it?

I have to re-raise this question, as my earlier assumption that it would be passed as an argument wasn't valid.

mrakgr (Author) commented Jan 19, 2024

I am also concerned: is using a reinterpret_cast to create that SharedStorage pointer really the right way to go about it?

mrakgr (Author) commented Jan 19, 2024

https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/sm80_gemm_f16_f16_f32_tensor_op_f32.cu

I found this in a different thread and it is a good example of how to construct a GemmUniversal kernel.

mrakgr (Author) commented Jan 19, 2024

That having been said, as things are, I am going to dig out a regular matrix multiply that uses shared memory and use that instead of Cutlass. I just can't afford to keep spending time that should be going into building the ML library on trying to figure out Cutlass instead. As it is a C++ project with the authors' own setup, I can't open it in something like VS to get the types via IntelliSense, or run the examples and study them through the debugger.

If the authors want to provide a concrete example of how to do what I asked in this thread, that would be appreciated, but I'll just let the issue linger for the time being.

thakkarV (Collaborator) commented Jan 19, 2024

Do you want to adopt the 2.x API or the 3.x API? It mostly depends on the architectures you are targeting. For pre-Hopper, I would recommend 2.x. As for how to generate the right set of kernels, I would encourage you to look at how the CUTLASS library generates kernel configurations and emits the corresponding C++ to stamp out kernels. You can also lift the entire manifest type from the CUTLASS library into your application as a handle to all kernels.

thakkarV (Collaborator) commented Jan 19, 2024

As for shared memory, the kernels in CUTLASS only ever work with a static amount of shared memory that the kernel knows about based on the template config. The kernel must be launched with that much dynamic smem, otherwise it will not run correctly. https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/device/gemm_universal_adapter.h#L237
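
Paraphrasing what happens at that line, so you can replicate it around your own entry point (a sketch; Kernel here stands for whatever __global__ function you wrote):

// The shared memory requirement is a compile-time property of the kernel type.
int smem_size = int(sizeof(typename GemmKernel::SharedStorage));

// Above the portable 48 KB limit you must opt in before launching.
if (smem_size >= (48 << 10)) {
  cudaFuncSetAttribute(Kernel<GemmKernel>,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       smem_size);
}
Kernel<GemmKernel><<<grid, block, smem_size>>>(params);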

mrakgr (Author) commented Jan 19, 2024

As for shared memory, the kernels in CUTLASS only ever work with a static amount of shared memory that the kernel knows about based on the template config. The kernel must be launched with that much dynamic smem, otherwise it will not run correctly. https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/device/gemm_universal_adapter.h#L237

If the amount of shared memory needed is known by the Cutlass kernel template at compile time, then instead of declaring dynamic shared memory arrays, wouldn't it be better to declare static ones instead?

Do you want to adopt the 2.x API or the 3.x API? It mostly depends on the architectures you are targeting. For pre-Hopper, I would recommend 2.x. As for how to generate the right set of kernels, I would encourage you to look at how the CUTLASS library generates kernel configurations and emits the corresponding C++ to stamp out kernels. You can also lift the entire manifest type from the CUTLASS library into your application as a handle to all kernels.

This would take a ton of time, and I can't realistically run the huge C++ project examples in an IDE to step through them and get a sense of the program flow as well as the types, since I am doing my development on Windows. The question of which version of the API to use is largely irrelevant to me; if there were good examples of how to use the kernel API for my use case, I'd use either one.

What I'd really want is a BLAS-like API for Cutlass that could actually be used on the device. I do not want to pick tile sizes or deal with similar kinds of low-level details. I am not interested in researching matrix multiplication itself, but in doing reinforcement learning purely on the GPU.

Given my circumstances, it would be a lot faster to pick the CUDA matmul sample, translate it into Spiral, and use that while I wait for cuBLASDx to be released.

Putting in a lot more effort to make Cutlass work is only something I'd be willing to do if cuBLASDx is not out by the time I've built out everything else. That will take me a few months.

thakkarV (Collaborator) commented Jan 19, 2024

If the amount of shared memory needed is known by the kernel at compile time, then instead of declaring dynamic shared memory arrays, wouldn't it be better to declare static ones instead?

No, because this often exceeds the portable 48K or so of smem, and we need to use the driver API to opt into larger smem carveouts, which requires using dynamic smem.

What I'd really want is a BLAS-like API for Cutlass, that could actually be used on the device

If you look at the runtime kernel::GemmUniversal API, that is basically what it is. In your case, you likely just need a single kernel for your specific use case rather than a library of many kernels. That is why I am suggesting looking through the profiler to pick the best template arguments for your specific arch, dtypes, layouts, etc., and just copy-pasting that config into your application. Then you can just call the kernel's __device__ operator() from your __global__ routine.

cuBLASDx is designed to fuse much smaller matmuls into an existing program and is intended for use cases where you have small matmuls you can run out of shared memory. Is that the case for you? Sounds like you have large GEMMs to run instead of small matmuls.

mrakgr (Author) commented Jan 20, 2024

It seems that the cuBLASDx preview is already out, so I'll close this issue. Thank you for taking the time to answer some of my questions.

mnicely (Collaborator) commented Feb 22, 2024

cuBLASDx examples have been posted https://github.com/NVIDIA/CUDALibrarySamples/tree/master/MathDx

mrakgr (Author) commented Mar 16, 2024

Rather than using either Cutlass or cuBLASDx, I ended up spending the last two months building my own matrix multiply kernel as a part of the Spiral playlist. I think I did well, but the one thing that I couldn't quite get down is the async loads on Ampere. No matter what I did, trying to interleave loads with the hmma instructions ended up blowing up the register count, and even when it didn't spill over with smaller shared memory block sizes, I couldn't get any performance improvement from it. The synchronous loading version ended up being significantly faster as a result.

I wonder if there is something I don't understand about how to use async loads?
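
For reference, this is roughly the double-buffered pattern I understand cp.async is meant for (a simplified sketch with placeholder tile sizes, not my actual kernel):

#include <cuda_pipeline.h>

// Two-stage pipeline: each thread copies one float4 per tile from global to shared
// memory with cp.async while computing on the previously fetched tile.
__global__ void pipelined(const float4* __restrict__ gmem, int num_tiles) {
  constexpr int THREADS = 128;
  __shared__ float4 smem[2][THREADS];
  const int t = threadIdx.x;

  // Prologue: issue the copy for tile 0 before any compute.
  __pipeline_memcpy_async(&smem[0][t], &gmem[t], sizeof(float4));
  __pipeline_commit();

  for (int tile = 0; tile < num_tiles; ++tile) {
    const int cur = tile & 1;
    const int nxt = (tile + 1) & 1;

    // Issue the copy for the next tile while the current one is still in flight.
    if (tile + 1 < num_tiles) {
      __pipeline_memcpy_async(&smem[nxt][t], &gmem[(tile + 1) * THREADS + t], sizeof(float4));
    }
    __pipeline_commit();

    // Wait until everything except the most recent commit has landed, i.e. stage `cur` is ready.
    __pipeline_wait_prior(1);
    __syncthreads();

    // ... the mma instructions would consume smem[cur] here ...

    __syncthreads();  // everyone must be done with smem[cur] before it gets overwritten
  }
}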

I tried cuBLASDx, but for some reason its performance was horrible, and even setting that aside, I thought that I could do better than it (and I have) given its interface constraints.

There is a cuBLASDx example here, along with the compiled output, in case @mnicely wants to have a look.

hwu36 (Collaborator) commented Mar 20, 2024

Your PTX needs to be very similar to the CUTLASS one.

mrakgr (Author) commented Mar 20, 2024

Hmmm, that actually isn't a bad idea, using Cutlass as a reference, I mean. I've only been using the CuPy matmul (from the cuBLAS library) as a reference, and it doesn't use tensor cores. It didn't use async loads when I looked into it either, though.

That having been said, I'll just upgrade my kernel to use the TMA and bulk async transfers when the 50xx cards come out, and that will take care of the memory transfer issues.
