clBLAS to MIOpenGEMM

The clBLAS API for sgemm is:

clblasStatus
clblasSgemm(
    clblasOrder order,
    clblasTranspose transA,
    clblasTranspose transB,
    size_t M,
    size_t N,
    size_t K,
    cl_float alpha,
    const cl_mem A,
    size_t offA,
    size_t lda,
    const cl_mem B,
    size_t offB,
    size_t ldb,
    cl_float beta,
    cl_mem C,
    size_t offC,
    size_t ldc,
    cl_uint numCommandQueues,
    cl_command_queue *commandQueues,
    cl_uint numEventsInWaitList,
    const cl_event *eventWaitList,
    cl_event *events);
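
To make the roles of the offset and leading-dimension parameters concrete, here is a minimal host-side sketch of the operation sgemm is defined to perform, C = alpha * op(A) * op(B) + beta * C. It is a plain C++ reference and not clBLAS code: the names reference_sgemm and element_index are hypothetical, std::vector stands in for the cl_mem buffers, and the order and transpose enums are reduced to bools.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Position in the buffer of logical element (row, col) of op(X), where X is
// stored with leading dimension ld, starting off elements into the buffer.
// (Hypothetical helper, for illustration only.)
inline std::size_t element_index(bool isColMajor, bool transposed,
                                 std::size_t row, std::size_t col,
                                 std::size_t off, std::size_t ld)
{
  // Element (row, col) of the transposed matrix is element (col, row) of X.
  if (transposed) std::swap(row, col);
  return off + (isColMajor ? row + col * ld : row * ld + col);
}

// Host-side reference: C = alpha * op(A) * op(B) + beta * C,
// where op(A) is M x K, op(B) is K x N and C is M x N.
inline void reference_sgemm(bool isColMajor, bool tA, bool tB,
                            std::size_t M, std::size_t N, std::size_t K,
                            float alpha,
                            const std::vector<float>& A, std::size_t offA, std::size_t lda,
                            const std::vector<float>& B, std::size_t offB, std::size_t ldb,
                            float beta,
                            std::vector<float>& C, std::size_t offC, std::size_t ldc)
{
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        acc += A[element_index(isColMajor, tA, m, k, offA, lda)] *
               B[element_index(isColMajor, tB, k, n, offB, ldb)];
      }
      float& c = C[element_index(isColMajor, false, m, n, offC, ldc)];
      c = alpha * acc + beta * c;
    }
  }
}
```

A reference like this can be handy for checking the output of a GPU GEMM on small inputs after porting.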

A loop of GEMMs using clBLAS might look like this:

for (int i = 0; i < 10; ++i) {
  clblasStatus status = clblasSgemm(
    // order, transA and transB are clBLAS enums
    order, transA, transB, M, N, K,

    // scalars, memory buffers, offsets and leading dimensions
    alpha, A, offA, lda, B, offB, ldb, beta, C, offC, ldc,

    // clBLAS allows multiple cl_command_queues
    n_queues, queues, n_waitlist, waitlist, events);
}

The equivalent code using MIOpenGEMM might look like this:

for (int i = 0; i < 10; ++i){
  auto stat = MIOpenGEMM::gemm0<float>(
  isColMajor, tA, tB, M, N, K, 

  alpha, A, offA, lda, B, offB, ldb, beta, C, offC, ldc, 

  &queues[0], n_waitlist, waitlist, &events[0]);
}

If the matrices A, B and C are very small (smaller than 100x100), the alternative API function xgemm is slightly faster because it has less host-side overhead. Using xgemm:

// First, a "warm-up" call for this GEMM geometry, for 
// generating kernel source string, compiling and getting ID.    
auto stat = MIOpenGEMM::xgemm<float>(
// isColMajor, tA and tB are now bool
isColMajor, tA, tB, M, N, K,

// unchanged from clBLAS
alpha, A, offA, lda, B, offB, ldb, beta, C, offC, ldc,

// assuming no workspace for now
nullptr,0,0, 

// MIOpenGEMM only allows 1 cl_command_queue
&queues[0], n_waitlist, waitlist, &events[0],  

// this is the first run with this GEMM geometry, so ID is negative  
-1);

// obtain the cache ID for this geometry from the returned GemmStatus object
int ID_for_this_geometry = stat.ID;

// Now run with the cached and compiled kernel
for (int i = 1; i < 10; ++i){
  stat = MIOpenGEMM::xgemm<float>(
  isColMajor, tA, tB, M, N, K, alpha, A, offA, lda, B, offB, ldb, beta, 
  C, offC, ldc, nullptr,0,0, &queues[0], n_waitlist, waitlist, &events[0],  
  ID_for_this_geometry);
}
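
When an application runs several different GEMM geometries, the returned IDs need to be remembered per geometry. The following is a minimal sketch of one way to do that on the caller's side. GeometryKey and GeometryIdCache are hypothetical names that are not part of MIOpenGEMM, and the choice of fields that identify a geometry (layout, transposes, sizes and leading dimensions) is an assumption made for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <tuple>

// Hypothetical key: isColMajor, tA, tB, M, N, K, lda, ldb, ldc.
// Which parameters actually define a geometry is an assumption here.
using GeometryKey = std::tuple<bool, bool, bool,
                               std::size_t, std::size_t, std::size_t,
                               std::size_t, std::size_t, std::size_t>;

// Caller-side cache from geometry to the ID returned by a warm-up call.
class GeometryIdCache {
public:
  // Returns the stored ID, or -1 if this geometry has not been seen yet,
  // which matches the "first run" value passed to xgemm above.
  int lookup(const GeometryKey& key) const
  {
    auto it = ids_.find(key);
    return it == ids_.end() ? -1 : it->second;
  }

  // Remembers the ID obtained from the status object of a warm-up call.
  void store(const GeometryKey& key, int id) { ids_[key] = id; }

private:
  std::map<GeometryKey, int> ids_;
};
```

With such a cache, each call would pass cache.lookup(key) as the final ID argument and, when that value was -1, store stat.ID afterwards.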

The differences between the clBLAS and MIOpenGEMM APIs are:

enums order, transA, transB are converted to bools isColMajor, tA, tB
(xgemm) three workspace parameters are added (which can safely be set to nullptr, 0, 0)
numCommandQueues is removed: MIOpenGEMM takes a pointer to a single cl_command_queue and a single cl_event, so the arrays queues and events must have length 1
(xgemm) ID is added (which can safely be set to -1)
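
The enum-to-bool conversion in the first item can be sketched as below. To keep the example compilable without clBLAS.h, it uses stand-in enums (SketchOrder, SketchTranspose) that mirror the clBLAS values clblasRowMajor/clblasColumnMajor and clblasNoTrans/clblasTrans/clblasConjTrans; the helper names are hypothetical. In real code the same comparisons would be made against the clBLAS enum values.

```cpp
// Stand-ins for clblasOrder and clblasTranspose (illustration only).
enum SketchOrder { sketchRowMajor, sketchColumnMajor };
enum SketchTranspose { sketchNoTrans, sketchTrans, sketchConjTrans };

// isColMajor is true exactly when the clBLAS order is column-major.
inline bool to_is_col_major(SketchOrder order)
{
  return order == sketchColumnMajor;
}

// For real-valued sgemm, a conjugate transpose is the same as a transpose,
// so both map to true.
inline bool to_transposed(SketchTranspose trans)
{
  return trans != sketchNoTrans;
}
```

For example, isColMajor = to_is_col_major(order), tA = to_transposed(transA) and tB = to_transposed(transB).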

Workspace can accelerate GEMM in certain cases, especially when A and B differ significantly in size and their leading dimensions are large powers of 2.
