Skip to content
This repository has been archived by the owner on Jul 26, 2024. It is now read-only.

Latest commit

 

History

History
66 lines (43 loc) · 2.3 KB

README.md

File metadata and controls

66 lines (43 loc) · 2.3 KB

Defunct

Development on this repo has been discontinued and the code is no longer supported.

Despite the generality of the interface, the performance achieved by this implementation of Kronecker multiplication is insufficient to address the problems of interest to the main ASGarD project. The code here has been superseded by series of kernels included into the main ASGarD library and linked directly with the relevant data-structures.

Old description and instructions below:

Code to perform 6-dimensional batched kronecker product on GPU and CPU (using OpenMP).

Y(:,k) += kron(A1(:,:,k), ..., A6(:,:,k) ) * X(:,k), where k=1:batchCount

The code assumes each matrix A1, ..., A6 are all square and same shape of n by n.

Note there can be overlap in the output vectors Y(:,k). Atomic add updates will be performed.

Each vector X(:,k) or Y(:,k) is conceptually of shape n^6 by 1 but during the computation can be reshaped as n^5 by n or n^4 by n^2 or n^3 by n^3 or n^2 by n^4 or n by n^5.

Note that
Y = kron(A1,...,A6) * X can be evaluated as step 1: W = reshape( X, [n^5,n]) * transpose(A1) step 2: Y = kron( A2, ..., A6) * W, which can be viewed as "tail recursion"

At the lowest level of recursion Y = kron(A1) * X is simply implemented as matrix multiply Y = A1 * X

This implementation evaluates each kronecker product in a separate thread block on GPU. Instead of building a long batch list to call batched GEMM, the matrix-matrix multiplication is evaluated as calls to device functions

kgemm_nn() to evaluate C = alpha * A * B + beta * C or kgemm_nt() to evaluate C = alpha * A * transpose(B) + beta * C

For the special case of (beta == 1), atomicAdd update is used.

Note that the GEMM operations will be performed on very slender rectangular matrices. Therefore, the computations will not be dominated by floating point operations but by data movement, especially when n is small.


To compile the code for CPU (1) mkdir build && cd build (2) cmake ../ (3) make

To compile the code for Nvidia GPU (1) mkdir build && cd build (2) cmake ../ -DUSE_GPU=1 (3) make

To run the tester for kgemm_nn_batched, perform ./test_kgemm_nn_batched

To run the tester for kgemm_nt_batched, perform ./test_kgemm_nt_batched

To run the tester for kronmult6_batched, perform ./test_kronmult6_batched