
LargeMM

A cuBLAS-CUDA based implementation of multi-GPU large matrix multiplication. It is a standalone C/C++ command-line application that launches large matrix-matrix multiplications on a GPU cluster and produces profiled outputs. Given its lightweight codebase, it can easily be turned into a C/C++ library.

Built With

  • CUDA - A parallel computing platform and programming model developed by NVIDIA for GPU-accelerated computing.
  • cuBLAS - The NVIDIA CUDA Basic Linear Algebra Subprograms (cuBLAS) library for efficient GPU-accelerated linear algebra operations.

Dependencies

The LargeMM application relies on the following dependencies:

| Dependency | Version |
| ---------- | ------- |
| CUDA       | 11.6.1+ |
| GCC        | 10.3.0+ |
| CMake      | 3.24.2+ |

CUDA modules should be loaded prior to compilation or execution.
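On a cluster that uses environment modules, loading the toolchain before building might look like the following. The exact module names and versions are assumptions; check `module avail` on your system:

```shell
# Hypothetical module loads; exact names depend on your cluster (e.g. Gadi).
module load cuda/11.6.1
module load gcc/10.3.0
module load cmake/3.24.2

# Quick sanity check that the CUDA toolchain is visible.
nvcc --version
```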

Environment

This application is designed to run on 1-4 Tesla V100 SXM2 GPUs. The default environment is a GPU node on Gadi.

Important Files

  1. The data folder stores performance data for LargeMM.

  2. The profile folder stores profiler timeline files for v2_ngpus_reduction, v1_1_n_streams, and base_cublasDgemm.

  3. The test folder stores tests for v2_ngpus_reduction, v1_1_n_streams, and base_cublasDgemm.

Installation

  1. Clone the repository into your workspace and navigate to the project directory:

    git clone https://github.com/Zlisch/LargeMM.git
    cd LargeMM
  2. Run the installation script:

    chmod +x ./INSTALL.sh
    ./INSTALL.sh

Alternatively, you can download the latest executable directly from the link.

Documentation

You can view the documentation directly in the header files of the cloned repository. Alternatively, if you are using Visual Studio Code:

  1. Install the Live Server extension in Visual Studio Code. To enable Live Server, press Cmd+Shift+P, type live server in the prompt, and select Open with Live Server.

  2. With the Live Server extension running, open http://127.0.0.1:5500/docs/html/globals.html in your browser to view the documentation.

Running the Application

After running ./INSTALL.sh, use the following command to run v2_ngpus_reduction with the lookup table on 4 GPUs and print the output.

./bin/largemm -s "-1" -m 28377 -a 2 -g 4

To run LargeMM under the NVIDIA Nsight Systems profiler, use:

nsys profile --stats=true ./bin/largemm -s "-1" -m 28377 -a 2 -g 4

Or you can build your own run script. A run script template is provided in ./run.sh.
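As a starting point for your own run script, a sweep over GPU counts can be built from the documented flags. This is a sketch, shown as a dry run that only prints each command; remove the echo to actually launch the binary:

```shell
#!/bin/sh
# Dry-run sweep over GPU counts for the v2_ngpus_reduction algorithm (-a 2),
# using the lookup table (-s "-1") to pick the stream count per GPU.
# Remove "echo" below to actually execute ./bin/largemm.
for g in 1 2 4; do
  echo ./bin/largemm -s "-1" -m 28377 -a 2 -g "$g"
done
```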

Available Options

  1. -s
  • Description: Specify the stream stride (the square root of the number of streams to use) for each GPU. If -1 is given, a lookup table is used instead to decide the number of streams for each GPU.
  • Example: Run v2_ngpus_reduction with 9 streams per GPU on 4 GPUs and print the output.

    ./bin/largemm -s 3 -m 28377 -a 2 -g 4

  2. -a
  • Description: Specify the algorithm to run.

| Value | Algorithm Version |
| ----- | ----------------- |
| 0 | base_cublasDgemm |
| 1 | v1_1_n_streams |
| 2 | v2_ngpus_reduction |
| 3 | v2_ngpus_parallel_a |
| 4 | v2_ngpus_parallel_a_n_streams_breadth |

  3. -m
  • Description: Row dimension of the (square) matrix.
  • Example: Run v2_ngpus_reduction on a square matrix of roughly 6 GiB (row dimension 28377 in double precision).

    ./bin/largemm -s 3 -m 28377 -a 2 -g 4

  4. -g
  • Description: Specify the number of GPUs to use. Cannot be zero.
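To make the flag semantics concrete: a stream stride of 3 means 3 × 3 = 9 streams per GPU, and -m 28377 with 8-byte doubles gives a matrix of 28377² × 8 bytes, just under 6 GiB. This quick check uses plain shell arithmetic and nothing project-specific:

```shell
# A stream stride of s gives s*s streams per GPU.
stride=3
streams=$((stride * stride))
echo "streams per GPU: $streams"   # prints 9

# Footprint of one -m 28377 square matrix in double precision (8 bytes/element).
m=28377
bytes=$((m * m * 8))
echo "matrix bytes: $bytes"        # prints 6442033032 (just under 6 GiB)
```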