update spgemm readme
cwpearson committed Feb 1, 2019
1 parent 43d08f1 commit b283d97
Showing 1 changed file with 8 additions and 33 deletions.
41 changes: 8 additions & 33 deletions labs/sgemm-regtiled-coarsened/README.md
@@ -1,24 +1,13 @@
# 7-point Stencil with Thread-coarsening and Register Tiling
# Matrix Multiplication with Thread Coarsening and Register Tiling

## Objective
The purpose of this lab is to practice the thread coarsening and register tiling optimization techniques using 7-point stencil as an example.
## Objective
The purpose of this lab is to practice the thread coarsening and register tiling optimization techniques using matrix-matrix multiplication as an example.

## Procedure
1. Edit the `kernel` function in `template.cu` to implement a 7-point stencil (refer to the [lecture slides](https://bw-course.ncsa.illinois.edu/mod/resource/view.php?id=574)) with combined register tiling and x-y shared memory tiling, and thread coarsening along the z-dimension.

```
out(i, j, k) = C0 *in(i, j, k)
+ C1 * ( in(i-1, j, k)
+ in(i, j-1, k)
+ in(i, j, k-1)
+ in(i+1, j, k)
+ in(i, j+1, k)
+ in(i, j, k+1) )
```
2. Edit the `launchStencil` function in `template.cu` to launch the kernel you implemented. The function should launch a 2D CUDA grid and blocks, where each thread is responsible for computing an entire column in the z-dimension.
`A0` and `Anext` in the code template correspond to `in` and `out`, respectively. The output dimension of the 7-point stencil computation is one smaller than the input dimension on both sides for all boundaries (e.g., output dimension is 6x6x6 for an input of 8x8x8). Only those "internal" elements need to be calculated.
## Procedure
\noindent \textbf{Step 1:} [Instructions on how to retrieve the new lab package.]
\\
\\
Edit the file `template.cu` to launch and implement a matrix-matrix multiplication kernel that uses thread coarsening and register tiling optimization techniques. The first input matrix has a column major layout and shall be tiled in the registers, the second input matrix has a row major layout and shall be tiled in shared memory, and the output matrix has a column major layout and shall be tiled in the registers. Macros have been provided to help you with accessing these matrices easily.
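
The layouts and tiling described above can be sketched on the CPU before writing the kernel. The following is a minimal, sequential C++ emulation, not the lab's reference solution: the tile sizes `TILE_SZ_A`, `TILE_SZ_B`, `TILE_SZ_K` and the function names are hypothetical, and the way work is split across "blocks" and "threads" is an assumption made for illustration. It keeps a per-thread accumulator array standing in for the register tile of the output, stages a small tile of the row-major second input in a local array standing in for shared memory, and reads the column-major first input one element at a time into a scalar.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes, chosen only for this illustration.
constexpr int TILE_SZ_A = 4;  // rows of C per emulated block (one row per "thread")
constexpr int TILE_SZ_B = 2;  // columns of C each thread keeps in its register tile
constexpr int TILE_SZ_K = 2;  // depth of the B tile staged in emulated shared memory

// Naive reference: C (m x n, column-major) = A (m x k, column-major) * B (k x n, row-major).
std::vector<float> sgemm_reference(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int m, int n, int k) {
    assert(A.size() == static_cast<std::size_t>(m) * k);
    assert(B.size() == static_cast<std::size_t>(k) * n);
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < k; ++i) {
                sum += A[row + i * m] * B[i * n + col];
            }
            C[row + col * m] = sum;
        }
    }
    return C;
}

// Sequential emulation of a tiled kernel structure (an assumed decomposition).
std::vector<float> sgemm_tiled_emulation(const std::vector<float>& A,
                                         const std::vector<float>& B,
                                         int m, int n, int k) {
    assert(A.size() == static_cast<std::size_t>(m) * k);
    assert(B.size() == static_cast<std::size_t>(k) * n);
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    for (int rowBase = 0; rowBase < m; rowBase += TILE_SZ_A) {      // plays the role of blockIdx.x
        for (int colBase = 0; colBase < n; colBase += TILE_SZ_B) {  // plays the role of blockIdx.y
            // One output register tile per emulated thread, zero-initialized.
            float acc[TILE_SZ_A][TILE_SZ_B] = {};
            for (int kBase = 0; kBase < k; kBase += TILE_SZ_K) {
                // Stage a tile of B (row-major) in "shared memory", padding with zeros.
                float bTile[TILE_SZ_K][TILE_SZ_B];
                for (int kk = 0; kk < TILE_SZ_K; ++kk) {
                    for (int j = 0; j < TILE_SZ_B; ++j) {
                        const int i = kBase + kk;
                        const int col = colBase + j;
                        bTile[kk][j] = (i < k && col < n) ? B[i * n + col] : 0.0f;
                    }
                }
                // Each emulated thread (threadIdx.x == t) updates its own accumulators.
                for (int t = 0; t < TILE_SZ_A; ++t) {
                    const int row = rowBase + t;
                    if (row >= m) continue;
                    for (int kk = 0; kk < TILE_SZ_K; ++kk) {
                        const int i = kBase + kk;
                        if (i >= k) break;
                        const float a = A[row + i * m];  // A element held in a "register"
                        for (int j = 0; j < TILE_SZ_B; ++j) {
                            acc[t][j] += a * bTile[kk][j];
                        }
                    }
                }
            }
            // Write the register tiles back to the column-major output.
            for (int t = 0; t < TILE_SZ_A; ++t) {
                for (int j = 0; j < TILE_SZ_B; ++j) {
                    const int row = rowBase + t;
                    const int col = colBase + j;
                    if (row < m && col < n) C[row + col * m] = acc[t][j];
                }
            }
        }
    }
    return C;
}
```

Comparing `sgemm_tiled_emulation` against `sgemm_reference` on small matrices whose dimensions are not multiples of the tile sizes is one way to check the boundary handling before porting the loop structure to a CUDA kernel.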

3. Test your code using rai

@@ -27,17 +16,3 @@ The purpose of this lab is to practice the thread coarsening and register tiling
Be sure to add any additional flags that are required by your course (`--queue` or others).

4. Submit your code on rai
## Other notes
To simplify the kernel code, you do not need to support input data with z-extent less than 2.
The data is stored in column-major order. For example, you might consider using a macro to simplify your data access indexing:
```c++
__global__ void kernel(...) {
#define A0(i, j, k) A0[((k)*ny + (j))*nx + (i)]
// your kernel code
#undef A0
}
```
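
For checking a stencil kernel's output, a host-side reference can be useful. The following is a minimal C++ sketch of the 7-point stencil formula with the same column-major indexing as the `A0` macro; the function names `idx3` and `stencil7_reference` are hypothetical, and leaving boundary elements of the output at zero is an assumption made here, since only the internal elements need to be calculated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major linear index, mirroring the A0(i, j, k) macro (hypothetical helper).
inline std::size_t idx3(int i, int j, int k, int nx, int ny) {
    return (static_cast<std::size_t>(k) * ny + j) * nx + i;
}

// CPU reference for the 7-point stencil. Only internal elements are computed;
// boundary elements of the output are left at zero (an assumption of this sketch).
std::vector<float> stencil7_reference(const std::vector<float>& in,
                                      int nx, int ny, int nz,
                                      float c0, float c1) {
    assert(in.size() == static_cast<std::size_t>(nx) * ny * nz);
    std::vector<float> out(in.size(), 0.0f);
    for (int k = 1; k < nz - 1; ++k) {
        for (int j = 1; j < ny - 1; ++j) {
            for (int i = 1; i < nx - 1; ++i) {
                const float neighbors =
                    in[idx3(i - 1, j, k, nx, ny)] + in[idx3(i + 1, j, k, nx, ny)] +
                    in[idx3(i, j - 1, k, nx, ny)] + in[idx3(i, j + 1, k, nx, ny)] +
                    in[idx3(i, j, k - 1, nx, ny)] + in[idx3(i, j, k + 1, nx, ny)];
                out[idx3(i, j, k, nx, ny)] = c0 * in[idx3(i, j, k, nx, ny)] + c1 * neighbors;
            }
        }
    }
    return out;
}
```

The innermost loop runs over `i` because that is the fastest-varying index in this layout, which keeps consecutive accesses contiguous in memory.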