
An Alternative SGEMM (DGEMM) on VEGA (MI25, MI50, MI60) to Verify Power by LDS, SGPR, and Data Forwarding

 

1 Legacy DGEMM implementation

https://github.com/NervanaSystems/maxas/wiki/SGEMM explains SGEMM on the Maxwell architecture in detail. Most SGEMM/DGEMM implementations on GPUs use similar algorithms. At the top level, the legacy SGEMM/DGEMM is implemented as follows:

  • Use a work group size of (64,1,1).
  • Each work group computes the region of matrix C from (m, n) to (m+63, n+63), which we call the 64x64 macro tile of the work group. Only the 64x64 macro tile size is discussed in this example.
  • Each work group loads a 64 x K block of matrix A and a K x 64 block of matrix B, and performs 64 x K x 64 FMA operations.
  • Matrix A and matrix B are loaded into LDS.
  • For SGEMM, every thread multiplies an 8 x K block of matrix A by a K x 8 block of matrix B, so every thread computes an 8x8 micro tile of matrix C.
  • For each work group: matrix A is read 8 x K times from LDS.
  • For each work group: matrix B is read 8 x K times from LDS.
  • For each work group: matrix A is read 64 x K x 64 times from VGPRs.
  • For each work group: matrix B is read 64 x K x 64 times from VGPRs.
  • For each work group: matrix C is read and written 64 x K x 64 times in VGPRs.

 

Memory reads and writes account for a large share of total power consumption. SGEMM/DGEMM computing involves the following memory accesses on modern GPUs:

  • External Video memory Read from GDDR or HBM to L2 Cache
  • From L2 Cache to L1 Cache
  • From L1 Cache to LDS
  • From LDS to VGPR
  • FMA reads VGPRs only for matrix Sum

 

In general, LDS and VGPR accesses account for almost 50% of the total energy of SGEMM/DGEMM.

 

2 Very Low Power (VLP) SGEMM/DGEMM Algorithm

2.1 Macro Tile per Workgroup and Micro-Tile per Thread

The VLP SGEMM uses work group size 64 for macro tile M=64, N=64.

The workgroup size of 128 uses macro tile size M=64, N = 128. 

The workgroup size of 256 uses macro tile size M=64,  N = 256. 

The micro tile size for each thread is M=64 and N=1. Each thread multiplies a 64 x K block of matrix A by a K x 1 column of matrix B, resulting in a 64 x 1 column of matrix C.

Across the 64 threads, the matrix C addresses are contiguous for each M.

In this paper, the algorithm is based on macro tile size M=64 and N=64 unless noted otherwise.

To make the best use of matrix A as SQC constants:

  • hipBlockIdx_x =  N/64
  • hipBlockIdx_y = M/64

 

2.2  Matrix A  Base Offset Per Wave

Every  block has one base address for its Matrix A. 

matrix_A_base_offset  = hipBlockIdx_y *  64  * lda;

 

2.3 Matrix B Base Offset per Wave

Every  block has one base address for its Matrix B. 

matrix_B_base_offset  = hipBlockIdx_x *  64  * ldb;

2.4 Matrix A’s Offset for Each K

matrix_A_koffset = k * sizeof (float)

The algorithm reads matrix A's data with the assembly instruction “s_load_dwordx8”:

s_load_dwordx8 s[32:39], s[12:13], s18

The AMD GCN architecture has 96 available SGPRs. This algorithm uses SGPRs s32 to s95, so only 64 SGPRs are available to hold matrix A's data.

Each group of eight s_load_dwordx8 instructions reads 64 values: 8 rows of M by 8 steps of K. The algorithm uses 8 such groups to cover all 64 rows of M.

 

The AMD GCN architecture does not support in-order return of s_load_dword, so this algorithm does not double-buffer the loading of matrix A.

We postpone the performance analysis of the limited number of SGPRs and of the latency left unhidden by out-of-order SGPR return.

 

2.5 Double Buffer Prefetch of Matrix B

Each thread uses micro-tile size M=64, N=1. Each thread needs 8 VGPRs to load the 8 K values of its single N. The algorithm uses global_load_dwordx4 to get the best cache line hit rate: the next memory read instruction reads the next 4 DWORDs of the same cache line.

global_load_dwordx4 v[68:71], v[2:3], s[20:21]

s_add_u32 s20, s20, 16                       

s_addc_u32 s21, s21, 0                       

global_load_dwordx4 v[72:75], v[2:3], s[20:21]

s_add_u32 s20, s20, 16                       

s_addc_u32 s21, s21, 0                       

Double buffering hides latency better. It needs 16 VGPRs: two buffers of 8 VGPRs each.

2.6 VGPRs Allocations

Every thread needs v[2:3] for matrix B's per-thread offset.

Double-buffered loading of matrix B needs 16 VGPRs.

The 64 values of M need 64 accumulator VGPRs.

Including one VGPR for hipThreadIdx_x, the total is 16 + 64 + 2 + 1 = 83 VGPRs.

83 VGPRs allows 3 waves per SIMD, or 3 workgroups per CU, which is enough for good performance.

 

2.7 NO LDS Operation At All

2.8 No Barrier At All

2.9 FMA with SGPR source and Data Forwarding to Saving SGPR

Modern GPUs usually have a constant cache that is independent of the texture/buffer L1 cache. SIMD FMA instructions allow one operand to come from constant data. The AMD GCN architecture goes further and promotes constants into scalar GPRs: data from the constant cache can be stored in SGPRs. The GCN FMA instruction supports an SGPR operand with the following syntax:

v_fma_f32 v4, v68, s32, v4

v_fma_f32 v4, v69, s33, v4

v_fma_f32 v4, v70, s34, v4

v_fma_f32 v4, v71, s35, v4

v_fma_f32 v4, v72, s36, v4

v_fma_f32 v4, v73, s37, v4

v_fma_f32 v4, v74, s38, v4

v_fma_f32 v4, v75, s39, v4

 

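Using an SGPR as one source operand of v_fma_f32 (or v_fma_f64) means 25% fewer VGPR read/write accesses per FMA: of the four VGPR accesses (three source reads and one destination write), one is replaced by an SGPR read. In other words, it is possible to save 25% of the dynamic power of VGPR access.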

2.10 Matrix C Address

Matrix C address is very similar to Matrix B since every thread has different N value.

2.11 Theoretical Comparison of VGPR/L1 Cache/LDS Access

The following table gives an example for macro tile size M=64, N=256. The new SGEMM algorithm reduces VGPR reads and writes by about 70% (from 68416 to 20864) through SQC constant loading and data forwarding of the accumulator.

| Costs for Matrix Multiply 64x1x256 (unit in FP64) | Legacy (LDS) | SQC (Non-LDS) |
|---|---|---|
| Matrix A L2-L1 | 64 | 64 |
| Matrix A VGPR Write | 576 | 64 |
| Matrix A VGPR Read | 16384 | 64 |
| Matrix A LDS Write | 64 | 0 |
| Matrix A LDS Read | 512 | 0 |
| Matrix B L2-L1 | 256 | 256 |
| Matrix B L1 Read | 256 | 256 |
| Matrix B VGPR Write | 2304 | 256 |
| Matrix B VGPR Read | 16384 | 16384 |
| Matrix B LDS Write | 256 | 0 |
| Matrix B LDS Read | 2304 | 0 |
| Matrix C VGPR Read/Write | 32768 | 4096 |
| SUM-L2-L1 | 320 | 320 |
| SUM-L1-Read | 320 | 320 |
| VGPR Read/Write | 68416 | 20864 |
| LDS Read/Write | 3136 | 0 |
| Barrier | 1 | 0 |

 

 

However, several limits prevent this kernel from achieving more than 78% of peak performance on the AMD GCN architecture.

  • AMD GCN supports only 96 SGPRs per program. This limit prevents the SGEMM kernel from double-buffering the loading of matrix A.
  • AMD GCN returns constants out of order. The SGEMM kernel has to use “s_waitcnt lgkmcnt(0)” to avoid reading data that has not yet arrived, which makes latency hiding very hard.

3 Benchmark 

3.1 Performance Testing of SGEMM_64x256

 

The following results were measured on an MI60 at different GPU engine frequencies, with the memory frequency fixed at 800 MHz.

| K=640 | GFX 1700 MHz | GFX 1500 MHz | GFX 1300 MHz | GFX 1100 MHz |
|---|---|---|---|---|
| M=N=256 | 0.423 | 0.378 | 0.329 | 0.282 |
| M=N=512 | 1.125 | 1.052 | 1.033 | 0.896 |
| M=N=768 | 2.458 | 2.264 | 2.092 | 1.853 |
| M=N=1024 | 4.368 | 3.903 | 3.622 | 3.331 |
| M=N=1280 | 5.687 | 5.213 | 4.753 | 4.241 |
| M=N=1536 | 7.058 | 6.435 | 5.739 | 4.995 |
| M=N=1792 | 6.493 | 5.972 | 5.463 | 4.807 |
| M=N=2048 | 8.13 | 7.448 | 6.797 | 6.047 |
| M=N=2304 | 8.366 | 7.63 | 6.828 | 5.95 |
| M=N=2560 | 8.561 | 7.856 | 7.11 | 6.226 |
| M=N=2816 | 9.35 | 8.558 | 7.711 | 6.741 |
| M=N=3072 | 9.825 | 8.918 | 8.048 | 7.071 |
| M=N=3328 | 9.758 | 8.896 | 8.026 | 6.998 |
| M=N=3584 | 9.66 | 8.875 | 7.966 | 6.968 |
| M=N=3840 | 9.868 | 9.002 | 8.139 | 7.089 |
| M=N=4096 | 9.954 | 9.145 | 8.226 | 7.185 |
| M=N=4352 | 9.821 | 9.07 | 8.192 | 7.229 |
| M=N=4608 | 9.8 | 9.074 | 8.203 | 7.245 |
| M=N=4864 | 9.856 | 9.088 | 8.252 | 7.258 |
| M=N=5120 | 9.781 | 9.088 | 8.228 | 7.281 |
| M=N=5376 | 9.76 | 9.101 | 8.285 | 7.304 |
| M=N=5632 | 9.8 | 9.122 | 8.285 | 7.346 |
| M=N=5888 | 9.737 | 9.13 | 8.37 | 7.372 |
| M=N=6144 | 9.678 | 9.092 | 8.302 | 7.347 |
| M=N=6400 | 9.672 | 9.121 | 8.328 | 7.383 |
| M=N=6656 | 9.674 | 9.173 | 8.343 | 7.414 |
| M=N=6912 | 9.684 | 9.166 | 8.375 | 7.408 |
| M=N=7168 | 9.638 | 9.18 | 8.359 | 7.413 |
| M=N=7424 | 9.657 | 9.155 | 8.377 | 7.452 |
| M=N=7680 | 9.655 | 9.16 | 8.4 | 7.444 |
| M=N=7936 | 9.67 | 9.168 | 8.398 | 7.466 |
| M=N=8192 | 9.61 | 9.133 | 8.414 | 7.42 |
| M=N=8448 | 9.666 | 9.211 | 8.413 | 7.489 |
| M=N=8704 | 9.662 | 9.236 | 8.417 | 7.465 |
| M=N=8960 | 9.651 | 9.217 | 8.471 | 7.511 |
| M=N=9216 | 9.608 | 9.199 | 8.459 | 7.477 |
| M=N=9472 | 9.643 | 9.234 | 8.454 | 7.509 |
| M=N=9728 | 9.689 | 9.227 | 8.449 | 7.527 |
| M=N=9984 | 9.682 | 9.258 | 8.484 | 7.517 |
| M=N=10240 | 9.605 | 9.258 | 8.453 | 7.498 |
| M=N=10496 | 9.716 | 9.297 | 8.493 | 7.518 |
| M=N=10752 | 9.664 | 9.299 | 8.523 | 7.539 |
| M=N=11008 | 9.672 | 9.299 | 8.521 | 7.537 |
| M=N=11264 | 9.62 | 9.253 | 8.517 | 7.527 |
| M=N=11520 | 9.672 | 9.297 | 8.5 | 7.532 |
| M=N=11776 | 9.652 | 9.275 | 8.497 | 7.548 |
| M=N=12032 | 9.675 | 9.318 | 8.515 | 7.534 |
| M=N=12288 | 9.634 | 9.277 | 8.493 | 7.521 |
| M=N=12544 | 9.681 | 9.339 | 8.531 | 7.556 |
| M=N=12800 | 9.675 | 9.326 | 8.524 | 7.553 |
| M=N=13056 | 9.675 | 9.362 | 8.54 | 7.567 |
| M=N=13312 | 9.666 | 9.344 | 8.57 | 7.581 |
| M=N=13568 | 9.698 | 9.403 | 8.552 | 7.556 |
| M=N=13824 | 9.714 | 9.392 | 8.565 | 7.581 |
| M=N=14080 | 9.703 | 9.429 | 8.57 | 7.591 |
| M=N=14336 | 9.604 | 9.353 | 8.559 | 7.58 |
| M=N=14592 | 9.674 | 9.391 | 8.558 | 7.605 |
| M=N=14848 | 9.657 | 9.312 | 8.545 | 7.587 |
| M=N=15104 | 9.601 | 9.266 | 8.495 | 7.535 |
| M=N=15360 | 9.61 | 9.322 | 8.499 | 7.516 |
| M=N=15616 | 9.661 | 9.351 | 8.541 | 7.554 |
| M=N=15872 | 9.663 | 9.363 | 8.562 | 7.591 |
| M=N=16128 | 9.71 | 9.426 | 8.575 | 7.583 |
| M=N=16384 | 9.532 | 9.228 | 8.508 | 7.532 |

 

3.2 Power Testing

Non-workload power = 42 watts at GFX 1700 MHz:

  • Data Forwarding:

    • M=N=4096, K=640, Max Power = 265 watts, with 9.5T
  • No Forwarding:

    • M=N=4096, K=640, Max Power = 284 watts, with 9.18T

Non-workload power = 36 watts at GFX 1500 MHz:

  • Data Forwarding:

    • M=N=4096, K=640, Max Power = 223 watts, with 9.132T
  • No Forwarding:

    • M=N=4096, K=640, Max Power = 240 watts, with 8.986T

 

4 Run the test

4.1 Run the test

Hardware:  MI60/MI50

Software: ROCm

Command Line to Build the test:

hipcc sgemm_sqc_test.cpp -o sgemm_sqc_test.exe

Command line to run the test:

./sgemm_sqc_test.exe <M> <N> <K> 64 256 <iterations=10> <verify=0>

For example:

./sgemm_sqc_test.exe 16384 16384 640 64 256 10 0

4.2 Source Code

The GCN assembly is written as inline assembly in sgemm_64x256_sqc.cpp.

Command line to compile sgemm_64x256_sqc.cpp:

hipcc sgemm-64x256-sqc.cpp -o sgemm-64x256-sqc.out

Extract the kernel with the following command line, which generates sgemm-64x256-sqc.out-000-gfx906.isa:

extractkernel -i sgemm-64x256-sqc.out

Extract the correct kernel from sgemm-64x256-sqc.out-000-gfx906.isa and fill into sgemm_64x256_sqc.s. 

Compile sgemm_64x256_sqc.s into an LLVM code object:

/opt/rocm/hcc/bin/clang -x assembler -target amdgcn--amdhsa -mcpu=gfx906 -mno-code-object-v3 sgemm_64x256_sqc.s -o sgemm_sqc.co