
Add tensor core to SimX #142

Closed
wants to merge 10 commits into from

Conversation

@troibe (Contributor) commented Jun 23, 2024:

I'm opening this PR to leave a few comments.

@troibe (Contributor Author) left a comment:

@vsinghania99 @nayannair I'll review the remaining files tomorrow.
Could you please give an accurate description of what should and should not work?
So far I can see that you have implemented some performance characteristics of your matrix extension; I have not yet checked whether they look sound to me.
The functional characteristics of your matrix extension, and tests for it, do not seem to be implemented.

ci/blackbox.sh Outdated
Comment on lines 51 to 52
TC_SIZE=567
TC_NUM=123
@troibe (Contributor Author):
Why were these numbers for the size and number of tensor cores picked for testing?

Collaborator:
Removing these parameters; they are not used any more.

ci/blackbox.sh Outdated
@@ -180,7 +182,7 @@ then
fi

CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"

# CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS -DTC_NUM=$TC_NUM -DTC_SIZE=$TC_SIZE $L2 $L3 $PERF_FLAG $CONFIGS"
@troibe (Contributor Author):

Your configuration is not included in the CI tests.
Also, please remove commented-out code before merging.

Collaborator:

Removed.

@@ -124,6 +124,7 @@ regression()
# test local barrier
./ci/blackbox.sh --driver=simx --app=dogfood --args="-n1 -tbar"
./ci/blackbox.sh --driver=rtlsim --app=dogfood --args="-n1 -tbar"

@troibe (Contributor Author):

Remove the empty line, or add your test here if that is what you intended to do.

Collaborator:

Added test with default values for number and size of tensor cores.

Comment on lines 114 to 120
`ifndef TC_SIZE
`define TC_SIZE 4
`endif

`ifndef TC_NUM
`define TC_NUM 1
`endif
@troibe (Contributor Author):

Why were these default values picked?

Collaborator:

Defaults changed to TC_SIZE = 4, TC_NUM = 8; the sweeps for these parameters start from these values, hence they were selected as the defaults.

`define NUM_TCU_LANES `TC_NUM
`endif

// Number of TCU units
@troibe (Contributor Author):

This probably should not be // Number of TCU units.

Collaborator:

Removed comments; added in variables above.

@@ -74,6 +74,7 @@ Emulator::Emulator(const Arch &arch, const DCRS &dcrs, Core* core)
, core_(core)
, warps_(arch.num_warps(), arch)
, barriers_(arch.num_barriers(), 0)
, scratchpad(std::vector<Word>(32 * 32 * 32768)) //Fix this : Max TC_SIZE = 32
@troibe (Contributor Author):

Either fix it or explain better what has to be done to fix it.
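As a possible direction, the hard-coded worst case could be sized from the configured tile parameters instead. A minimal sketch, assuming `TC_SIZE` arrives as a compile-time define as in the config diffs above; the `make_scratchpad` helper and its capacity formula are hypothetical, not the PR's code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#ifndef TC_SIZE
#define TC_SIZE 4  // default taken from the config diff above
#endif

// Hypothetical helper: derive scratchpad capacity from the configured
// tile size and an upper bound on tiles, rather than hard-coding
// 32 * 32 * 32768 words for the TC_SIZE = 32 worst case.
std::vector<uint32_t> make_scratchpad(std::size_t max_tiles) {
  return std::vector<uint32_t>(std::size_t(TC_SIZE) * TC_SIZE * max_tiles);
}
```

With the default `TC_SIZE` of 4, `make_scratchpad(8)` allocates 128 words instead of the fixed worst case.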

Collaborator:

Added comments.

Comment on lines 1563 to 1580
/*
// TODO : Fix needed for functional correctness
for(int tiles = 0 ; tiles < n_tiles ; tiles++) //What's the HW implication of this?? A counter implementation?
{
for (int i = 0; i < tc_size; i++) { //ROW-1
for (int j = 0; j < tc_size; j++) { //COL-2
int sum = 0;
for (int k = 0; k < tc_size; k++)
{ //COL-1
sum = sum + scratchpad[loop_offset + thread_offset*n_tiles + i * tc_size + k] *scratchpad[loop_offset + thread_offset*n_tiles + offset_b + (k * tc_size + j)];
}
scratchpad[accu_offset + thread_offset +(i * tc_size + j)] += sum; //[i * col2 + j] = sum
DP(3, "Scratchpad Index: " << accu_offset + (i * tc_size + j) << " , Value=" << scratchpad[accu_offset + (i * tc_size + j)]);
}
}
loop_offset += tc_size*tc_size; //Move to the next tiled matmul fragment
}
*/
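For reference, the functional behavior this commented-out loop appears to be aiming for is an ordinary tiled matrix multiply. A minimal standalone sketch, independent of the scratchpad layout (the function name, row-major storage, and square N x N shape are assumptions):

```cpp
#include <vector>

// Multiply two N x N row-major matrices tile by tile, accumulating
// each tc_size x tc_size output tile across the tiles of the K dimension.
std::vector<int> tiled_matmul(const std::vector<int>& A,
                              const std::vector<int>& B,
                              int N, int tc_size) {
  std::vector<int> C(N * N, 0);
  int n_tiles = N / tc_size;
  for (int ti = 0; ti < n_tiles; ++ti)        // output tile row
    for (int tj = 0; tj < n_tiles; ++tj)      // output tile column
      for (int tk = 0; tk < n_tiles; ++tk)    // accumulate over K tiles
        for (int i = 0; i < tc_size; ++i)
          for (int j = 0; j < tc_size; ++j) {
            int sum = 0;
            for (int k = 0; k < tc_size; ++k)
              sum += A[(ti * tc_size + i) * N + tk * tc_size + k] *
                     B[(tk * tc_size + k) * N + tj * tc_size + j];
            C[(ti * tc_size + i) * N + tj * tc_size + j] += sum;
          }
  return C;
}
```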
@troibe (Contributor Author):

So there is no matrix multiplication implemented?

Collaborator:

Explained in the last comment.

@troibe (Contributor Author):

The CI gives the following unused-variable warnings:

>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1521:22: warning: unused variable ‘mem_addr’ [-Wunused-variable]
>>>  1521 |             uint64_t mem_addr = (base_addr+(n*mem_bytes));
>>>       |                      ^~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1522:22: warning: unused variable ‘csr_index’ [-Wunused-variable]
>>>  1522 |             uint32_t csr_index = (2*num_data_per_thread_st) + n;
>>>       |                      ^~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1523:22: warning: unused variable ‘scratchpad_index’ [-Wunused-variable]
>>>  1523 |             uint32_t scratchpad_index = (tc_size*tc_size*2) + (t*num_data_per_thread) + n;
>>>       |                      ^~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1534:26: warning: comparison of integer expressions of different signedness: ‘int’ and ‘std::vector<unsigned int>::size_type’ {aka ‘long unsigned int’} [-Wsign-compare]
>>>  1534 |         for(int i =0 ; i < scratchpad.size(); i++)
>>>       |                        ~~^~~~~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1506:18: warning: unused variable ‘accu_offset’ [-Wunused-variable]
>>>  1506 |         uint32_t accu_offset = (n_tiles)*(n_tiles)*(n_tiles)*tc_size*tc_size*2;
>>>       |                  ^~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1547:43: warning: comparison of integer expressions of different signedness: ‘uint32_t’ {aka ‘unsigned int’} and ‘int’ [-Wsign-compare]
>>>  1547 |         for (uint32_t t = thread_start; t < num_threads_actv; ++t)
>>>       |                                         ~~^~~~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1557:22: warning: unused variable ‘thread_offset’ [-Wunused-variable]
>>>  1557 |             uint32_t thread_offset = t*(tc_size*tc_size);
>>>       |                      ^~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1558:17: warning: unused variable ‘loop_offset’ [-Wunused-variable]
>>>  1558 |             int loop_offset = 0;
>>>       |                 ^~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1559:17: warning: unused variable ‘offset_b’ [-Wunused-variable]
>>>  1559 |             int offset_b = n_tiles*n_tiles*n_tiles*tc_size*tc_size;
>>>       |                 ^~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1545:18: warning: unused variable ‘accu_offset’ [-Wunused-variable]
>>>  1545 |         uint32_t accu_offset = (n_tiles)*(n_tiles)*(n_tiles)*tc_size*tc_size*2;
>>>       |                  ^~~~~~~~~~~

It fails shortly after.
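For the -Wsign-compare warnings, the usual fix is to give the loop index the container's unsigned size type; the unused variables can simply be deleted along with the dead code that fed them. A minimal sketch of the index fix (only the `scratchpad` name is taken from the warnings above; the function itself is illustrative):

```cpp
#include <cstddef>
#include <vector>

// The warned pattern was `for (int i = 0; i < scratchpad.size(); i++)`,
// which compares a signed int against an unsigned size. Using std::size_t
// keeps both sides unsigned and silences -Wsign-compare.
unsigned sum_scratchpad(const std::vector<unsigned>& scratchpad) {
  unsigned sum = 0;
  for (std::size_t i = 0; i < scratchpad.size(); ++i)
    sum += scratchpad[i];
  return sum;
}
```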

Collaborator:

Fixed.

@@ -543,6 +554,14 @@ std::shared_ptr<Instr> Emulator::decode(uint32_t code) const {

case InstType::I: {
switch (op) {
case Opcode::TCU: {
@troibe (Contributor Author):

Which instruction does this correspond to?

Comment on lines +312 to +353
// verify result (TODO : needs to be fixed for for functional correctness)
/*
std::cout << "verify result" << std::endl;
{
int errors = 0;
auto buf_ptr = (int8_t*)staging_buf.data();
uint64_t tc_size = kernel_arg.tc_size;
std::cout << "tc_size = " << tc_size << std::endl;
int Result[matrix_size*matrix_size];
int n_tiles = (matrix_size/tc_size);
int tc_size_f = tc_size*tc_size;

//converting buf ptr (tile by tile) to CPU style linear (row by row)
for(int k = 0; k < matrix_size/tc_size; k+= 1)
{
for(int j = 0; j < matrix_size; j+= tc_size)
{
for(int i =0; i < tc_size*tc_size; i++)
{
Result[ tc_size*matrix_size*k +j+ (i/tc_size)*matrix_size +i%(tc_size)] = buf_ptr[matrix_size*tc_size*k+tc_size*j+i];
}
}
}

for (uint32_t i = 0; i < matrix_size*matrix_size; ++i) {
//int ref = i + i;
int cur = Result[i];
if (cur != refs[i]) {
++errors;
}
}
if (errors != 0) {
std::cout << "Found " << std::dec << errors << " errors!" << std::endl;
std::cout << "FAILED!" << std::endl;
return 1;
}
else
{
std::cout << "CONDITIONALLY PASSED!" << std::endl;
}
}
*/
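The index arithmetic in this commented-out block converts a tile-by-tile buffer into plain row-major order before comparison. One way to express that conversion standalone (the function name is illustrative, and it assumes each tile is stored row-major with tiles laid out in tile-row order):

```cpp
#include <vector>

// Convert an N x N matrix stored tile by tile (each tc_size x tc_size
// tile row-major, tiles laid out in tile-row order) into a plain
// row-major matrix.
std::vector<int> tiles_to_rowmajor(const std::vector<int>& tiled,
                                   int N, int tc_size) {
  std::vector<int> out(N * N);
  int idx = 0;
  for (int ti = 0; ti < N / tc_size; ++ti)    // tile row
    for (int tj = 0; tj < N / tc_size; ++tj)  // tile column
      for (int i = 0; i < tc_size; ++i)       // row within tile
        for (int j = 0; j < tc_size; ++j)     // column within tile
          out[(ti * tc_size + i) * N + tj * tc_size + j] = tiled[idx++];
  return out;
}
```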
@troibe (Contributor Author):

Again no matrix multiplication verified?

Collaborator:

Address offsets are tightly coupled with the task ID in the kernel, to better distribute and parallelise output-tile computation. The implementation and the matrix-multiplication check work for most cases, but need tuning to cover 100% of the sweep cases. Additionally, recent refactoring in the Vortex codebase updated the task-to-thread mapping, which has not yet been accommodated in the offset calculation required for correct matrix computation. The focus of the work therefore lies primarily on performance analysis.

@troibe (Contributor Author):

Could you remove the dead code and comments then?
You can move them to another branch and keep them there if you want to store them for later.

@troibe (Contributor Author) left a comment:

@vsinghania99 @nayannair Could you please go over the source code for the performance related implementation and clean it up a bit?

ci/blackbox.sh Outdated
@@ -180,7 +180,6 @@ then
fi

CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"
@troibe (Contributor Author):

Please remove the empty change:

Suggested change
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"


@troibe (Contributor Author):

What does this regression test?
It even passes if SimX crashes.

std::cout << "cleanup" << std::endl;
cleanup();

std::cout << "PASSED!" << std::endl;
@troibe (Contributor Author):

I see that you are printing the performance numbers, but are those actually used by the regression test to catch performance regressions?

5 participants