-
Notifications
You must be signed in to change notification settings - Fork 281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tensor core to SimX #142
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vsinghania99 @nayannair I'll review the remaining files tomorrow.
Could you please give an accurate description of what should work and what should not work?
So far I can see that you seem to have implemented some performance characteristics of your matrix extension. I didn't not yet check whether it looks sound to me.
The functional characteristics of your matrix extension and tests of your extension don't seem to be implemented.
ci/blackbox.sh
Outdated
TC_SIZE=567 | ||
TC_NUM=123 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were these numbers for the size and number of tensor cores picked for testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing these parameters - not being used any more.
ci/blackbox.sh
Outdated
@@ -180,7 +182,7 @@ then | |||
fi | |||
|
|||
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS" | |||
|
|||
# CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS -DTC_NUM=$TC_NUM -DTC_SIZE=$TC_SIZE $L2 $L3 $PERF_FLAG $CONFIGS" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your configuration is not included in the ci tests.
Also please remove commented out code before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
ci/regression.sh.in
Outdated
@@ -124,6 +124,7 @@ regression() | |||
# test local barrier | |||
./ci/blackbox.sh --driver=simx --app=dogfood --args="-n1 -tbar" | |||
./ci/blackbox.sh --driver=rtlsim --app=dogfood --args="-n1 -tbar" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove empty line or add your test if that is what you wanted to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test with default values for number and size of tensor cores.
hw/rtl/VX_config.vh
Outdated
`ifndef TC_SIZE | ||
`define TC_SIZE 4 | ||
`endif | ||
|
||
`ifndef TC_NUM | ||
`define TC_NUM 1 | ||
`endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were these default values picked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default changed to tc size = 4, tc num = 8; sweep for these parameters start from these values and hence selected as default.
hw/rtl/VX_config.vh
Outdated
`define NUM_TCU_LANES `TC_NUM | ||
`endif | ||
|
||
// Number of TCU units |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably should not be // Number of TCU units
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed comments; added in variables above.
sim/simx/emulator.cpp
Outdated
@@ -74,6 +74,7 @@ Emulator::Emulator(const Arch &arch, const DCRS &dcrs, Core* core) | |||
, core_(core) | |||
, warps_(arch.num_warps(), arch) | |||
, barriers_(arch.num_barriers(), 0) | |||
, scratchpad(std::vector<Word>(32 * 32 * 32768)) //Fix this : Max TC_SIZE = 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either fix it or explain better what has to be done to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments.
sim/simx/execute.cpp
Outdated
/* | ||
// TODO : Fix needed for functional correctness | ||
for(int tiles = 0 ; tiles < n_tiles ; tiles++) //What's the HW implication of this?? A counter implementation? | ||
{ | ||
for (int i = 0; i < tc_size; i++) { //ROW-1 | ||
for (int j = 0; j < tc_size; j++) { //COL-2 | ||
int sum = 0; | ||
for (int k = 0; k < tc_size; k++) | ||
{ //COL-1 | ||
sum = sum + scratchpad[loop_offset + thread_offset*n_tiles + i * tc_size + k] *scratchpad[loop_offset + thread_offset*n_tiles + offset_b + (k * tc_size + j)]; | ||
} | ||
scratchpad[accu_offset + thread_offset +(i * tc_size + j)] += sum; //[i * col2 + j] = sum | ||
DP(3, "Scratchpad Index: " << accu_offset + (i * tc_size + j) << " , Value=" << scratchpad[accu_offset + (i * tc_size + j)]); | ||
} | ||
} | ||
loop_offset += tc_size*tc_size; //Move to the next tiled matmul fragment | ||
} | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there is no matrix multiplication implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explained in the last comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI gives the following unsued variable warnings:
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1521:22: warning: unused variable ‘mem_addr’ [-Wunused-variable]
>>> 1521 | uint64_t mem_addr = (base_addr+(n*mem_bytes));
>>> | ^~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1522:22: warning: unused variable ‘csr_index’ [-Wunused-variable]
>>> 1522 | uint32_t csr_index = (2*num_data_per_thread_st) + n;
>>> | ^~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1523:22: warning: unused variable ‘scratchpad_index’ [-Wunused-variable]
>>> 1523 | uint32_t scratchpad_index = (tc_size*tc_size*2) + (t*num_data_per_thread) + n;
>>> | ^~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1534:26: warning: comparison of integer expressions of different signedness: ‘int’ and ‘std::vector<unsigned int>::size_type’ {aka ‘long unsigned int’} [-Wsign-compare]
>>> 1534 | for(int i =0 ; i < scratchpad.size(); i++)
>>> | ~~^~~~~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1506:18: warning: unused variable ‘accu_offset’ [-Wunused-variable]
>>> 1506 | uint32_t accu_offset = (n_tiles)*(n_tiles)*(n_tiles)*tc_size*tc_size*2;
>>> | ^~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1547:43: warning: comparison of integer expressions of different signedness: ‘uint32_t’ {aka ‘unsigned int’} and ‘int’ [-Wsign-compare]
>>> 1547 | for (uint32_t t = thread_start; t < num_threads_actv; ++t)
>>> | ~~^~~~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1557:22: warning: unused variable ‘thread_offset’ [-Wunused-variable]
>>> 1557 | uint32_t thread_offset = t*(tc_size*tc_size);
>>> | ^~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1558:17: warning: unused variable ‘loop_offset’ [-Wunused-variable]
>>> 1558 | int loop_offset = 0;
>>> | ^~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1559:17: warning: unused variable ‘offset_b’ [-Wunused-variable]
>>> 1559 | int offset_b = n_tiles*n_tiles*n_tiles*tc_size*tc_size;
>>> | ^~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1545:18: warning: unused variable ‘accu_offset’ [-Wunused-variable]
>>> 1545 | uint32_t accu_offset = (n_tiles)*(n_tiles)*(n_tiles)*tc_size*tc_size*2;
>>> | ^~~~~~~~~~~
It fails shortly after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -543,6 +554,14 @@ std::shared_ptr<Instr> Emulator::decode(uint32_t code) const { | |||
|
|||
case InstType::I: { | |||
switch (op) { | |||
case Opcode::TCU: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which instruction does this correspond to?
// verify result (TODO : needs to be fixed for for functional correctness) | ||
/* | ||
std::cout << "verify result" << std::endl; | ||
{ | ||
int errors = 0; | ||
auto buf_ptr = (int8_t*)staging_buf.data(); | ||
uint64_t tc_size = kernel_arg.tc_size; | ||
std::cout << "tc_size = " << tc_size << std::endl; | ||
int Result[matrix_size*matrix_size]; | ||
int n_tiles = (matrix_size/tc_size); | ||
int tc_size_f = tc_size*tc_size; | ||
|
||
//converting buf ptr (tile by tile) to CPU style linear (row by row) | ||
for(int k = 0; k < matrix_size/tc_size; k+= 1) | ||
{ | ||
for(int j = 0; j < matrix_size; j+= tc_size) | ||
{ | ||
for(int i =0; i < tc_size*tc_size; i++) | ||
{ | ||
Result[ tc_size*matrix_size*k +j+ (i/tc_size)*matrix_size +i%(tc_size)] = buf_ptr[matrix_size*tc_size*k+tc_size*j+i]; | ||
} | ||
} | ||
} | ||
|
||
for (uint32_t i = 0; i < matrix_size*matrix_size; ++i) { | ||
//int ref = i + i; | ||
int cur = Result[i]; | ||
if (cur != refs[i]) { | ||
++errors; | ||
} | ||
} | ||
if (errors != 0) { | ||
std::cout << "Found " << std::dec << errors << " errors!" << std::endl; | ||
std::cout << "FAILED!" << std::endl; | ||
return 1; | ||
} | ||
else | ||
{ | ||
std::cout << "CONDITIONALLY PASSED!" << std::endl; | ||
} | ||
} | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again no matrix multiplication verified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Address offsets are tightly coupled with the task ID - in kernel - to better distribute/parallelise output tile computation. Implementation and check for matrix multiplications work for most cases. However, it needs tuning for 100% sweep cases. Additionally, with the recent refactoring in Vortex codebase, task->thread mapping was updated which has not been accommodated for in offset calculation - required for correct matrix computation. The focus of the work hence lies primarily on performance analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove the dead code and comments then?
You can move them to another branch and keep them there if you want to store them for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vsinghania99 @nayannair Could you please go over the source code for the performance related implementation and clean it up a bit?
ci/blackbox.sh
Outdated
@@ -180,7 +180,6 @@ then | |||
fi | |||
|
|||
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the empty change:
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS" | |
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS" | |
// verify result (TODO : needs to be fixed for for functional correctness) | ||
/* | ||
std::cout << "verify result" << std::endl; | ||
{ | ||
int errors = 0; | ||
auto buf_ptr = (int8_t*)staging_buf.data(); | ||
uint64_t tc_size = kernel_arg.tc_size; | ||
std::cout << "tc_size = " << tc_size << std::endl; | ||
int Result[matrix_size*matrix_size]; | ||
int n_tiles = (matrix_size/tc_size); | ||
int tc_size_f = tc_size*tc_size; | ||
|
||
//converting buf ptr (tile by tile) to CPU style linear (row by row) | ||
for(int k = 0; k < matrix_size/tc_size; k+= 1) | ||
{ | ||
for(int j = 0; j < matrix_size; j+= tc_size) | ||
{ | ||
for(int i =0; i < tc_size*tc_size; i++) | ||
{ | ||
Result[ tc_size*matrix_size*k +j+ (i/tc_size)*matrix_size +i%(tc_size)] = buf_ptr[matrix_size*tc_size*k+tc_size*j+i]; | ||
} | ||
} | ||
} | ||
|
||
for (uint32_t i = 0; i < matrix_size*matrix_size; ++i) { | ||
//int ref = i + i; | ||
int cur = Result[i]; | ||
if (cur != refs[i]) { | ||
++errors; | ||
} | ||
} | ||
if (errors != 0) { | ||
std::cout << "Found " << std::dec << errors << " errors!" << std::endl; | ||
std::cout << "FAILED!" << std::endl; | ||
return 1; | ||
} | ||
else | ||
{ | ||
std::cout << "CONDITIONALLY PASSED!" << std::endl; | ||
} | ||
} | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove the dead code and comments then?
You can move them to another branch and keep them there if you want to store them for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this regression test?
It even passes if SimX crashes.
std::cout << "cleanup" << std::endl; | ||
cleanup(); | ||
|
||
std::cout << "PASSED!" << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you are printing the performance numbers but are those used in the regression test for performance regression?
I'm opening this PR to leave a few comments.