
Add tensor core to SimX #142

Closed
wants to merge 10 commits into from

Conversation

@troibe (Contributor) commented Jun 23, 2024:

I'm opening this PR to leave a few comments.

@troibe (Contributor Author) left a comment:

@vsinghania99 @nayannair I'll review the remaining files tomorrow.
Could you please give an accurate description of what should and should not work?
So far I can see that you have implemented some performance characteristics of your matrix extension; I have not yet checked whether they look sound to me.
The functional characteristics of your matrix extension, and tests for it, do not seem to be implemented.

ci/blackbox.sh Outdated
Comment on lines 51 to 52
TC_SIZE=567
TC_NUM=123
@troibe (Contributor Author):
Why were these numbers for the size and number of tensor cores picked for testing?

Collaborator:
Removing these parameters; they are not used any more.

ci/blackbox.sh Outdated
@@ -180,7 +182,7 @@ then
fi

CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"

# CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS -DTC_NUM=$TC_NUM -DTC_SIZE=$TC_SIZE $L2 $L3 $PERF_FLAG $CONFIGS"
@troibe (Contributor Author):

Your configuration is not included in the CI tests.
Also, please remove commented-out code before merging.

Collaborator:

Removed.

@@ -124,6 +124,7 @@ regression()
# test local barrier
./ci/blackbox.sh --driver=simx --app=dogfood --args="-n1 -tbar"
./ci/blackbox.sh --driver=rtlsim --app=dogfood --args="-n1 -tbar"

@troibe (Contributor Author):

Remove the empty line, or add your test here if that is what you intended to do.

Collaborator:

Added test with default values for number and size of tensor cores.

Comment on lines 114 to 120
`ifndef TC_SIZE
`define TC_SIZE 4
`endif

`ifndef TC_NUM
`define TC_NUM 1
`endif
@troibe (Contributor Author):

Why were these default values picked?

Collaborator:

Defaults changed to TC_SIZE = 4, TC_NUM = 8; the sweeps for these parameters start from these values, hence they were selected as the defaults.

`define NUM_TCU_LANES `TC_NUM
`endif

// Number of TCU units
@troibe (Contributor Author):

This probably should not be // Number of TCU units.

Collaborator:

Removed comments; added in variables above.

@@ -74,6 +74,7 @@ Emulator::Emulator(const Arch &arch, const DCRS &dcrs, Core* core)
, core_(core)
, warps_(arch.num_warps(), arch)
, barriers_(arch.num_barriers(), 0)
, scratchpad(std::vector<Word>(32 * 32 * 32768)) //Fix this : Max TC_SIZE = 32
@troibe (Contributor Author):

Either fix it or explain better what has to be done to fix it.
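As a possible direction, the hard-coded worst case could be sized from the configured tile parameters instead. A minimal sketch, assuming `TC_SIZE` arrives as a compile-time define as in the config diffs above; the `make_scratchpad` helper and its capacity formula are hypothetical, not the PR's code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#ifndef TC_SIZE
#define TC_SIZE 4  // default taken from the config diff above
#endif

// Hypothetical helper: derive scratchpad capacity from the configured
// tile size and an upper bound on tiles, rather than hard-coding
// 32 * 32 * 32768 words for the TC_SIZE = 32 worst case.
std::vector<uint32_t> make_scratchpad(std::size_t max_tiles) {
  return std::vector<uint32_t>(std::size_t(TC_SIZE) * TC_SIZE * max_tiles);
}
```

With the default `TC_SIZE` of 4, `make_scratchpad(8)` allocates 128 words instead of the fixed worst case.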

Collaborator:

Added comments.

Comment on lines 1563 to 1580
/*
// TODO : Fix needed for functional correctness
for(int tiles = 0 ; tiles < n_tiles ; tiles++) //What's the HW implication of this?? A counter implementation?
{
for (int i = 0; i < tc_size; i++) { //ROW-1
for (int j = 0; j < tc_size; j++) { //COL-2
int sum = 0;
for (int k = 0; k < tc_size; k++)
{ //COL-1
sum = sum + scratchpad[loop_offset + thread_offset*n_tiles + i * tc_size + k] *scratchpad[loop_offset + thread_offset*n_tiles + offset_b + (k * tc_size + j)];
}
scratchpad[accu_offset + thread_offset +(i * tc_size + j)] += sum; //[i * col2 + j] = sum
DP(3, "Scratchpad Index: " << accu_offset + (i * tc_size + j) << " , Value=" << scratchpad[accu_offset + (i * tc_size + j)]);
}
}
loop_offset += tc_size*tc_size; //Move to the next tiled matmul fragment
}
*/
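For reference, the functional behavior this commented-out loop appears to be aiming for is an ordinary tiled matrix multiply. A minimal standalone sketch, independent of the scratchpad layout (the function name, row-major storage, and square N x N shape are assumptions):

```cpp
#include <vector>

// Multiply two N x N row-major matrices tile by tile, accumulating
// each tc_size x tc_size output tile across the tiles of the K dimension.
std::vector<int> tiled_matmul(const std::vector<int>& A,
                              const std::vector<int>& B,
                              int N, int tc_size) {
  std::vector<int> C(N * N, 0);
  int n_tiles = N / tc_size;
  for (int ti = 0; ti < n_tiles; ++ti)        // output tile row
    for (int tj = 0; tj < n_tiles; ++tj)      // output tile column
      for (int tk = 0; tk < n_tiles; ++tk)    // accumulate over K tiles
        for (int i = 0; i < tc_size; ++i)
          for (int j = 0; j < tc_size; ++j) {
            int sum = 0;
            for (int k = 0; k < tc_size; ++k)
              sum += A[(ti * tc_size + i) * N + tk * tc_size + k] *
                     B[(tk * tc_size + k) * N + tj * tc_size + j];
            C[(ti * tc_size + i) * N + tj * tc_size + j] += sum;
          }
  return C;
}
```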
@troibe (Contributor Author):

So there is no matrix multiplication implemented?

Collaborator:

Explained in the last comment.

@troibe (Contributor Author):

The CI gives the following unused-variable warnings:

>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1521:22: warning: unused variable ‘mem_addr’ [-Wunused-variable]
>>>  1521 |             uint64_t mem_addr = (base_addr+(n*mem_bytes));
>>>       |                      ^~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1522:22: warning: unused variable ‘csr_index’ [-Wunused-variable]
>>>  1522 |             uint32_t csr_index = (2*num_data_per_thread_st) + n;
>>>       |                      ^~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1523:22: warning: unused variable ‘scratchpad_index’ [-Wunused-variable]
>>>  1523 |             uint32_t scratchpad_index = (tc_size*tc_size*2) + (t*num_data_per_thread) + n;
>>>       |                      ^~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1534:26: warning: comparison of integer expressions of different signedness: ‘int’ and ‘std::vector<unsigned int>::size_type’ {aka ‘long unsigned int’} [-Wsign-compare]
>>>  1534 |         for(int i =0 ; i < scratchpad.size(); i++)
>>>       |                        ~~^~~~~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1506:18: warning: unused variable ‘accu_offset’ [-Wunused-variable]
>>>  1506 |         uint32_t accu_offset = (n_tiles)*(n_tiles)*(n_tiles)*tc_size*tc_size*2;
>>>       |                  ^~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1547:43: warning: comparison of integer expressions of different signedness: ‘uint32_t’ {aka ‘unsigned int’} and ‘int’ [-Wsign-compare]
>>>  1547 |         for (uint32_t t = thread_start; t < num_threads_actv; ++t)
>>>       |                                         ~~^~~~~~~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1557:22: warning: unused variable ‘thread_offset’ [-Wunused-variable]
>>>  1557 |             uint32_t thread_offset = t*(tc_size*tc_size);
>>>       |                      ^~~~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1558:17: warning: unused variable ‘loop_offset’ [-Wunused-variable]
>>>  1558 |             int loop_offset = 0;
>>>       |                 ^~~~~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1559:17: warning: unused variable ‘offset_b’ [-Wunused-variable]
>>>  1559 |             int offset_b = n_tiles*n_tiles*n_tiles*tc_size*tc_size;
>>>       |                 ^~~~~~~~
>>> /home/travis/build/vortexgpgpu/vortex/sim/simx/execute.cpp:1545:18: warning: unused variable ‘accu_offset’ [-Wunused-variable]
>>>  1545 |         uint32_t accu_offset = (n_tiles)*(n_tiles)*(n_tiles)*tc_size*tc_size*2;
>>>       |                  ^~~~~~~~~~~

It fails shortly after.
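For the -Wsign-compare warnings, the usual fix is to give the loop index the container's unsigned size type; the unused variables can simply be deleted along with the dead code that fed them. A minimal sketch of the index fix (only the `scratchpad` name is taken from the warnings above; the function itself is illustrative):

```cpp
#include <cstddef>
#include <vector>

// The warned pattern was `for (int i = 0; i < scratchpad.size(); i++)`,
// which compares a signed int against an unsigned size. Using std::size_t
// keeps both sides unsigned and silences -Wsign-compare.
unsigned sum_scratchpad(const std::vector<unsigned>& scratchpad) {
  unsigned sum = 0;
  for (std::size_t i = 0; i < scratchpad.size(); ++i)
    sum += scratchpad[i];
  return sum;
}
```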

Collaborator:

Fixed.

@@ -543,6 +554,14 @@ std::shared_ptr<Instr> Emulator::decode(uint32_t code) const {

case InstType::I: {
switch (op) {
case Opcode::TCU: {
@troibe (Contributor Author):

Which instruction does this correspond to?

Comment on lines +312 to +353
// verify result (TODO : needs to be fixed for for functional correctness)
/*
std::cout << "verify result" << std::endl;
{
int errors = 0;
auto buf_ptr = (int8_t*)staging_buf.data();
uint64_t tc_size = kernel_arg.tc_size;
std::cout << "tc_size = " << tc_size << std::endl;
int Result[matrix_size*matrix_size];
int n_tiles = (matrix_size/tc_size);
int tc_size_f = tc_size*tc_size;

//converting buf ptr (tile by tile) to CPU style linear (row by row)
for(int k = 0; k < matrix_size/tc_size; k+= 1)
{
for(int j = 0; j < matrix_size; j+= tc_size)
{
for(int i =0; i < tc_size*tc_size; i++)
{
Result[ tc_size*matrix_size*k +j+ (i/tc_size)*matrix_size +i%(tc_size)] = buf_ptr[matrix_size*tc_size*k+tc_size*j+i];
}
}
}

for (uint32_t i = 0; i < matrix_size*matrix_size; ++i) {
//int ref = i + i;
int cur = Result[i];
if (cur != refs[i]) {
++errors;
}
}
if (errors != 0) {
std::cout << "Found " << std::dec << errors << " errors!" << std::endl;
std::cout << "FAILED!" << std::endl;
return 1;
}
else
{
std::cout << "CONDITIONALLY PASSED!" << std::endl;
}
}
*/
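The index arithmetic in this commented-out block converts a tile-by-tile buffer into plain row-major order before comparison. One way to express that conversion standalone (the function name is illustrative, and it assumes each tile is stored row-major with tiles laid out in tile-row order):

```cpp
#include <vector>

// Convert an N x N matrix stored tile by tile (each tc_size x tc_size
// tile row-major, tiles laid out in tile-row order) into a plain
// row-major matrix.
std::vector<int> tiles_to_rowmajor(const std::vector<int>& tiled,
                                   int N, int tc_size) {
  std::vector<int> out(N * N);
  int idx = 0;
  for (int ti = 0; ti < N / tc_size; ++ti)    // tile row
    for (int tj = 0; tj < N / tc_size; ++tj)  // tile column
      for (int i = 0; i < tc_size; ++i)       // row within tile
        for (int j = 0; j < tc_size; ++j)     // column within tile
          out[(ti * tc_size + i) * N + tj * tc_size + j] = tiled[idx++];
  return out;
}
```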
@troibe (Contributor Author):

Again no matrix multiplication verified?

Collaborator:

Address offsets are tightly coupled with the task ID in the kernel, to better distribute and parallelise output-tile computation. The implementation and the matrix-multiplication check work for most cases, but need tuning to cover 100% of the sweep cases. Additionally, recent refactoring in the Vortex codebase updated the task-to-thread mapping, which has not yet been accommodated in the offset calculation required for correct matrix computation. The focus of the work therefore lies primarily on performance analysis.

@troibe (Contributor Author):

Could you remove the dead code and comments then?
You can move them to another branch and keep them there if you want to store them for later.

@troibe (Contributor Author) left a comment:

@vsinghania99 @nayannair Could you please go over the source code for the performance related implementation and clean it up a bit?

ci/blackbox.sh Outdated
@@ -180,7 +180,6 @@ then
fi

CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"
@troibe (Contributor Author):

Please remove the empty change:

Suggested change
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"
CONFIGS="-DNUM_CLUSTERS=$CLUSTERS -DNUM_CORES=$CORES -DNUM_WARPS=$WARPS -DNUM_THREADS=$THREADS $L2 $L3 $PERF_FLAG $CONFIGS"


@troibe (Contributor Author):

What does this regression test?
It even passes if SimX crashes.

std::cout << "cleanup" << std::endl;
cleanup();

std::cout << "PASSED!" << std::endl;
@troibe (Contributor Author):

I see that you are printing the performance numbers, but are those actually used by the regression test to catch performance regressions?

5 participants