<!DOCTYPE html>
<html>
<head>
<meta charset='utf-8'>
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<meta name="description" content="ROCm, A New Era in Open GPU Computing : Platform for GPU Enabled HPC and UltraScale Computing">
<link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">
<title>ROCm, A New Era in Open GPU Computing</title>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-83146017-1', 'auto');
ga('send', 'pageview');
</script>
</head>
<body>
<!-- HEADER -->
<div id="header_wrap" class="outer">
<header class="inner">
<a id="forkme_banner" href="https://github.com/RadeonOpenCompute">View on GitHub</a>
<img class="wrap" src="images/ROCm_Logo_128.png" alt="ROCm_Logo" />
<h2 id="project_title">ROCm, A New Era in Open GPU Computing</h2>
<h3 id="project_tagline">Platform for GPU Enabled HPC and UltraScale Computing </h3>
</header>
</div>
<div id="nav">
<div id="nbar">
<ul>
<li><a href="index.html">Overview</a></li>
<li><a href="install.html">Install</a></li>
<li><a href="languages.html">Languages</a></li>
<li><a href="documentation.html">Documentation</a></li>
<li><a href="packages.html">ROCm Solutions </a></li>
<li><a href="tutorials.html">Tutorials</a></li>
<li><a href="hardware.html">ROCm Hardware</a></li>
<li><a href="blog.html">ROCm BLOG</a></li>
</ul>
</div>
</div>
<!-- MAIN CONTENT -->
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
<h3>The Art of AMDGCN Assembly: How to Bend the Machine to Your Will</h3>
The ability to write code in assembly is essential to achieving the best performance for a GPU program. In a <a href="http://gpuopen.com/rocm-with-harmony-combining-opencl-hcc-hsa-in-a-single-program/" target="_blank">previous blog post</a> we described how to combine several languages in a single program using ROCm and Hsaco. This article explains how to produce Hsaco from assembly code and also takes a closer look at some new features of the GCN architecture. I'd like to thank Ilya Perminov of Luxoft for co-authoring this blog post.
Programs written for GPUs should achieve the highest performance possible. Even carefully written ones, however, won’t always employ 100% of the GPU’s capabilities. Some reasons are the following:
<ul>
<li>The program may be written in a high level language that does not expose all of the features available on the hardware.</li>
<li>The compiler is unable to produce optimal ISA code, either because it must ‘play it safe’ while adhering to the semantics of the language or because it simply generates unoptimized code.</li>
</ul>
Consider a program that uses one of GCN’s new features (source code is available on <a href="https://github.com/RadeonOpenCompute/LLVM-AMDGPU-Assembler-Extra" target="_blank">GitHub</a>). Recent hardware architecture updates, namely the DPP and DS Permute instructions, enable efficient data sharing between wavefront lanes. To become more familiar with the instruction set, review the <a href="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf" target="_blank">GCN ISA Reference Guide</a>.
Note: the assembler is currently experimental; some of the syntax we describe may change.
<h2>DS Permute Instructions</h2>
Two new instructions, <code>ds_permute_b32</code> and <code>ds_bpermute_b32</code>, allow VGPR data to move between lanes on the basis of an index from another VGPR. These instructions use the LDS hardware to route data between the 64 lanes of a wavefront, but they don’t write to LDS memory. The difference between them is whether the index selects the destination lane or the source lane: <code>ds_permute_b32</code> says “put my lane data in lane i,” and <code>ds_bpermute_b32</code> says “read data from lane i.” The <a href="http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf" target="_blank">GCN ISA Reference Guide</a> provides a more formal description.
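To make the semantics concrete, here is a scalar C++ model of what each instruction does across one wavefront. This is only a sketch: it assumes in-range indices, ignores the LDS offset field, and uses illustrative function names. Note that the index operand is a byte address, so lane i is selected by the value 4*i; this is why the kernel below shifts the index left by 2 before the permute.
<codebox>
<pre>#include &lt;cstdint&gt;

constexpr int kWavefrontSize = 64;

// ds_bpermute_b32 dst, addr, src -- "read data from lane i":
// each lane reads from the lane named by its own index.
void backward_permute(uint32_t dst[], const uint32_t addr[], const uint32_t src[]) {
    for (int lane = 0; lane &lt; kWavefrontSize; ++lane)
        dst[lane] = src[addr[lane] / 4];   // addr is a byte offset
}

// ds_permute_b32 dst, addr, src -- "put my lane data in lane i":
// each lane writes to the lane named by its own index.
void forward_permute(uint32_t dst[], const uint32_t addr[], const uint32_t src[]) {
    for (int lane = 0; lane &lt; kWavefrontSize; ++lane)
        dst[addr[lane] / 4] = src[lane];
}
</pre>
</codebox>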
The test kernel is simple: read the initial data and indices from memory into GPRs, do the permutation in the GPRs and write the data back to memory. An analogous OpenCL kernel would have this form:
<codebox>
<pre>__kernel void hello_world(__global const uint * in, __global const uint * index, __global uint * out)
{
    size_t i = get_global_id(0);
    out[i] = in[ index[i] ];
}
</pre>
</codebox>
<h3>Passing Parameters to a Kernel</h3>
Formal HSA arguments are passed to a kernel using a special read-only memory segment called <code>kernarg</code>. Before a wavefront starts, the base address of the <code>kernarg</code> segment is written to an SGPR pair. The memory layout of variables in <code>kernarg</code> must employ the same order as the list of kernel formal arguments, starting at offset 0, with no padding between variables—except to honor the requirements of natural alignment and any align qualifier.
The example host program must create the <code>kernarg</code> segment and fill it with the buffer base addresses. The HSA host code might look like the following:
<codebox>
<pre>/*
* This is the host-side representation of the kernel arguments that the simplePermute kernel expects.
*/
struct simplePermute_args_t {
    uint32_t * in;
    uint32_t * index;
    uint32_t * out;
};
/*
* Allocate the kernel-argument buffer from the correct region.
*/
hsa_status_t status;
simplePermute_args_t * args = NULL;
status = hsa_memory_allocate(kernarg_region, sizeof(simplePermute_args_t), (void**)(&args));
assert(HSA_STATUS_SUCCESS == status);
aql->kernarg_address = args;
/*
* Write the args directly to the kernarg buffer;
* the code assumes that memory is already allocated for the
* buffers that in_ptr, index_ptr and out_ptr point to
*/
args->in = in_ptr;
args->index = index_ptr;
args->out = out_ptr;
</pre>
</codebox>
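The layout rules described above pin down the offsets exactly: three 64-bit pointers land at offsets 0, 8 and 16, for a total of 24 bytes. A compile-time check (a sketch for a 64-bit host, mirroring the struct above) makes this explicit and matches the <code>kernarg_segment_byte_size = 24</code> and the <code>s_load</code> offsets <code>0x00</code>/<code>0x10</code> used in the kernel below:
<codebox>
<pre>#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Same layout as simplePermute_args_t above; on a 64-bit host each
// pointer is 8 bytes and naturally aligned, so no padding appears.
struct simplePermute_args_t {
    uint32_t * in;     // offset 0x00
    uint32_t * index;  // offset 0x08
    uint32_t * out;    // offset 0x10
};

static_assert(offsetof(simplePermute_args_t, index) == 0x08, "natural alignment");
static_assert(offsetof(simplePermute_args_t, out) == 0x10, "natural alignment");
static_assert(sizeof(simplePermute_args_t) == 24, "kernarg_segment_byte_size");
</pre>
</codebox>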
The host program should also allocate memory for the <code>in</code>, <code>index</code> and <code>out</code> buffers. In the GitHub repository, all the run-time-related stuff is hidden in the Dispatch and Buffer classes, so the sample code looks much cleaner:
<codebox>
<pre>// Create Kernarg segment
if (!AllocateKernarg(3 * sizeof(void*))) { return false; }
// Create buffers
Buffer *in, *index, *out;
in = AllocateBuffer(size);
index = AllocateBuffer(size);
out = AllocateBuffer(size);
// Fill Kernarg memory
Kernarg(in); // Append base pointer to "in" buffer
Kernarg(index); // Append base pointer to "index" buffer
Kernarg(out); // Append base pointer to "out" buffer
</pre>
</codebox>
<h2>Initial Wavefront and Register State</h2>
To launch a kernel on real hardware, the runtime needs information about the kernel, such as
<ul>
<li>The LDS size</li>
<li>The number of GPRs</li>
<li>Which registers need initialization before the kernel starts</li>
</ul>
All this data resides in the <code>amd_kernel_code_t</code> structure. A full description of the structure is available in the <a href="https://github.com/RadeonOpenCompute/ROCm-Docs/blob/master/AMDGPU-ABI.md" target="_blank">AMDGPU-ABI</a> specification. This is what it looks like in source code:
<codebox>
<pre>.hsa_code_object_version 2,0
.hsa_code_object_isa 8, 0, 3, "AMD", "AMDGPU"
.text
.p2align 8
.amdgpu_hsa_kernel hello_world
hello_world:
.amd_kernel_code_t
    enable_sgpr_kernarg_segment_ptr = 1
    is_ptr64 = 1
    compute_pgm_rsrc1_vgprs = 1
    compute_pgm_rsrc1_sgprs = 0
    compute_pgm_rsrc2_user_sgpr = 2
    kernarg_segment_byte_size = 24
    wavefront_sgpr_count = 8
    workitem_vgpr_count = 5
.end_amd_kernel_code_t
s_load_dwordx2 s[4:5], s[0:1], 0x10
s_load_dwordx4 s[0:3], s[0:1], 0x00
v_lshlrev_b32 v0, 2, v0
s_waitcnt lgkmcnt(0)
v_add_u32 v1, vcc, s2, v0
v_mov_b32 v2, s3
v_addc_u32 v2, vcc, v2, 0, vcc
v_add_u32 v3, vcc, s0, v0
v_mov_b32 v4, s1
v_addc_u32 v4, vcc, v4, 0, vcc
flat_load_dword v1, v[1:2]
flat_load_dword v2, v[3:4]
s_waitcnt vmcnt(0) & lgkmcnt(0)
v_lshlrev_b32 v1, 2, v1
ds_bpermute_b32 v1, v1, v2
v_add_u32 v3, vcc, s4, v0
v_mov_b32 v2, s5
v_addc_u32 v4, vcc, v2, 0, vcc
s_waitcnt lgkmcnt(0)
flat_store_dword v[3:4], v1
s_endpgm
</pre>
</codebox>
Currently, a programmer must manually set all non-default values to provide the necessary information. Hopefully, this situation will change with new updates that bring automatic register counting and possibly a new syntax to fill that structure.
Before the start of every wavefront execution, the GPU sets up the register state on the basis of the <code>enable_sgpr_*</code> and <code>enable_vgpr_*</code> flags. VGPR <code>v0</code> is always initialized with a work-item ID in the <em>x</em> dimension. Registers <code>v1</code> and <code>v2</code> can be initialized with work-item IDs in the <em>y</em> and <em>z</em> dimensions, respectively. Scalar GPRs can be initialized with a work-group ID and work-group count in each dimension, a dispatch ID, and pointers to <code>kernarg</code>, the <code>aql</code> packet, the <code>aql</code> queue, and so on. Again, the <a href="https://github.com/RadeonOpenCompute/ROCm-Docs/blob/master/AMDGPU-ABI.md#initial-kernel-register-state" target="_blank">AMDGPU-ABI</a> specification contains a full list in the section on initial register state. For this example, the 64-bit <code>kernarg</code> base address will be stored in the <code>s[0:1]</code> registers (<code>enable_sgpr_kernarg_segment_ptr = 1</code>), and the work-item thread ID will occupy <code>v0</code> (by default). The diagram below shows the initial state for our kernel.
<img class="size-large wp-image-3535 aligncenter" src="http://gpuopen.com/wp-content/uploads/2016/06/initial_state-1024x516.png" alt="initial_state" width="640" height="323" />
<h2>GPR Counting</h2>
The next <code>amd_kernel_code_t</code> fields are straightforward: <code>is_ptr64 = 1</code> says we are in 64-bit mode, and <code>kernarg_segment_byte_size = 24</code> gives the size of the <code>kernarg</code> segment. The GPR counting is less obvious, however. <code>workitem_vgpr_count</code> holds the number of vector registers that each work item uses, and <code>wavefront_sgpr_count</code> holds the number of scalar registers that a wavefront uses. The code above employs <code>v0–v4</code>, so <code>workitem_vgpr_count = 5</code>. But <code>wavefront_sgpr_count = 8</code> even though the code only shows <code>s0–s5</code>, because the special registers <code>VCC</code>, <code>FLAT_SCRATCH</code> and <code>XNACK</code> are physically stored in the highest-numbered SGPRs of the wavefront’s allocation. In this example, <code>FLAT_SCRATCH</code> and <code>XNACK</code> are disabled, so only the two <code>VCC</code> registers are added to the count.
In current GCN3 hardware, VGPRs are allocated in groups of 4 registers and SGPRs in groups of 16. Previous generations (GCN1 and GCN2) have a VGPR granularity of 4 registers and an SGPR granularity of 8 registers. The fields <code>compute_pgm_rsrc1_*gprs</code> contain a device-specific number for each register-block type to allocate for a wavefront. As we said previously, future updates may enable automatic counting, but for now you can use the following formulas for all three GCN GPU generations:
<codebox>
<pre>compute_pgm_rsrc1_vgprs = (workitem_vgpr_count-1)/4
compute_pgm_rsrc1_sgprs = (wavefront_sgpr_count-1)/8
</pre>
</codebox>
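Plugging in the register counts from our kernel reproduces the values set in the <code>amd_kernel_code_t</code> block above:
<codebox>
<pre>compute_pgm_rsrc1_vgprs = (5 - 1)/4 = 1   // workitem_vgpr_count = 5
compute_pgm_rsrc1_sgprs = (8 - 1)/8 = 0   // wavefront_sgpr_count = 8
</pre>
</codebox>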
Now consider the corresponding assembly:
<codebox>
<pre>// initial state:
// s[0:1] - kernarg base address
// v0 - workitem id
s_load_dwordx2 s[4:5], s[0:1], 0x10 // load out_ptr into s[4:5] from kernarg
s_load_dwordx4 s[0:3], s[0:1], 0x00 // load in_ptr into s[0:1] and index_ptr into s[2:3] from kernarg
v_lshlrev_b32 v0, 2, v0 // v0 *= 4;
s_waitcnt lgkmcnt(0) // wait for memory reads to finish
// compute address of corresponding element of index buffer
// i.e. v[1:2] = &index[workitem_id]
v_add_u32 v1, vcc, s2, v0
v_mov_b32 v2, s3
v_addc_u32 v2, vcc, v2, 0, vcc
// compute address of corresponding element of in buffer
// i.e. v[3:4] = &in[workitem_id]
v_add_u32 v3, vcc, s0, v0
v_mov_b32 v4, s1
v_addc_u32 v4, vcc, v4, 0, vcc
flat_load_dword v1, v[1:2] // load index[workitem_id] into v1
flat_load_dword v2, v[3:4] // load in[workitem_id] into v2
s_waitcnt vmcnt(0) & lgkmcnt(0) // wait for memory reads to finish
// v1 *= 4; ds_bpermute_b32 uses byte offset and registers are dwords
v_lshlrev_b32 v1, 2, v1
// perform permutation
// temp[thread_id] = v2
// v1 = temp[v1]
// effectively we got v1 = in[index[thread_id]]
ds_bpermute_b32 v1, v1, v2
// compute address of corresponding element of out buffer
// i.e. v[3:4] = &out[workitem_id]
v_add_u32 v3, vcc, s4, v0
v_mov_b32 v2, s5
v_addc_u32 v4, vcc, v2, 0, vcc
s_waitcnt lgkmcnt(0) // wait for permutation to finish
// store final value in out buffer, i.e. out[workitem_id] = v1
flat_store_dword v[3:4], v1
s_endpgm
</pre>
</codebox>
<h2>Compiling a GCN ASM Kernel into Hsaco</h2>
The next step is to produce a Hsaco from the ASM source. LLVM has added support for the AMDGCN assembler, so you can use Clang to do all the necessary magic:
<codebox>
<pre>clang -x assembler -target amdgcn--amdhsa -mcpu=fiji -c -o test.o asm_source.s
clang -target amdgcn--amdhsa test.o -o test.co
</pre>
</codebox>
The first command assembles an object file from the assembly source, and the second one links everything (you could have multiple source files) into a Hsaco. Now you can load and run kernels from that Hsaco in a program. The <a href="https://github.com/RadeonOpenCompute/LLVM-AMDGPU-Assembler-Extra" target="_blank">GitHub examples</a> use CMake to automatically compile ASM sources.
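Loading a Hsaco at run time goes through the standard HSA runtime calls. Below is a minimal sketch of that sequence, written in the same bare-statement style as the host code above (error handling omitted; <code>hsaco_bytes</code>, <code>hsaco_size</code> and <code>agent</code> are assumed to be set up elsewhere):
<codebox>
<pre>#include &lt;hsa.h&gt;

// Deserialize the Hsaco bytes into a code object.
hsa_code_object_t code_object;
hsa_code_object_deserialize(hsaco_bytes, hsaco_size, NULL, &amp;code_object);

// Create an executable, load the code object and freeze it.
hsa_executable_t executable;
hsa_executable_create(HSA_PROFILE_FULL, HSA_EXECUTABLE_STATE_UNFROZEN,
                      NULL, &amp;executable);
hsa_executable_load_code_object(executable, agent, code_object, NULL);
hsa_executable_freeze(executable, NULL);

// Look up the kernel symbol and fetch the kernel-object handle,
// which then goes into the AQL dispatch packet.
hsa_executable_symbol_t symbol;
hsa_executable_get_symbol(executable, NULL, "hello_world", agent, 0, &amp;symbol);
uint64_t kernel_object;
hsa_executable_symbol_get_info(symbol,
    HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &amp;kernel_object);
</pre>
</codebox>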
In a future post we will cover DPP, another GCN cross-lane feature that allows vector instructions to grab operands from a neighboring lane.
<h2>References</h2>
<ol>
<li><a href="http://developer.amd.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf" target="_blank">GCN3 ISA</a></li>
<li><a href="https://github.com/RadeonOpenCompute/ROCm-Docs/blob/master/AMDGPU-ABI.md" target="_blank">Hsaco and amd_kernel_code_t description</a></li>
<li><a href="https://github.com/RadeonOpenCompute/LLVM-AMDGPU-Assembler-Extra" target="_blank">GitHub code with ASM examples</a></li>
</ol>
<i>Ilya Perminov is a software engineer at Luxoft. He earned his PhD in computer graphics in 2014 from ITMO University in Saint Petersburg, Russia. Ilya interned at AMD in 2015, during which time he worked on graphics-workload tracing and performance modeling. His research interests include real-time rendering techniques, GPU architecture and GPGPU.</i>
</section>
</div>
</body>
</html>