This document is a reference for software developers who want to optimize code for Graviton Arm64 processors at the assembly level. Some of the documented transforms could be applied to optimize code written in higher level languages like C or C++, and in that respect this document is not restrictive to assembly level programming. This guide isn’t intended cover every aspect of writing in assembly, such as assembler syntax or ABI details.
The code patterns in this document could also be useful to spot inefficient code generated for the hot code. Profilers such as Linux perf allow the inspection of the assembly level code. During the review of the hot code, software developers can reference this guide to find better ways to optimize inner loops.
To develop code for AWS Gravition, we always recommend developing and testing on Graviton itself to ensure that the optimizations you make are targeted toward Graviton. The microarchitecture of other Arm64 cores may be different in ways that lead to different optimizations.
Some techniques for writing optimized assembly:
- Be aware of instruction level parallelism
- Split Data Dependency Chains
- Modulo Scheduling
- Test Everything
- Specialize functions for known input conditions
- Use Efficient Instructions
We will be adding more sections to this document soon, so check back!
This guide aims only to provide tips on optimizing assembly or intrinsics code. There are other resources available on the web which can help you to learn concepts and others which you can refer to as references. Here is a list of helpful pages.
- https://mariokartwii.com/armv8/ - a full book on how to get start writing ARMv8 assembly
- https://github.com/pkivolowitz/asm_book - another guide to get starting writing assembly
- https://github.com/slothy-optimizer/slothy - a tool which can optimize instruction scheduling, register allocation, and software pipeline of assembly functions
- https://developer.arm.com/architectures/instruction-sets/intrinsics - a reference of intrinsics functions when you want to make direct use of SIMD instructions from C/C++
- https://dougallj.github.io/asil/index.html - a reference of SVE instructions and their intrinsic function names by SVE version and extension feature availability
- https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#the-base-procedure-call-standard - a reference to the calling convention for aarch64, which is particularly helpful to know when registers to avoid or to spill to the stack when they are needed for use
- https://developer.arm.com/documentation/ddi0602/2024-03 - the complete reference for instructions in the Arm A-profile A64 ISA, which can be downloaded as a PDF
When writing in C, or any higher level language, the programmer writes a sequence of statements which are presumed to execute in strict sequence. In the following example, each statement logically occurs after the previous one.
int foo (int a, int b, int c)
{
if (c < 0) {
int d = a + b;
return d;
} else {
int e = a * b;
return e;
}
}
However when we compile this code with gcc version 7.3 with -O1
, this is the output:
foo(int, int, int):
add w3, w0, w1
mul w0, w0, w1
cmp w2, 0
csel w0, w0, w3, ge
ret
Here we can see the add and the multiply instruction are executed independently
of the result of c < 0
. The compiler knows that the CPU can execute these
instructions in parallel. If we use the machine code analyzer from the LLVM
project, we can can see what is happening. We will use Graviton2 as our target
platform, which uses the Neoverse-N1 core.
llvm-mca -mcpu=neoverse-n1 -timeline -iterations 1 foo.s
...
Timeline view:
Index 012345
[0,0] DeER . add w3, w0, w1
[0,1] DeeER. mul w0, w0, w1
[0,2] DeE-R. cmp w2, #0
[0,3] D==eER csel w0, w0, w3, ge
[0,4] DeE--R ret
...
In this output we can see that all five instructions from this function are decoded
in parallel in the first cycle. Then in the second cycle the instructions for which
all dependencies are already resolved begin executing. There is only one instruction
which has dependencies: the conditional select, csel
. The four remaining
instructions all begin executing, including the ret
instruction at the end. In
effect, nearly the entire C function is executing in parallel. The
csel
has to wait for the result of the comparison instruction to check
c < 0
and the result of the multiplication. Even the return has
completed by this point and the CPU front end has already decoded
instructions in the return path and some of them will already be
executing as well.
It is important to understand this when writing assembly. Instructions do not execute in the order they appear. Their effects will be logically sequential, but execution order will not be.
When writing assembly for arm64 or any platform, be aware that the CPU has more than one pipeline of different types. The types and quantities are outlined in the Software Optimization Guide, often abbreviated as SWOG. The guide for each Graviton processor is linked from the main page of this technical guide. Using this knowledge, the programmer can arrange instructions of different types next to each other to take advantage of instruction level parallelism, ILP. For example, interleaving load instructions with vector or floating point instructions can keep both pipelines busy.
As we saw in the last section, Graviton has multiple pipelines or execution units which can execute instructions. The instructions may execute in parallel if all of the input dependencies have been met but a series of instructions each of which depend on a result from the previous one will not be able to efficiently utilize the resources of the CPU.
For example, a simple C function which takes 64 signed 8-bit integers and adds them all up into one value could be implemented like this:
int16_t add_64(int8_t *d) {
int16_t sum = 0;
for (int i = 0; i < 64; i++) {
sum += d[i];
}
return sum;
}
We could write this in assembly using NEON SIMD instructions like this:
add_64_neon_01:
ld1 {v0.16b, v1.16b, v2.16b, v3.16b} [x0] // load all 64 bytes into vector registers v0-v3
movi v4.2d, #0 // zero v4 for use as an accumulator
saddw v4.8h, v4.8h, v0.8b // add bytes 0-7 of v0 to the accumulator
saddw2 v4.8h, v4.8h, v0.16b // add bytes 8-15 of v0 to the accumulator
saddw v4.8h, v4.8h, v1.8b // ...
saddw2 v4.8h, v4.8h, v1.16b
saddw v4.8h, v4.8h, v2.8b
saddw2 v4.8h, v4.8h, v2.16b
saddw v4.8h, v4.8h, v3.8b
saddw2 v4.8h, v4.8h, v3.16b
addv h0, v4.8h // horizontal add all values in the accumulator to h0
fmov w0, h0 // copy vector registor h0 to general purpose register w0
ret // return with the result in w0
In this example, we use a signed add-wide instruction which adds the top and bottom of each register to v4 which is used to accumulate the sum. This is the worst case for data dependency chains because every instruction depends on the result of the previous one which will make it impossible for the CPU to achieve any instruction level parallelism. If we use llvm-mca
to evaluate it we can see this clearly.
Timeline view:
0123456789
Index 0123456789 0123456
[0,0] DeeER. . . . .. movi v4.2d, #0000000000000000
[0,1] D==eeER . . . .. saddw v4.8h, v4.8h, v0.8b
[0,2] D====eeER . . . .. saddw2 v4.8h, v4.8h, v0.16b
[0,3] D======eeER . . .. saddw v4.8h, v4.8h, v1.8b
[0,4] D========eeER . . .. saddw2 v4.8h, v4.8h, v1.16b
[0,5] D==========eeER. . .. saddw v4.8h, v4.8h, v2.8b
[0,6] D============eeER . .. saddw2 v4.8h, v4.8h, v2.16b
[0,7] D==============eeER . .. saddw v4.8h, v4.8h, v3.8b
[0,8] D================eeER .. saddw2 v4.8h, v4.8h, v3.16b
[0,9] D==================eeeeER.. addv h0, v4.8h
[0,10] D======================eeER fmov w0, h0
[0,11] DeE-----------------------R ret
One way to break data dependency chains is to use commutative property of addition and change the order which the adds are completed. Consider the following alternative implementation which makes use of pairwise add instructions.
add_64_neon_02:
ld1 {v0.16b, v1.16b, v2.16b, v3.16b} [x0]
saddlp v0.8h, v0.16b // signed add pairwise long
saddlp v1.8h, v1.16b
saddlp v2.8h, v2.16b
saddlp v3.8h, v3.16b // after this instruction we have 32 16-bit values
addp v0.8h, v0.8h, v1.8h // add pairwise again v0 and v1
addp v2.8h, v2.8h, v3.8h // now we are down to 16 16-bit values
addp v0.8h, v0.8h, v2.8h // 8 16-bit values
addv h0, v4.8h // add 8 remaining values across vector
fmov w0, h0
ret
In this example the first 4 instructions after the load can execute independently and then the next 2 are also independent of each other. However, the last 3 instructions do have data dependencies on each other. If we take a look with llvm-mca
again, we can see that this implementation takes 17 cycles (excluding the initial load instruction common to both implementations) and the original takes 27 cycles.
Timeline view:
0123456
Index 0123456789
[0,0] DeeER. . .. saddlp v0.8h, v0.16b
[0,1] DeeER. . .. saddlp v1.8h, v1.16b
[0,2] D=eeER . .. saddlp v2.8h, v2.16b
[0,3] D=eeER . .. saddlp v3.8h, v3.16b
[0,4] D==eeER . .. addp v0.8h, v0.8h, v1.8h
[0,5] D===eeER . .. addp v2.8h, v2.8h, v3.8h
[0,6] D=====eeER. .. addp v0.8h, v0.8h, v2.8h
[0,7] .D======eeeeeER.. addv h0, v0.8h
[0,8] .D===========eeER fmov w0, h0
[0,9] .DeE------------R ret
When writing a loop in assembly that contains several dependent steps in each iteration but there are no dependencies between iterations (or only superficial dependencies), modulo scheduling can improve performance. Modulo scheduling is the processes of combining loop unrolling with the interleaving of steps from other iterations of the loop. For example, suppose we have a loop which has four steps, A, B, C, and D. We can unroll that loop to execute 4 iterations in parallel with vector instructions such that we complete steps like this: A₀₋₃, B₀₋₃, C₀₋₃, D₀₋₃. However this construction still has a dependency chain which the last section taught us to avoid. We can break that chain by scheduling the steps with modulo scheduling.
A1
B1 A2
C1 B2 A3
D1 C2 B3 A4
D2 C3 B4
D3 C4
D4
Take the following example written in C. I noted step A through step D in the source, but the lines in C don’t map cleanly onto the steps I selected for implementation in assembly.
void modulo_scheduling_example_c_01(int16_t *destination, int dest_len,
int8_t *src, uint32_t *src_positions,
int16_t *multiplier)
{
int i;
for (i = 0; i < dest_len * 8; i++)
{
// step A: load src_position[i] and multiplier[i]
int src_pos = src_positions[i];
// step B: load src[src_pos] and copy the value to a vector register
// step C: sign extend int8_t to int16_t, multiply, and saturate-shift right
int32_t val = multiplier[i] * src[src_pos];
destination[i] = min(val >> 3, (1 << 15) - 1);
// step D: store the result to destination[i]
}
}
A first pass implementation of this function using Neon instructions to unroll the loop could look like this:
Click here to see the (long) assembly implementation.
// x0: destination
// w1: dest_len
// x2: src
// x3: src_positions
// x4: multiplier
modulo_scheduling_example_asm_01:
stp x19, x20, [sp, #-16]!
1:
// src_pos = src_positions[i]
ldp w5, w6, [x3]
ldp w7, w8, [x3, 8]
ldp w9, w10, [x3, 16]
ldp w11, w12, [x3, 24]
add x3, x3, #32
// src[src_pos]
ldrb w13, [x2, w5, UXTW]
ldrb w14, [x2, w6, UXTW]
ldrb w15, [x2, w7, UXTW]
ldrb w16, [x2, w8, UXTW]
ldrb w17, [x2, w9, UXTW]
ldrb w18, [x2, w10, UXTW]
ldrb w19, [x2, w11, UXTW]
ldrb w20, [x2, w12, UXTW]
// copy to vector reg
ins v0.8b[0], w13
ins v0.8b[1], w14
ins v0.8b[2], w15
ins v0.8b[3], w16
ins v0.8b[4], w17
ins v0.8b[5], w18
ins v0.8b[6], w19
ins v0.8b[7], w20
// sign extend long, convert int8_t to int16_t
sxtl v0.8h, v0.8b
// multiplier[i]
ld1 {v1.8h}, [x4], #16
// multiply
smull v2.4s, v1.4h, v0.4h
smull2 v3.4s, v1.8h, v0.8h
// saturating shift right
sqshrn v2.4h, v2.4s, #3
sqshrn2 v2.8h, v3.4s, #3
st1 {v2.8h}, [x0], #16
subs w1, w1, #1
b.gt 1b
ldp x19, x20, [sp], #16
ret
Lets’s benchmark both implementations and see what the results are:
modulo_scheduling_example_c_01 - runtime: 1.598776 s
modulo_scheduling_example_asm_01 - runtime: 1.492209 s
The assembly implementation is faster, which is great, but lets see how we can do better. We could take this a step further and employ modulo scheduling. This increases the complexity of the implementation, but for hot functions which execute many iterations, this may be worth it. In order to make this work, we have to prime the loop with a prologue section:
A1
B1 A2
C1 B2 A3
Then we run a steady state loop:
D1 C2 B3 A4
And finally an epilogue to complete the iterations:
D2 C3 B4
D3 C4
D4
In order to make this work, each step must consume all of the outputs of the
previous step and not use as temporary registers any of the inputs or outputs of
other steps. For this example, step A loads the value of multiplier[i]
but it
is not consumed until step C, so in step B we simply copy its value with a mov
instruction. This implementation could look like this:
See the implementations of each step whicih are the same as above, (hidden for brevity).
.macro modulo_scheduling_example_asm_step_A
// src_pos = src_positions[i]
ldp w5, w6, [x3]
ldp w7, w8, [x3, 8]
ldp w9, w10, [x3, 16]
ldp w11, w12, [x3, 24]
add x3, x3, #32
// multiplier[i]
ld1 {v5.8h}, [x4], #16
.endm
.macro modulo_scheduling_example_asm_step_B
mov v1.16b, v5.16b
// src[src_pos]
ldrb w13, [x2, w5, UXTW]
ldrb w14, [x2, w6, UXTW]
ldrb w15, [x2, w7, UXTW]
ldrb w16, [x2, w8, UXTW]
ldrb w17, [x2, w9, UXTW]
ldrb w18, [x2, w10, UXTW]
ldrb w19, [x2, w11, UXTW]
ldrb w20, [x2, w12, UXTW]
// copy to vector reg
ins v0.8b[0], w13
ins v0.8b[1], w14
ins v0.8b[2], w15
ins v0.8b[3], w16
ins v0.8b[4], w17
ins v0.8b[5], w18
ins v0.8b[6], w19
ins v0.8b[7], w20
.endm
.macro modulo_scheduling_example_asm_step_C
// sign extend long, convert int8_t to int16_t
sxtl v0.8h, v0.8b
// multiply
smull v2.4s, v1.4h, v0.4h
smull2 v3.4s, v1.8h, v0.8h
// saturating shift right
// val = min(val >> 3, (1 << 15) - 1)
sqshrn v4.4h, v2.4s, #3
sqshrn2 v4.8h, v3.4s, #3
.endm
.macro modulo_scheduling_example_asm_step_D
// step D
// destination[i] = val
st1 {v4.8h}, [x0], #16
.endm
.global modulo_scheduling_example_asm_02
modulo_scheduling_example_asm_02:
cmp w1, #4
b.lt modulo_scheduling_example_asm_01
stp x19, x20, [sp, #-16]!
modulo_scheduling_example_asm_step_A // A1
modulo_scheduling_example_asm_step_B // B1
modulo_scheduling_example_asm_step_A // A2
modulo_scheduling_example_asm_step_C // C1
modulo_scheduling_example_asm_step_B // B2
modulo_scheduling_example_asm_step_A // A3
1: //// loop ////
modulo_scheduling_example_asm_step_D
modulo_scheduling_example_asm_step_C
modulo_scheduling_example_asm_step_B
modulo_scheduling_example_asm_step_A
sub w1, w1, #1
cmp w1, #3
b.gt 1b
//// end loop ////
modulo_scheduling_example_asm_step_D // DN-2
modulo_scheduling_example_asm_step_C // CN-1
modulo_scheduling_example_asm_step_B // BN
modulo_scheduling_example_asm_step_D // DN-1
modulo_scheduling_example_asm_step_C // CN
modulo_scheduling_example_asm_step_D // DN
// restore stack
ldp x19, x20, [sp], #16
ret
If we benchmark these functions on Graviton3, this is the result:
modulo_scheduling_example_c_01 - runtime: 1.598776 s
modulo_scheduling_example_asm_01 - runtime: 1.492209 s
modulo_scheduling_example_asm_02 - runtime: 1.583097 s
It’s still faster than the C implementation, but it’s slower than our first
pass. Let’s add another trick. If we check the software optimization guide for
Graviton2, the Neoverse N1, we see that the ins
instruction has a whopping
5-cycle latency with only 1 instruction per cycle of total throughput. There are
8 ins
instructions which will take at least 13 cycles to complete, the whole
time tying up vector pipeline resources.
If we instead outsource that work to the load and store pipelines, we can leave
the vector pipelines to process the math instructions in step C. To do this,
after each byte is loaded, we write them back to memory on the stack with strb
instructions to lay down the bytes in the order that we need them to be in the
vector register. Then in the next step we use a ld1 instruction to put those
bytes into a vector register. This approach probably has more latency (but I
haven’t calculated it), but since the next step which consumes the outputs is
three steps away, it leaves time for those instructions to complete before we
need those results. For this technique, we revise only the macros for steps B
and C:
.macro modulo_scheduling_example_asm_03_step_B
mov v4.16b, v5.16b
strb w13, [sp, #16]
strb w14, [sp, #17]
strb w15, [sp, #18]
strb w16, [sp, #19]
strb w17, [sp, #20]
strb w18, [sp, #21]
strb w19, [sp, #22]
strb w20, [sp, #23]
.endm
.macro modulo_scheduling_example_asm_03_step_C
mov v1.16b, v4.16b
ldr d0, [sp, #16]
.endm
This technique gets the best of both implementations and running the benchmark again produces these results:
modulo_scheduling_example_c_01 - runtime: 1.598776 s
modulo_scheduling_example_asm_01 - runtime: 1.492209 s
modulo_scheduling_example_asm_02 - runtime: 1.583097 s
modulo_scheduling_example_asm_03 - runtime: 1.222440 s
Nice. Now we have the fastest implementation yet with a 24% speed improvement. But wait! What if the improvement is only from the trick where we save the intermediate values on the stack in order to get them into the right order and modulo scheduling isn’t helping. Let’s benchmark the case where we use this trick without modulo scheduling.
modulo_scheduling_example_asm_04 - runtime: 2.106044 s
Ooof. That’s the slowest implementation yet. The reason this trick works is because it divides work up between vector and memory pipelines and stretches out the time until the results are required to continue the loop. Only when we do both of these things, do we achieve this performance.
I didn’t mention SVE gather load instructions on purpose. They would most definitely help here, and we will explore that in the future.
Test everything and make no assumptions. When you are writing hand optimized assembly or intrinsics to achieve the best performance, speed can be impacted in surprising and unintuitive ways even for experienced assembly developers, so it’s important to create micro benchmarks for every function that you work on. It’s also important to consider how the function will be called. For example, a kernel in a video encoder for computing the sum-absolute-difference of two frames of video will be called repeatedly in a tight loop with similar or identical arguments which will give the branch predictor and instruction caches plenty of opportunity to train themselves to peak performance. Microbenchmarks to test these types of functions in loops with fixed arguments are very useful to measure performance. However, something like mempcy is called on irregular intervals possibly with different arguments for each call. Benchmarking a function like this in a tight loop may lead to misleading results since a tight loop will over-train the branch predictor leading the same path being taken each time, whereas in real workloads data dependent branches will incur a more significant penalty. To properly benchmark memcpy we would need to benchmark it with a distribution of length arguments similar to what you expect in a production workload, which will vary widely.
There is no one-size-fits all approach for benchmarking, so as you design your test harness, consider how the function will be used and design accordingly. You may even want several types of microbenchmarks to evaluate changes to your function across different conditions. Most functions written in assembly should avoid the need for knowledge about an external structure so the author does not have to make assumptions about the layout of memory which could be changed by the compiler. This also makes it easier to extract these functions as small kernels for testing separately from a large application in a standalone test harness. A standalone test harness is very convenient, since it is likely to build much more quickly than a large application the assembly function is a part of, making iterative development quicker. In addition to testing a wide range of input conditions, which is the case for any function in any language, assembly functions should also be tested to ensure they do not violate the calling convention. For example, some registers should not be modified, or if they are, their contents should be spilled to the stack and reloaded before returning. For some functions, especially if the input affects locations of memory dereferencing, it is necessary to fuzz the function by generating random input arguments, or all permutations, if it is feasible.
While testing is always important in software development, it is especially important for assembly programming. Choose a test method that works for your case and make thorough use of it.
Functions that are important enough to optimize with assembly often take very predictable or specific code paths that are the same every time they are called. For example, a hot function written in C may take an argument which specifies an array length which the C will use in a loop termination condition. The C may be written to accept any possible input for that length. However when we write the function in assembly, it can be advantageous to make assumptions about the arguments, such as “it will always be divisible by 2, 4, 8, or 16.” We can handle these cases by checking if any of these assumptions is true at the top of the function and then branch to a specialized portion of the function. Then the specialized assembly can omit more general code to handle any possible input and make use of loop unrolling or SIMD instructions which divide evenly into the array length.
Another technique is to call hot functions through a function pointer. Before setting the function pointer, if the input arguments can be verified to always satisfy some conditions, like the length is divisible 16 and always non-zero, we can set the pointer to an assembly function which is optimized based on those assumptions. Authors of video codecs will be very familiar with this technique. In fact, many encoders have a great deal of infrastructure just to dispatch work to specialized functions which match the input conditions.
Here is an example:
int32_t sum_all(int8_t *values, int length)
{
int32_t sum = 0;
for (int i = 0; i < length; i++)
sum += values[i];
return sum;
}
If this is only ever called when length is divisible by 16 and non-zero, we can implement in assembly like this:
// x0 - int8_t *values
// w1 - int length
sum_all_x16_asm:
movi v2.4s, #0 // zero out v2 for use as an accumulator
1: ld1 {v0.16b}, [x0], #16 // load 16 bytes from values[i]
saddlp v1.8h, v0.16b // pairwise-add-long (A, B, C, D, ...) => (A+B, C+D, ...)
sadalp v2.4s, v1.8h // pairwise-add-accumulate-long (same, but add to v2)
subs w1, w1, #16 // check the loop counter
b.gt 1b // branch back if more iterations remain
addv s0, v2.4s // add-across the accumulator
fmov w0, s0 // copy to general purpose register
ret
In this implementation, we can skip the initial check to see if length is zero and we can skip any checks for any unconsumed values if length was not divisible by 16. A more complex function may be able to be unrolled farther or make more assumptions about type conversions which are knowable based on the inputs, which can improve execution speed even more.
Make yourself aware of the various types of instructions available to use. Choosing the right instructions can make the difference between a 5 or 10 instructions sequence or just one instruction that does exactly what you need. For example, saturating arithmetic instructions can replace a series of instructions with just one. A rounding-shift-right-narrow can compute an average of two integers with just two instructions instead of three with a truncating right shift and in fewer cycles than the very expensive divide instruction.
One way this is especially true is the use of vector or instructions. We have already mentioned Neon instructions in this document, but to make it clearer, these instructions operation on multiple data lanes in parallel. They are also known as SIMD or Single Instruction Multiple Data instructions. The full details on how to use this type of programming is beyond the scope of this guide, but when it’s possible to process 16 bytes in parallel instead of just one at a time, the speed up can be significant.
Another way to use efficient instructions is to consult the software
optimization guide.
Many different combinations of instructions can accomplish the same result, and
some are obviously better than others. Some instructions, like the
absolute-difference-accumulate-long instructions (sabal
and uabal
) can only
be executed on half of the vector pipelines in Neoverse V1, which slashes the
overall throughput of a series of independent instructions to two per cycle.
Using instead a series of absolute-difference instructions (sabd
or uabd
),
which can execute on all four pipelines, can increase the parallelism. If there
are enough of these instructions, this is worth the trade off of requiring a
separate step to add up the accumulating absolute difference. There are other
instructions which this applies to, so consult the software optimization guide
to see what the throughput and latency numbers are for the instructions you are
using.
Expand to see a code sample using the two different approaches for calculating sum-absolute-difference.
The first function in this example uses `uabal` which is limited to half of the NEON execution units. The second uses more instructions to complete the final sum, but since the instructions can execute on all NEON execution units, it runs 25% faster on Graviton3.This is a C implementation of the functions below for illustration.
int sum_absolute_difference_c(uint8_t *a, uint8_t *b, int h)
{
int s = 0;
for (int i = 0; i < h*64; i++) {
s += abs(a[i] - b[i]);
}
return s;
}
.text
.global sad_asm_01
.global sad_asm_02
sad_asm_01:
movi v18.2d, #0 // write zeros to the accumulation register
1:
ld1 {v0.16b-v3.16b}, [x0], #64 // Load 64 bytes from uint8_t *a
ld1 {v4.16b-v7.16b}, [x1], #64 // Load 64 bytes from uint8_t *b
uabdl v16.8h, v0.8b, v4.8b // sad long 0-7 (clears v16)
uabal2 v16.8h, v0.16b, v4.16b // sad accumulate long 8-15
uabal v16.8h, v1.8b, v5.8b // sad accumulate long 16-23
uabal2 v16.8h, v1.16b, v5.16b // sad accumulate long 24-31
uabal v16.8h, v2.8b, v6.8b // sad accumulate long 31-39
uabal2 v16.8h, v2.16b, v6.16b // sad accumulate long 40-47
uabal v16.8h, v3.8b, v7.8b // sad accumulate long 48-57
uabal2 v16.8h, v3.16b, v7.16b // sad accumulate long 58-63
uadalp v18.4s, v16.8h // add accumulate pairwise long into the 32 bit accumluator
subs w2, w2, #1 // decrement the loop counter
b.gt 1b // branch to the top of the loop
addv s0, v18.4s // add across vector to add up the 32 bit values in the accumlutor (v18)
fmov w0, s0 // move vector register s0 to 32 bit GP register, w0
ret
sad_asm_02:
movi v21.2d, #0 // write zeros to the accumulation registers
1:
ld1 {v0.16b-v3.16b}, [x0], #64 // Load 64 bytes from uint8_t *a
ld1 {v4.16b-v7.16b}, [x1], #64 // Load 64 bytes from uint8_t *b
uabd v17.16b, v0.16b, v4.16b // sad 0-15
uabd v18.16b, v1.16b, v5.16b // sad 16-31
uabd v19.16b, v2.16b, v6.16b // sad 32-47
uabd v20.16b, v3.16b, v7.16b // sad 48-63
uaddlp v17.8h, v17.16b // add long pairwise each of the result registers from above
uaddlp v18.8h, v18.16b // after these four instructions, we have converted 64 8-bit values to
uaddlp v19.8h, v19.16b // 32 16-bit sums
uaddlp v20.8h, v20.16b
addp v17.8h, v17.8h, v18.8h // add pairwise again converting 32 16-bit values to 16 16-bit values
addp v18.8h, v19.8h, v20.8h // second step to get to 16 values
addp v17.8h, v17.8h, v18.8h // add pairwise again, converting 16 16-bit values to 8 16-bit values
uaddlp v17.4s, v17.8h // add pairwise long, converting 8 16-bit values to 4 32-bit values
add v21.4s, v21.4s, v17.4s // accumulate these 4 values into v21 for the next iteration
subs w2, w2, #1 // decrement the loop counter
b.gt 1b // branch to the top of the loop
addv s0, v21.4s // add across vector to add up the 32 bit values in the accumluator (v21)
fmov w0, s0 // move vector register s0 to the 32 bit GP register, w0
ret
If the second approach is faster, why would we ever use sabal
or uabal
?
There is a trade off, of course. As you can see in the example, it takes more
instructions to avoid using uabal
because you have to compute the accumulation
another way. It works well in this example because we can parallelize the SAD
computation across 64 bytes. If the amount of loop unrolling was more limited
(maybe because a result from the previous iteration is needed in the next), we
might not be able to parallelize enough SAD computations to keep 4 execution
units full. In that case it would probably make sense to use uabal
since we
would already be unable to fill up the queues of 4 execution units.
Check to see what the compiler or several different compilers produce to see if it can teach you any tricks. Sometimes just writing a short segment of code in C in Compiler Explorer and seeing what the latest versions of GCC and Clang produce with various levels of optimization can reveal tricks which you may not have thought of. This doesn’t always produce useful results, especially since writing in assembly suggests that the compiler’s output was not good enough, but it can be a good way to explore ideas for implementing short segments of code.
Choosing the right instructions in an efficient way is something that takes practice and familiarity with the instructions available in the ISA. It takes time to get familiar with the broad array of options available to you as the programmer and as an experienced engineer, you can produce some impressive optimized code.