
Update SIMD sections #21

Merged: 5 commits, Jul 18, 2023
Changes from 4 commits
21 changes: 14 additions & 7 deletions chapters/10-Optimizing-Computations/10-3 Vectorization.md
@@ -4,20 +4,24 @@ typora-root-url: ..\..\img

## Vectorization {#sec:Vectorization}

On modern processors, the use of SIMD instructions can result in a great speedup over regular un-vectorized (scalar) code. When doing performance analysis, one of the top priorities of the software engineer is to ensure that the hot parts of the code are vectorized by the compiler. This section is supposed to guide engineers towards discovering vectorization opportunities. To recap on general information about the SIMD capabilities of modern CPUs, readers can take a look at [@sec:SIMD].
On modern processors, the use of SIMD instructions can result in a great speedup over regular un-vectorized (scalar) code. When doing performance analysis, one of the top priorities of the software engineer is to ensure that the hot parts of the code are vectorized. This section guides engineers towards discovering vectorization opportunities. For a recap on the SIMD capabilities of modern CPUs, readers can take a look at [@sec:SIMD].

Most vectorization happens automatically without any intervention of the user (Autovectorization). That is when a compiler automatically recognizes the opportunity to produce SIMD machine code from the source code. It is a good strategy to rely on autovectorization since modern compilers generate fast vectorized code for a wide variety of source code inputs. Echoing advice given earlier, the author recommends to let the compiler do its job and only interfere when it is needed.
Often, vectorization happens automatically without any user intervention (autovectorization): the compiler recognizes on its own the opportunity to produce SIMD machine code from the source code. Autovectorization can be a convenient solution because modern compilers generate fast vectorized code for a wide variety of source code inputs.
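
For illustration, a loop of the following shape is typically autovectorized at `-O3` (and often `-O2`) without any source changes; with Clang, the `-Rpass=loop-vectorize` flag confirms it:

```cpp
// Typically autovectorized: unit-stride accesses and a simple trip count.
// The compiler emits runtime checks for potential aliasing between the
// x and y arrays and vectorizes the main loop.
void axpy(float *y, const float *x, float a, int n) {
  for (int i = 0; i < n; i++)
    y[i] += a * x[i];
}
```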

In rather rare cases, software engineers need to adjust autovectorization, based on the feedback[^2] that they get from a compiler or profiling data. In such cases, programmers need to tell the compiler that some code region is vectorizable or that vectorization is profitable. Modern compilers have extensions that allow power users to control the vectorizer directly and make sure that certain parts of the code are vectorized efficiently. There will be several examples of using compiler hints in the subsequent sections.
However, in some cases autovectorization does not succeed without intervention by the software engineer, guided by the feedback[^2] they get from the compiler or from profiling data. In such cases, programmers need to tell the compiler that some code region is vectorizable or that vectorization is profitable. Modern compilers have extensions that allow power users to control the vectorizer directly and make sure that certain parts of the code are vectorized efficiently. There will be several examples of using compiler hints in the subsequent sections. Even so, programmers have only limited influence over the autovectorization process.
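
As a quick preview, here is a minimal sketch of such a hint, assuming Clang's loop pragmas (other compilers use different spellings, e.g., OpenMP's `#pragma omp simd`):

```cpp
void scale(float *a, const float *b, int n) {
  // Tell Clang's vectorizer that this loop is safe and worth vectorizing;
  // vectorize_width is a tuning hint, not a guarantee.
  #pragma clang loop vectorize(enable) vectorize_width(8)
  for (int i = 0; i < n; i++)
    a[i] = b[i] * 2.0f;
}
```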

It is important to note that there is a range of problems where SIMD is invaluable and where autovectorization just does not work and is not likely to work in the near future (one can find an example in [@Mula_Lemire_2019]). If it is not possible to make a compiler generate desired assembly instructions, a code snippet can be rewritten with the help of compiler intrinsics. In most cases, compiler intrinsics provide a 1-to-1 mapping into assembly instruction (see [@sec:secIntrinsics]).
It is important to note that there is a range of problems where SIMD is indispensable and where autovectorization just does not work and is not likely to work in the near future; one can find an example in [@Mula_Lemire_2019]. Compilers are also less likely to vectorize floating-point code, because vectorizing it reorders the arithmetic and thus changes the numerical results (more details later in this section). Code involving permutations or shuffles across vector lanes is also less likely to autovectorize, and this is likely to remain difficult for compilers. Finally, subsequent maintenance may change the structure of the code such that autovectorization suddenly fails. Because this may occur long after the original software was written, it would be more expensive to fix or redo the implementation at that point.
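
A classic floating-point example is a reduction, sketched below under default (strict) floating-point semantics:

```cpp
// Vectorizing this loop means summing into several partial accumulators,
// i.e., reassociating the additions, which changes rounding and thus the
// final result. Compilers keep it scalar unless reassociation is allowed,
// e.g., via -ffast-math or -fassociative-math.
float sum(const float *a, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}
```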

\personalOpinion{Even though in some cases developers need to mess around with compiler intrinsics, I recommend to mostly rely on compiler auto-vectorization and only use intrinsics when necessary. A code that uses compiler intrinsics resembles inline assembly and quickly becomes unreadable. Compiler auto-vectorization can often be adjusted using pragmas and other hints.}
In such cases, or when it is not possible to make a compiler generate the desired assembly instructions, code can instead be written using compiler intrinsics. In most cases, compiler intrinsics provide a 1-to-1 mapping to assembly instructions (see [@sec:secIntrinsics]). Intrinsics are somewhat easier to use than inline assembly because the compiler takes care of register allocation, and they allow the programmer to retain considerable control over code generation. However, they are still often verbose and difficult to read, and are subject to behavioral differences or even bugs in various compilers.
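
For instance, a small sketch using x86 AVX intrinsics, where each call corresponds to one instruction:

```cpp
#include <immintrin.h>

// _mm256_loadu_ps compiles to VMOVUPS and _mm256_add_ps to VADDPS:
// the mapping from intrinsic to AVX instruction is essentially 1-to-1.
__m256 addEightFloats(const float *a, const float *b) {
  return _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
}
```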

Generally, three kinds of vectorization are done in a compiler: inner loop vectorization, outer loop vectorization, and SLP (Superword-Level Parallelism) vectorization. In this section, we will mostly consider inner loop vectorization since this is the most common case. We provide general information about the outer loop and SLP vectorization in appendix B.
For a middle path between low-effort but unpredictable autovectorization and verbose/unreadable but predictable intrinsics, one can use a wrapper library around intrinsics. These tend to be more readable, can centralize compiler workarounds in the library instead of scattering them across user code, and still give developers control over the generated code. Many such libraries exist, differing in their coverage of recent or 'exotic' operations and in the number of platforms they support. To our knowledge, Highway is currently the only one that fully supports scalable vectors as seen in the SVE and RISC-V V instruction sets. Note that one of the authors is the tech lead for this library. It will be introduced in [@sec:secIntrinsics].

Note that when using intrinsics or a wrapper library, it is still advisable to first write the initial implementation in plain C++. This enables rapid prototyping and verification of correctness by comparing the results of the original code against the new vectorized implementation.
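
A minimal sketch of such a check; `calcSumScalar` and `calcSum` are hypothetical names standing for the original scalar code and its vectorized replacement:

```cpp
#include <cmath>
#include <cstddef>

float calcSumScalar(const float *a, size_t n); // original scalar version
float calcSum(const float *a, size_t n);       // new vectorized version

// A small relative tolerance absorbs legitimate rounding differences
// caused by the different order of additions.
bool verifySum(const float *input, size_t n) {
  float expected = calcSumScalar(input, n);
  float actual = calcSum(input, n);
  return std::fabs(actual - expected) <= 1e-5f * std::fabs(expected);
}
```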

### Compiler Autovectorization.

In the remainder of this section, we will mostly discuss inner loop vectorization, because it is the most common type of autovectorization. The other two types, outer loop vectorization and SLP (Superword-Level Parallelism) vectorization, are mentioned in appendix B.

Multiple hurdles can prevent autovectorization, some of which are inherent to the semantics of programming languages. For example, the compiler must assume that unsigned loop indices may overflow, which can prevent certain loop transformations. Another example is the assumption the C programming language makes: pointers in the program may point to overlapping memory regions, which can make analyzing the program very difficult. Another major hurdle is the design of the processor itself. In some cases, processors don't have efficient vector instructions for certain operations. For example, predicated (bitmask-controlled) load and store operations are not available on most processors. Another example is vector-wide format conversion from signed integers to doubles, because the source and result occupy vector registers of different sizes. Despite all of these challenges, the software developer can work around many of them and enable vectorization. Later in the section, we provide guidance on how to work with the compiler and ensure that the hot code is vectorized.
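
To make the first hurdle concrete, here is a sketch of a loop where the 32-bit unsigned index may legally wrap around, which the compiler has to rule out before it can compute the trip count:

```cpp
// Unsigned arithmetic wraps around by definition, so the compiler must
// prove that `i += 2` cannot wrap before transforming the loop; a 64-bit
// or signed index would make this analysis much simpler.
void packEven(float *out, const float *in, unsigned n) {
  for (unsigned i = 0; i < n; i += 2)
    out[i] = in[i] + 1.0f;
}
```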


@@ -79,6 +83,8 @@ a.cpp:4:3: remark: vectorized loop (vectorization width: 4, interleaved count: 2
...
```

Unfortunately, this flag involves subtle and potentially dangerous behavior changes, including for Not-a-Number (NaN), signed zero, infinity, and subnormal values. Because third-party code may not be ready for these effects, this flag should not be enabled across large sections of code without careful validation of the results, including edge cases.
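
For example, assuming the flag in question is `-ffast-math` (or another flag implying `-ffinite-math-only`), a standard NaN check can silently stop working:

```cpp
// Under -ffast-math the compiler may assume NaN never occurs, so this
// self-comparison can be folded to `false` at compile time, breaking
// callers that rely on it to detect NaN.
bool isNaN(float x) { return x != x; }
```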

Let's look at another typical situation when a compiler may need support from a developer to perform vectorization. When compilers cannot prove that a loop operates on arrays with non-overlapping memory regions, they usually choose to be on the safe side. Let's revisit the example from [@lst:optReport] provided in [@sec:compilerOptReports]. When the compiler tries to vectorize the code presented in [@lst:OverlappingMemRefions], it generally cannot do this because the memory regions of arrays `a`, `b`, and `c` can overlap.

Listing: a.c
@@ -140,7 +146,7 @@ void stridedLoads(int *A, int *B, int n) {
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Developers should be aware of the hidden cost of using vectorized code. Using AVX and especially AVX512 vector instructions would lead to a big frequency downclocking. The vectorized portion of the code should be hot enough to justify using AVX512.[^38]
Developers should be aware of the hidden cost of using vectorized code. Using AVX and especially AVX-512 vector instructions could lead to frequency downclocking or startup overhead, which on certain CPUs can also affect subsequent code over a period of several microseconds. The vectorized portion of the code should be hot enough to justify using AVX-512.[^38] For example, sorting 80 KiB was found to be sufficient to amortize this overhead and make vectorization worthwhile.[^39]

### Loop vectorized but scalar version used.

@@ -208,3 +214,4 @@ Since function `calcSum` must return a single value (a `uniform` variable) and o
[^36]: See example on easyperf blog: [https://easyperf.net/blog/2017/11/03/Multiversioning_by_DD](https://easyperf.net/blog/2017/11/03/Multiversioning_by_DD).
[^37]: This is a GCC-specific pragma. For other compilers, check the corresponding manuals.
[^38]: For more details read this blog post: [https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html](https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html).
[^39]: Study of AVX-512 downclocking in the [VQSort readme](https://github.com/google/highway/blob/master/hwy/contrib/sort/README.md#study-of-avx-512-downclocking).
55 changes: 54 additions & 1 deletion chapters/10-Optimizing-Computations/10-4 Compiler Intrinsics.md
@@ -4,7 +4,7 @@ typora-root-url: ..\..\img

## Compiler Intrinsics {#sec:secIntrinsics}

There are types of applications that have very few hotspots that call for tuning them heavily. However, compilers do not always do what we want in terms of generated code in those hot places. For example, a program does some computation in a loop which the compiler vectorizes in a suboptimal way. It usually involves some tricky or specialized algorithms, for which we can come up with a better sequence of instructions. It can be very hard or even impossible to make the compiler generate the desired assembly code using standard constructs of the C and C++ languages.
There are types of applications that have hotspots worth tuning heavily. However, compilers do not always do what we want in terms of generated code in those hot places. For example, a program does some computation in a loop which the compiler vectorizes in a suboptimal way. It usually involves some tricky or specialized algorithms, for which we can come up with a better sequence of instructions. It can be very hard or even impossible to make the compiler generate the desired assembly code using standard constructs of the C and C++ languages.
Owner (review comment): [TODO for Denis] Example has a bug if the trip count is not a multiple of 4.


Fortunately, it is possible to make the compiler generate particular assembly instructions without writing in a low-level assembly language. To achieve that, one can use compiler intrinsics, which are translated into specific assembly instructions. Intrinsics provide the same benefits as inline assembly, but they also improve code readability, allow compiler type checking, assist instruction scheduling, and ease debugging. The example in [@lst:Intrinsics] shows how the same loop in function `foo` can be coded via compiler intrinsics (function `bar`).

@@ -33,4 +33,57 @@ Both functions in [@lst:Intrinsics] generate similar assembly instructions. Howe

When writing code using non-portable platform-specific intrinsics, developers should also provide a fallback option for other architectures. A list of all available intrinsics for the Intel platform can be found in this [reference](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)[^11].

### Use wrapper library for intrinsics

The write-once, target-many model of ISPC is appealing. However, we may wish for tighter integration into C++ programs, for example interoperability with templates, or avoiding a separate build step and using the same compiler. Conversely, intrinsics offer more control, but at a higher development cost.

We can combine the advantages of both and avoid these drawbacks using a so-called embedded domain-specific language, where the vector operations are expressed as normal C++ functions. You can think of these functions as 'portable intrinsics', for example `Add` or `LoadU`. Even compiling your code multiple times, once per instruction set, can be done within a normal C++ library by using the preprocessor to 'repeat' your code with different compiler settings, but within unique namespaces. One example of this is the previously mentioned Highway library[^12], which only requires the C++11 standard.

Like ISPC, Highway also supports detecting the best available instruction sets, grouped into 'clusters' which on x86 correspond to Intel Core (S-SSE3), Nehalem (SSE4.2), Haswell (AVX2), Skylake (AVX-512), or Icelake/Zen4 (AVX-512 with extensions). It then calls your code from the corresponding namespace.
Unlike intrinsics, the code remains readable (without prefixes/suffixes on each function) and portable.
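
A sketch of the scaffolding this involves, assuming Highway's `foreach_target` mechanism and dynamic-dispatch macros (see its documentation for the exact conventions):

```cpp
// mycode.cc: compiled once per enabled target via self-re-inclusion.
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "mycode.cc"  // path of this very file
#include <hwy/foreach_target.h>         // re-includes this file per target
#include <hwy/highway.h>

HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE {  // expands to a unique per-target name, e.g. N_AVX2
// Per-target vector code goes here, e.g. the calcSum from the listing below.
float calcSum(const float* HWY_RESTRICT array, size_t count);
}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

#if HWY_ONCE  // emitted only once, after all per-target compilation passes
namespace project {
HWY_EXPORT(calcSum);  // table of the per-target implementations
float CallCalcSum(const float* array, size_t count) {
  // Picks the best implementation for the current CPU.
  return HWY_DYNAMIC_DISPATCH(calcSum)(array, count);
}
}  // namespace project
#endif
```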

Listing: Highway version of summing elements of an array.

~~~~ {#lst:HWY_code .cpp}
#include <hwy/highway.h>

float calcSum(const float* HWY_RESTRICT array, size_t count) {
  const ScalableTag<float> d; // type descriptor; no actual data
  auto sum = Zero(d);
  size_t i = 0;
  for (; i + Lanes(d) <= count; i += Lanes(d)) {
    sum = Add(sum, LoadU(d, array + i));
  }
  sum = Add(sum, MaskedLoad(FirstN(d, count - i), d, array + i));
  return ReduceSum(d, sum);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Notice the explicit handling of remainders after the loop processes multiples of the vector size `Lanes(d)`. Although this is more verbose, it makes visible what is actually happening and allows optimizations such as overlapping the last vector instead of relying on `MaskedLoad`, or even skipping the remainder entirely when `count` is known to be a multiple of the vector size.

Highway supports over 200 operations, which can be grouped into the following categories:

* Initialization
* Getting/setting lanes
* Getting/setting blocks
* Printing
* Tuples
* Arithmetic
* Logical
* Masks
* Comparisons
* Memory
* Cache control
* Type conversion
* Combine
* Swizzle/permute
* Swizzling within 128-bit blocks
* Reductions
* Crypto

For the full list of operations, see its documentation [^13] and [FAQ](https://github.com/google/highway/blob/master/g3doc/faq.md). You can also experiment with it in the online [Compiler Explorer](https://gcc.godbolt.org/z/zP7MYe9Yf).
Other libraries include Eigen, nsimd, SIMDe, VCL, and xsimd. Note that a C++ standardization effort starting with the Vc library resulted in `std::experimental::simd`, but it provides a very limited set of operations and, as of this writing, is only supported by the GCC 11 compiler.

[^11]: Intel Intrinsics Guide - [https://software.intel.com/sites/landingpage/IntrinsicsGuide/](https://software.intel.com/sites/landingpage/IntrinsicsGuide/).
[^12]: Highway library: [https://github.com/google/highway](https://github.com/google/highway)
[^13]: Highway Quick Reference - [https://github.com/google/highway/blob/master/g3doc/quick_reference.md](https://github.com/google/highway/blob/master/g3doc/quick_reference.md)