Update SIMD sections #21

Merged 5 commits on Jul 18, 2023
14 changes: 7 additions & 7 deletions chapters/10-Optimizing-Computations/10-3 Vectorization.md
@@ -8,19 +8,19 @@ On modern processors, the use of SIMD instructions can result in a great speedup

Often vectorization happens automatically without any user intervention (autovectorization). That is when a compiler automatically recognizes the opportunity to produce SIMD machine code from the source code. Autovectorization could be a convenient solution because modern compilers generate fast vectorized code for a wide variety of source code inputs.
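For instance, a simple loop such as the following sketch is typically autovectorized by modern compilers at `-O2`/`-O3` (the function name is ours):

```cpp
// Typically autovectorized by modern compilers at -O2/-O3. Clang reports
// its decision with -Rpass=loop-vectorize (and -Rpass-missed=loop-vectorize
// when vectorization fails); GCC offers -fopt-info-vec.
void add_arrays(const float* a, const float* b, float* c, int n) {
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
}
```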

However, in some cases, autovectorization does not succeed without intervention by the software engineer, perhaps based on the feedback[^2] that they get from a compiler or profiling data. In such cases, programmers need to tell the compiler that some code region is vectorizable or that vectorization is profitable. Modern compilers have extensions that allow power users to control the vectorizer directly and make sure that certain parts of the code are vectorized efficiently. There will be several examples of using compiler hints in the subsequent sections. However, programmers have only limited input into the autovectorization process.

It is important to note that there is a range of problems where SIMD is important and where autovectorization just does not work and is not likely to work in the near future. One can find an example in [@Mula_Lemire_2019]. Compilers are also less likely to vectorize floating-point code because the results will differ numerically (more details later in this section). Code involving permutations or shuffles across vector lanes is also less likely to autovectorize, and this is likely to remain difficult for compilers. Finally, subsequent maintenance may change the structure of the code, such that autovectorization suddenly fails. Because this may occur long after the original software was written, it would be more expensive to fix or redo the implementation at this point.

In such cases, or when it is not possible to make a compiler generate desired assembly instructions, code can instead be written using compiler intrinsics. In most cases, compiler intrinsics provide a 1-to-1 mapping to assembly instructions (see [@sec:secIntrinsics]). Intrinsics are somewhat easier to use than inline assembly because the compiler takes care of register allocation, and they allow the programmer to retain considerable control over code generation. However, they are still often verbose and difficult to read, and subject to behavioral differences or even bugs in various compilers.

\personalOpinion{One of the authors recommends mostly relying on compiler autovectorization and only using intrinsics when necessary. Another is sometimes pleasantly surprised by autovectorization, but more often disappointed in all but trivial cases, and so recommends manual vectorization to be certain of what you get.}

For a middle path between low-effort but unpredictable autovectorization, and verbose/unreadable but predictable intrinsics, one can use a wrapper library around intrinsics. These tend to be more readable, can centralize compiler fixes in a library as opposed to scattering workarounds in user code, and still allow developers control over the generated code. Many such libraries exist, differing in their coverage of recent or 'exotic' operations, and the number of platforms they support. To our knowledge, Highway is currently the only one that fully supports scalable vectors as seen in the SVE and RISC-V V instruction sets. Note that one of the authors is the tech lead for this library. It will be introduced in [@sec:secIntrinsics].

Note that when using intrinsics or a wrapper library, it is still advisable to write the initial implementation in plain C++. This allows rapid prototyping and verification of correctness by comparing the results of the original code against the new vectorized implementation.
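A minimal sketch of such a check (all names here are ours, not from the book's examples):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference implementation and its vectorized counterpart,
// assumed to be defined elsewhere.
float SumScalar(const float* a, std::size_t n);
float SumVectorized(const float* a, std::size_t n);

// Compare the two implementations on the same input. A small relative
// tolerance is used because a vectorized reduction may reorder
// floating-point additions and thus differ in the last bits.
bool CheckSum(const std::vector<float>& data) {
  const float expected = SumScalar(data.data(), data.size());
  const float actual = SumVectorized(data.data(), data.size());
  return std::fabs(expected - actual) <= 1e-5f * std::fabs(expected);
}
```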

### Compiler Autovectorization.

In the remainder of this section, we will discuss several of these approaches, especially inner loop vectorization because it is the most common type of autovectorization. The other two types, outer loop vectorization and SLP (Superword-Level Parallelism) vectorization, are mentioned in appendix B.

Multiple hurdles can prevent autovectorization, some of which are inherent to the semantics of programming languages. For example, the compiler must assume that unsigned loop indices may wrap around, which can prevent certain loop transformations. Another example is the assumption the C programming language makes that pointers in the program may point to overlapping memory regions, which makes analysis of the program very difficult. Another major hurdle is the design of the processor itself. In some cases, processors don't have efficient vector instructions for certain operations. For example, predicated (bitmask-controlled) load and store operations are not available on most processors. Another example is vector-wide format conversion from signed integers to doubles, because the result operates on vector registers of different sizes. Despite all of these challenges, the software developer can work around many of them and enable vectorization. Later in the section, we provide guidance on how to work with the compiler and ensure that the hot code is vectorized.
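As an illustration of the aliasing hurdle, here is a minimal sketch (function names are ours; `__restrict` is a compiler extension supported by GCC, Clang, and MSVC):

```cpp
// Without restrict, the compiler must assume dst may overlap src
// (e.g., dst == src + 1 would create a loop-carried dependence), so it
// either emits a runtime overlap check or gives up on vectorization.
void shift_add(float* dst, const float* src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] += src[i];
}

// Promising non-overlap lets the compiler vectorize unconditionally.
void shift_add_restrict(float* __restrict dst,
                        const float* __restrict src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] += src[i];
}
```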

24 changes: 22 additions & 2 deletions chapters/10-Optimizing-Computations/10-4 Compiler Intrinsics.md
@@ -61,8 +61,28 @@ float calcSum(const float* HWY_RESTRICT array, size_t count) {

Notice the explicit handling of remainders after the loop processes multiples of the vector sizes `Lanes(d)`. Although this is more verbose, it makes visible what is actually happening, and allows optimizations such as overlapping the last vector instead of relying on `MaskedLoad`, or even skipping the remainder entirely when `count` is known to be a multiple of the vector size.
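A minimal sketch of the overlapping-the-last-vector strategy using Highway operations (`Lanes`, `Load`, `Mul`, `Set`, `Store` are Highway ops; the function itself and its assumption that `count >= Lanes(d)` are ours; target dispatch is omitted):

```cpp
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Scales "in" into "out". The tail is handled by re-processing the last
// full vector, which overlaps already-computed elements; this is safe
// here because recomputing them stores the same values.
HWY_ATTR void Scale(const float* HWY_RESTRICT in, float* HWY_RESTRICT out,
                    size_t count, float factor) {
  const hn::ScalableTag<float> d;
  const size_t N = hn::Lanes(d);
  const auto vf = hn::Set(d, factor);
  size_t i = 0;
  for (; i + N <= count; i += N)
    hn::Store(hn::Mul(hn::Load(d, in + i), vf), d, out + i);
  if (i != count) {    // remainder: overlap the last full vector
    i = count - N;     // assumes count >= N
    hn::Store(hn::Mul(hn::Load(d, in + i), vf), d, out + i);
  }
}
```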

Highway supports over 200 operations, which can be grouped into the following categories:

* Initialization
* Getting/setting lanes
* Getting/setting blocks
* Printing
* Tuples
* Arithmetic
* Logical
* Masks
* Comparisons
* Memory
* Cache control
* Type conversion
* Combine
* Swizzle/permute
* Swizzling within 128-bit blocks
* Reductions
* Crypto

For the full list of operations, see its documentation [^13] and [FAQ](https://github.com/google/highway/blob/master/g3doc/faq.md). You can also experiment with it in the online [Compiler Explorer](https://gcc.godbolt.org/z/zP7MYe9Yf).

Other libraries include Eigen, nsimd, SIMDe, VCL, and xsimd. Note that a C++ standardization effort starting with the Vc library resulted in std::experimental::simd, but this provides a very limited set of operations and as of this writing is only supported on the GCC 11 compiler.
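For reference, a sketch of what std::experimental::simd usage looks like (assuming the libstdc++ implementation shipped with GCC 11; the function name is ours):

```cpp
#include <cstddef>
#include <experimental/simd>
namespace stdx = std::experimental;

// Scale an array in place using the native vector width.
void Scale(float* data, std::size_t n, float factor) {
  using V = stdx::native_simd<float>;
  std::size_t i = 0;
  for (; i + V::size() <= n; i += V::size()) {
    V v(data + i, stdx::element_aligned);        // vector load
    v *= factor;
    v.copy_to(data + i, stdx::element_aligned);  // vector store
  }
  for (; i < n; ++i)                             // scalar remainder
    data[i] *= factor;
}
```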

[^11]: Intel Intrinsics Guide - [https://software.intel.com/sites/landingpage/IntrinsicsGuide/](https://software.intel.com/sites/landingpage/IntrinsicsGuide/).
[^12]: Highway library: [https://github.com/google/highway](https://github.com/google/highway)
7 changes: 4 additions & 3 deletions chapters/3-CPU-Microarchitecture/3-7 SIMD.md
@@ -20,9 +20,11 @@ for (int i = 0; i < N; ++i) {

![Example of scalar and SIMD operations.](../../img/uarch/SIMD.png){#fig:SIMD width=80%}

Most of the popular CPU architectures feature vector instructions, including x86, PowerPC, Arm, and RISC-V. In 1996 Intel released a new instruction set, MMX, which was a SIMD instruction set that was designed for multimedia applications. Following MMX, Intel introduced new instruction sets with added capabilities and increased vector size: SSE, AVX, AVX2, AVX-512. Arm has optionally supported the 128-bit NEON instruction set in various versions of its architecture. In version 8 (aarch64), this support was made mandatory, and new instructions were added. To allow making use of wider vectors without having to port software to each vector length, Arm subsequently introduced the SVE instruction set. Its defining characteristic is the concept of "scalable" vectors: their length is unknown at compile time. SVE provides special instructions for length-dependent operations such as incrementing pointers.

Another example of scalable vectors is the RISC-V V extension, which was ratified in late 2021. Some implementations support quite wide (2048-bit) vectors, and up to eight can be grouped together to yield 16,384-bit vectors, which greatly reduces the number of instructions executed.
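To make the scalable-vector model concrete, here is a sketch of a length-agnostic loop written with SVE intrinsics from `arm_sve.h` (the function name is ours; the same binary runs unchanged on any hardware vector length):

```cpp
#include <arm_sve.h>
#include <cstddef>

void AddArrays(const float* a, const float* b, float* out, std::size_t n) {
  // svcntw() returns the number of 32-bit lanes, unknown at compile time.
  for (std::size_t i = 0; i < n; i += svcntw()) {
    svbool_t pg = svwhilelt_b32_u64(i, n);  // predicate masks off the tail
    svfloat32_t va = svld1_f32(pg, a + i);
    svfloat32_t vb = svld1_f32(pg, b + i);
    svst1_f32(pg, out + i, svadd_f32_m(pg, va, vb));
  }
}
```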

As the new instruction sets became available, work began to make them usable to software engineers. At first, SIMD instructions were programmed in assembly. Later, special compiler intrinsics were introduced. Today all of the major compilers support vectorization for the popular processors.

Over time, the set of operations supported in SIMD has steadily increased. In addition to straightforward arithmetic as shown above, newer use cases of SIMD include:

@@ -37,4 +39,3 @@
[^4]: SIMD hashing: [https://github.com/google/highwayhash](https://github.com/google/highwayhash)
[^5]: Random generation: [abseil library](https://github.com/abseil/abseil-cpp/blob/master/absl/random/internal/randen.h)
[^6]: Sorting: [VQSort](https://github.com/google/highway/tree/master/hwy/contrib/sort)