Performance Tuning
Since SIMDe often relies heavily on autovectorization, compiler options are critical. We do what we can to help the compiler when possible, and we try to include optimized versions whenever we can, but there are things the compiler can optimize which we can't.
First, use -O3 (at least). There really is a lot of stuff in SIMDe which will vectorize at -O3 but not at -O2. Take a look at GCC's documentation of -O3, taking note especially of -ftree-loop-vectorize and -ftree-slp-vectorize; those are incredibly important optimizations for SIMDe, since they make the compiler look at all of our little loops and try to turn them into single instructions.
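For example, with GCC (the file name here is just a placeholder):

gcc -O3 -c mycode.c

If you're stuck at -O2 for some other reason, you can still opt in to just the vectorizers:

gcc -O2 -ftree-loop-vectorize -ftree-slp-vectorize -c mycode.c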
I know a lot of people have been told that they shouldn't use -O3 because it can break your code. To be clear, -O3 does not enable unsafe optimizations, it enables expensive (at compile time) optimizations. -Ofast (or -ffast-math) contains some unsafe optimizations (depending on your definition of "unsafe", anyway), and we'll get to them, but -O3 does not.
Historically, I think the idea that -O3 was unsafe came from two places. First, compilers were more buggy in the past, and since -O3 enables more cutting-edge optimizations you were more likely to run into one of those bugs. I'm not saying compilers don't contain bugs anymore (we have found lots of them while developing SIMDe), but SIMDe is very well tested and we work around all the bugs we find.
The other thing which led people to believe -O3 is unsafe is that it can expose bugs in your code which were dormant at other optimization levels. Generally this is because you're depending on undefined behavior, and at -O3 the compiler performs more optimizations, which makes it more likely to perform an optimization which assumes you're not relying on undefined behavior. UBSan can help you find places where your code relies on undefined behavior so you can eliminate it.
While -O3 shouldn't break correct code, -Ofast can. According to the description in GCC's documentation, -Ofast will:

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens.
Of those, -ffast-math is particularly interesting for SIMDe. I won't go into detail about exactly what it does here; GCC's documentation has some information. What you should know is that if you use -ffast-math (including through -Ofast), SIMDe will enable some internal optimization flags (SIMDE_FAST_MATH and the SIMDE_FAST_* options it implies, described below).
OpenMP 4 includes support for SIMD parallelism by annotating loops with pragmas. For example:
#pragma omp simd
for (int i = 0 ; i < 4 ; i++) {
r[i] = a[i] + b[i];
}
SIMDe uses these annotations very extensively. We wrap them up in the SIMDE_VECTORIZE macros so you don't necessarily see them in the code directly, but most of the portable implementations in SIMDe use them; I currently count 2667 instances of "SIMDE_VECTORIZE" in SIMDe.
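For a sense of what that looks like in practice, here is a simplified sketch in the spirit of SIMDe's portable implementations (the r_, a_, and b_ values stand in for SIMDe's private vector types; this is illustrative, not the exact source):

/* Ask the compiler to vectorize this loop when OpenMP SIMD is enabled. */
SIMDE_VECTORIZE
for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) {
  r_.f32[i] = a_.f32[i] + b_.f32[i];
}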
When people see "OpenMP", I think a lot of them get scared off because they don't want to use the OpenMP runtime, which is used for multi-threading. However, the OpenMP SIMD pragma doesn't have anything to do with the OpenMP runtime; all the magic happens at compile time. In fact, several compilers support options to enable OpenMP 4 SIMD support without enabling full OpenMP support (-qopenmp-simd for the Intel C/C++ Compiler, -fopenmp-simd for GCC and clang); in this case, the OpenMP runtime won't even be linked.
The downside of just using -fopenmp-simd instead of -fopenmp is that the compiler doesn't actually communicate that OpenMP SIMD is enabled in a way we can observe in the source code (i.e., there is no _OPENMP_SIMD macro), so SIMDe doesn't know that it is enabled and the SIMDE_VECTORIZE macros won't output OpenMP SIMD pragmas. To get around this, you'll need to define the SIMDE_ENABLE_OPENMP macro when compiling. For example, pass -fopenmp-simd -DSIMDE_ENABLE_OPENMP instead of just -fopenmp-simd.
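The effect is roughly like the following simplified sketch (not SIMDe's exact definition, which also knows about other compilers' vectorization pragmas):

#if defined(_OPENMP) || defined(SIMDE_ENABLE_OPENMP)
  /* Emit the OpenMP SIMD pragma before each annotated loop. */
  #define SIMDE_VECTORIZE _Pragma("omp simd")
#else
  /* No OpenMP SIMD support detected; the macro expands to nothing. */
  #define SIMDE_VECTORIZE
#endif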
There are several macros you can define to get SIMDe to output faster code as a trade-off for something else. Most of these can be enabled by defining the SIMDE_FAST_MATH macro, which is also defined automatically when you pass -ffast-math. The individual options which SIMDE_FAST_MATH enables (SIMDE_FAST_NANS, SIMDE_FAST_EXCEPTIONS, SIMDE_FAST_ROUND_TIES, SIMDE_FAST_ROUND_MODE, and SIMDE_FAST_CONVERSION_RANGE) are described below.
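For example, you might enable it like this (the header shown is just one of SIMDe's headers, picked for illustration):

/* Either pass -DSIMDE_FAST_MATH (or -ffast-math) on the command line,
   or define the macro before including SIMDe: */
#define SIMDE_FAST_MATH
#include "simde/x86/sse2.h"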
In my experience, most software doesn't really handle NaNs, or "handles" them by avoiding generating them. NaNs usually result from bad data which causes your code to do something like dividing by zero, taking the square root of a negative number, etc.
Different platforms tend to have roughly equivalent functions which handle NaN very differently. For example, consider the x86 _mm_min_ps function and NEON's vminq_f32. These functions are both intended to return the minimum of two values, but if one of those values is NaN they behave differently:
a    | b    | _mm_min_ps | vminq_f32
-----|------|------------|----------
Real | Real | Real       | Real
Real | NaN  | NaN        | NaN
NaN  | Real | Real       | NaN
NaN  | NaN  | NaN        | NaN
In SIMDe, that means we can't normally implement _mm_min_ps using vminq_f32; on NEON we have to do something like vbslq_f32(vcgtq_f32(b, a), a, b). Going the other direction, we can't implement vminq_f32 using _mm_min_ps alone; on x86 we have to do something like _mm_blendv_ps(_mm_set1_ps(NAN), _mm_min_ps(a, b), _mm_cmpord_ps(a, b)).
SIMDE_FAST_NANS tells SIMDe to just go ahead and ignore issues like this and implement _mm_min_ps and vminq_f32 using one another. The vast majority of applications don't really care how NaNs are handled because there should never be any NaNs in the first place, and even if there are, the code doesn't really know what to do with them. In these cases, SIMDE_FAST_NANS can provide a significant speed-up for free.
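Conceptually, a NEON implementation of _mm_min_ps might then branch on the flag like this (a hedged sketch with illustrative variable names, not SIMDe's exact source):

#if defined(SIMDE_FAST_NANS)
  r = vminq_f32(a, b);                  /* fast path: NEON NaN semantics */
#else
  r = vbslq_f32(vcgtq_f32(b, a), a, b); /* strict path: returns b when either
                                           input is NaN, matching x86 MINPS */
#endif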
In many cases, we only want to apply an operation to part of a vector, but it's a lot faster to apply the operation to the entire vector and then blend the lanes we're interested in with the original vector. Unfortunately, if we do this and there is garbage in the lanes we're not interested in, we can end up with spurious floating-point exceptions. SIMDE_FAST_EXCEPTIONS tells SIMDe to go ahead and ignore this, which is safe for most applications.
To be clear, these aren't C++ exceptions. If you're not using functions like _mm_getcsr or fegetexcept, you're not doing anything with these exceptions anyway.
The _mm_*_ss and _mm_*_sd functions are a good example of this; they only operate on the lowest element of the input. To get around this, SIMDe will first broadcast the lowest lane to all elements, then perform the operation, then blend the result into the lane we're interested in. Unless, of course, you use SIMDE_FAST_EXCEPTIONS.
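As a rough sketch of that broadcast/operate/blend pattern (assuming SSE; the helper name is hypothetical and this mirrors the idea rather than SIMDe's exact code):

#include <xmmintrin.h>

static __m128 sqrt_ss_no_fp_exceptions(__m128 a) {
  /* Broadcast lane 0 so every lane holds valid input. */
  __m128 a0 = _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, 0));
  /* Operate on the whole vector; no garbage lanes, so no spurious exceptions. */
  __m128 s = _mm_sqrt_ps(a0);
  /* Merge lane 0 of the result back into a, leaving lanes 1-3 untouched. */
  return _mm_move_ss(a, s);
}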
This is likely to be much more important in the future; we're currently mostly ignoring this issue for predicated instruction sets like AVX-512 and SVE, but once that changes, not using SIMDE_FAST_EXCEPTIONS will likely result in a major performance hit.
Some functions have equivalents on different platforms which are the same except for how ties are rounded. For example, on x86 _mm_cvtps_epi32 seems to do pretty much the same thing as NEON's vcvtnq_s32_f32, but _mm_cvtps_epi32 uses the current rounding mode, whereas vcvtnq_s32_f32 always rounds ties towards even (which is the default rounding mode).
SIMDE_FAST_ROUND_TIES tells SIMDe to just ignore these differences.
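To make the difference concrete, here is a hedged, x86-only sketch (it assumes fesetround also updates the SSE rounding mode, which is the case on typical x86 C runtimes):

#include <emmintrin.h>
#include <fenv.h>
#include <stdio.h>

int main(void) {
  __m128 v = _mm_set1_ps(2.5f);
  /* Default rounding mode (ties-to-even): 2.5 -> 2, same as vcvtnq_s32_f32. */
  printf("%d\n", _mm_cvtsi128_si32(_mm_cvtps_epi32(v)));
  /* After changing the rounding mode, _mm_cvtps_epi32 follows it and yields 3,
     while NEON's vcvtnq_s32_f32 would still yield 2. */
  fesetround(FE_UPWARD);
  printf("%d\n", _mm_cvtsi128_si32(_mm_cvtps_epi32(v)));
  return 0;
}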
SIMDE_FAST_ROUND_MODE is a bit more heavy-handed than SIMDE_FAST_ROUND_TIES, as it applies to which rounding mode is used (usually truncation or towards even, but potentially also floor or ceiling).
Note that this mode only applies to functions where rounding is not the primary operation. For example, _mm_floor_ps will always round down, even if SIMDE_FAST_ROUND_MODE is defined.
For functions which convert from floating-point to integer types, there can be differences between platforms regarding how out-of-range values are handled. For example, if you're converting from 32-bit floats to 32-bit ints, how values outside of [INT32_MIN, INT32_MAX] are handled can vary; maybe on one platform out-of-range values return 0, whereas on others they are saturated. SIMDE_FAST_CONVERSION_RANGE allows SIMDe to ignore these differences.
It's worth noting that out-of-range conversions are undefined behavior in C, per § 6.3.1.4 of the standard:

When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
The SIMD APIs which SIMDe provides, however, were originally intended to be hardware-specific, so out-of-range conversions are defined by the hardware and SIMDe has to honor that. Most applications, though, do not rely on out-of-range conversions, so it should generally be pretty safe to enable this.
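As an illustration of the sort of difference involved (a hedged sketch; the results in the comments reflect the hardware behavior I'd expect on x86 and NEON respectively, not something SIMDe guarantees):

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
  /* 3.0e9f is above INT32_MAX.  On x86 the truncating conversion produces
     INT32_MIN (the "integer indefinite" value); on NEON,
     vcvtq_s32_f32(vdupq_n_f32(3.0e9f)) would saturate to INT32_MAX instead. */
  __m128i r = _mm_cvttps_epi32(_mm_set1_ps(3.0e9f));
  printf("%d\n", _mm_cvtsi128_si32(r)); /* prints -2147483648 on x86 */
  return 0;
}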
Not all of the implementations in SIMDe are as well-optimized as they could be. While our goal is to make sure that every function in SIMDe is as fast as we can make it on every platform, the reality is that SIMDe is enormous and our resources are limited, which means we have to focus our efforts. To that end, if you have real-world code using SIMDe on any target, any data you have about where in SIMDe your code is spending time would be extremely valuable to us.
Profiling tools and usage will vary by platform, but we'll take whatever we can get. If you'd like to add a section (or link) below on gathering profiling data on a specific platform or using a specific tool, please feel free.
Since SIMDe inlines everything by default, getting good profiling data can be a bit tricky. However, defining SIMDE_NO_INLINE prior to including SIMDe (for example, by adding -DSIMDE_NO_INLINE to your CFLAGS or CXXFLAGS environment variable(s)) will change this to never inline. It is a big performance hit, so you should never do it in production, but it can be invaluable for profiling!
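For example (a hypothetical invocation; adjust for your compiler and build system):

cc -O3 -g -DSIMDE_NO_INLINE -c mycode.c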
To figure out exactly what is slow when running under V8, we use V8's built-in sampling profiler.
We also need to enable debug information to get function names (pass -g to the compiler), and to turn off SIMDe's inlining declarations (using -DSIMDE_NO_INLINE) so that we can see which intrinsics take the most time.
Run benchmarks using d8 --prof Test.js, which generates a v8.log in the current working directory. Next, process the generated output using tools/linux-tick-processor v8.log.