There are many C/C++ compilers available for Arm64 including:
- NVIDIA HPC Compiler
- Cray/HPE Compiler
- GCC
- LLVM
- Arm Compiler for Linux
The NVIDIA HPC Compiler is a direct descendant of the popular PGI C/C++ compiler, so it accepts the same compiler flags. GCC and LLVM operate more or less the same on Arm64 as on other architectures, except for the `-mcpu` flag, as described below. The Arm Compiler for Linux is based on LLVM and includes additional optimizations for Arm Neoverse cores that make it a highly performant compiler for CPU-only applications. However, it is missing some Fortran 2008 and OpenMP 4.5+ features that may be desirable for GPU-accelerated applications.
For GCC on Arm64, `-mcpu=` specifies both the target architecture and the tuning, so it is generally better to use it rather than `-march` or `-mtune` when you're building for a specific CPU. You can find additional details in this presentation from Arm Inc. to Stony Brook University.
CPU | Flag | GCC version | LLVM version |
---|---|---|---|
Any Arm64 | -mcpu=native | GCC-9+ | Clang/LLVM 10+ |
Ampere Altra | -mcpu=neoverse-n1 | GCC-9+ | Clang/LLVM 10+ |
Whenever possible, please use the latest compiler version available on your system. Newer compilers provide better support and optimizations for Arm64 processors. Many codes will demonstrate at least 15% better performance when using GCC 10 vs. GCC 7. The table below shows the GCC and LLVM compiler versions available in common Linux distributions. A starred version marks the default system compiler.
Distribution | GCC | Clang/LLVM |
---|---|---|
Ubuntu 22.04 | 9, 10, 11*, 12 | 11, 12, 13, 14* |
Ubuntu 20.04 | 7, 8, 9*, 10 | 6, 7, 8, 9, 10, 11, 12 |
Ubuntu 18.04 | 4.8, 5, 6, 7*, 8 | 3.9, 4, 5, 6, 7, 8, 9, 10 |
Debian 10 | 7, 8* | 6, 7, 8 |
Red Hat EL8 | 8*, 9, 10 | 10 |
SUSE Linux ES15 | 7*, 9, 10 | 7 |
All server-class Arm64 processors support low-cost atomic operations, known as the Large System Extensions (LSE), which can improve system throughput for CPU-to-CPU communication, locks, and mutexes. On recent Arm64 CPUs, the improvement can be up to an order of magnitude when using LSE atomics instead of load/store exclusives. See Locks, Synchronization, and Atomics for details.
GCC's `-mcpu=native` flag enables all instructions supported by the host CPU, including LSE. If you're cross compiling, use the appropriate `-mcpu` option for your target CPU, e.g. `-mcpu=neoverse-n1` for the Ampere Altra CPU. You can check which Arm features GCC will enable with the `-mcpu=native` flag by using this command:
gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE
For example, on the Ampere Altra CPU with GCC 9.4, we see `__ARM_FEATURE_ATOMICS 1`, indicating that LSE atomics are enabled:
gcc -dM -E -mcpu=native - < /dev/null | grep ARM_FEATURE
#define __ARM_FEATURE_ATOMICS 1
#define __ARM_FEATURE_UNALIGNED 1
#define __ARM_FEATURE_AES 1
#define __ARM_FEATURE_IDIV 1
#define __ARM_FEATURE_QRDMX 1
#define __ARM_FEATURE_DOTPROD 1
#define __ARM_FEATURE_CRYPTO 1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
#define __ARM_FEATURE_FMA 1
#define __ARM_FEATURE_CLZ 1
#define __ARM_FEATURE_SHA2 1
#define __ARM_FEATURE_CRC32 1
#define __ARM_FEATURE_NUMERIC_MAXMIN 1
To confirm that LSE instructions are used, disassemble the binary with the `objdump` command-line utility and count the LSE instructions. A nonzero count means LSE instructions are present:
$ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l
To check whether the application binary contains load and store exclusives:
$ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l
To quickly get a prototype running on Arm64, you can use sse2neon (https://github.com/DLTcollab/sse2neon), a translator of x64 intrinsics to NEON. sse2neon provides a quick starting point for porting performance-critical codes to Arm. It shortens the time needed to get a working Arm program, which can then be used to extract profiles and identify hot paths in the code. The header file `sse2neon.h` contains several of the functions provided by standard x64 include files like `xmmintrin.h`, implemented with NEON instructions to reproduce the exact semantics of the x64 intrinsics. Once a profile is established, the hot paths can be rewritten directly with NEON intrinsics to avoid the overhead of the generic sse2neon translation.
Note that GCC's `__sync` built-ins are outdated and may be biased towards the x86 memory model. Use the `__atomic` versions of these functions instead of the `__sync` versions. See https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html for more details.
The C standard doesn't specify the signedness of `char`. On x86 `char` is signed by default, while on Arm it is unsigned by default. This can be addressed by using standard integer types that explicitly specify the signedness (e.g. `uint8_t` and `int8_t`) or by compiling with `-fsigned-char`.
Many Arm64 CPUs support Arm dot-product instructions, commonly used for quantized machine learning inference workloads, and half-precision floating point (FP16). These features enable performant and power-efficient machine learning by doubling the number of operations per second and reducing the memory footprint compared to single-precision floating point (`_float32`), all while still enjoying a large dynamic range. Compiling with `-mcpu=native` will enable these features when available. See the examples page for examples of how to use these features in ML frameworks like TensorFlow and PyTorch.