Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rANS_static32x16pr_neon.c doesn't build on AArch32 #81

Closed
clausecker opened this issue Apr 17, 2023 · 8 comments · Fixed by #82
Closed

rANS_static32x16pr_neon.c doesn't build on AArch32 #81

clausecker opened this issue Apr 17, 2023 · 8 comments · Fixed by #82

Comments

@clausecker
Copy link

Your code base has NEON-optimised code in rANS_static32x16pr_neon.c. Unfortunately, this code only builds on AArch64 (arm64). It does not build on AArch32 as many of the intrinsics used therein are not supported in AArch32 mode:

htscodecs/htscodecs/rANS_static32x16pr_neon.c:200:31: warning: implicit declaration of function 'vtrn1q_u32' is invalid in C99 [-Wimplicit-function-declaration]
            uint32x4_t A1_1 = vtrn1q_u32(A_1, B_1);
                              ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:200:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t A1_1 = vtrn1q_u32(A_1, B_1);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:201:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t C1_1 = vtrn1q_u32(C_1, D_1);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:202:31: warning: implicit declaration of function 'vtrn2q_u32' is invalid in C99 [-Wimplicit-function-declaration]
            uint32x4_t A2_1 = vtrn2q_u32(A_1, B_1);
                              ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:202:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t A2_1 = vtrn2q_u32(A_1, B_1);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:203:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t C2_1 = vtrn2q_u32(C_1, D_1);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:207:39: warning: implicit declaration of function 'vtrn1q_u64' is invalid in C99 [-Wimplicit-function-declaration]
            uint32x4_t Xmaxv1=u32_u64(vtrn1q_u64(u64_u32(A1_1),u64_u32(C1_1)));
                                      ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:207:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t Xmaxv1=u32_u64(vtrn1q_u64(u64_u32(A1_1),u64_u32(C1_1)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:208:39: warning: implicit declaration of function 'vtrn2q_u64' is invalid in C99 [-Wimplicit-function-declaration]
            uint32x4_t Biasv1=u32_u64(vtrn2q_u64(u64_u32(A1_1),u64_u32(C1_1)));
                                      ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:208:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t Biasv1=u32_u64(vtrn2q_u64(u64_u32(A1_1),u64_u32(C1_1)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:209:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t RFv1  =u32_u64(vtrn1q_u64(u64_u32(A2_1),u64_u32(C2_1)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:210:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t FSv1  =u32_u64(vtrn2q_u64(u64_u32(A2_1),u64_u32(C2_1)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:217:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t A1_2 = vtrn1q_u32(A_2, B_2);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:218:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t C1_2 = vtrn1q_u32(C_2, D_2);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:219:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t A2_2 = vtrn2q_u32(A_2, B_2);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:220:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t C2_2 = vtrn2q_u32(C_2, D_2);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:222:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t Xmaxv2=u32_u64(vtrn1q_u64(u64_u32(A1_2),u64_u32(C1_2)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:223:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t Biasv2=u32_u64(vtrn2q_u64(u64_u32(A1_2),u64_u32(C1_2)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:224:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t RFv2  =u32_u64(vtrn1q_u64(u64_u32(A2_2),u64_u32(C2_2)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:225:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t FSv2  =u32_u64(vtrn2q_u64(u64_u32(A2_2),u64_u32(C2_2)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:232:31: warning: implicit declaration of function 'vaddvq_u32' is invalid in C99 [-Wimplicit-function-declaration]
            uint32_t imask1 = vaddvq_u32(vandq_u32(Cv1, bit));
                              ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:237:21: warning: implicit declaration of function 'vqtbl1_u8' is invalid in C99 [-Wimplicit-function-declaration]
            norm1 = vqtbl1_u8(vreinterpretq_u8_u32(Rv1),vtab[imask1]);
                    ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:237:19: error: assigning to 'uint8x8_t' (vector of 8 'uint8_t' values) from incompatible type 'int'
            norm1 = vqtbl1_u8(vreinterpretq_u8_u32(Rv1),vtab[imask1]);
                  ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:238:19: error: assigning to 'uint8x8_t' (vector of 8 'uint8_t' values) from incompatible type 'int'
            norm2 = vqtbl1_u8(vreinterpretq_u8_u32(Rv2),vtab[imask2]);
                  ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:264:31: warning: implicit declaration of function 'vmull_high_u32' is invalid in C99 [-Wimplicit-function-declaration]
            uint64x2_t qvh1 = vmull_high_u32(Rv1, RFv1);
                              ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:264:24: error: initializing 'uint64x2_t' (vector of 2 'uint64_t' values) with an expression of incompatible type 'int'
            uint64x2_t qvh1 = vmull_high_u32(Rv1, RFv1);
                       ^      ~~~~~~~~~~~~~~~~~~~~~~~~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]

Please adapt your build system to build this code on AArch64 only. Alternatively, fix the code so these files compile on AArch32, too.

@jkbonfield
Copy link
Collaborator

Is this building from a release, or from the develop branch at github?

I assume this is related to the changes in samtools/htslib#1587, but I cannot test it as I don't have any way of testing on an AArch32 system (or does -m32 do this on a 64-bit system?).

See also #78. It sounds like Rob's changes are to use compile time #ifdef __ARM_NEON checks. I don't understand why that would attempt to add NEON instructions on a CPU that doesn't include them. Any advice on how to detect this properly at compile time would be welcomed.

@clausecker
Copy link
Author

This was found on 1.17, your latest release.

ARMv7 systems do generally support NEON, but NEON on AArch32 is only a subset of NEON on AArch64. The ARM intrinsics guides specify which NEON intrinsics are available on both and which are AArch64 only. The problem is that many of the intrinsics your code uses are available for AArch64 only. The macro __ARM_NEON only shows if any sort of NEON support is available. This is true on AArch32 too. But not all intrinsics are available.

@jkbonfield
Copy link
Collaborator

Thanks. It sounds like we need to check __aarch64__ in addition to __ARM_NEON then. Eg https://github.com/daviesrob/htscodecs/blob/24eb2b7a1282d27002ef9848a2aa1dd92d38dd78/htscodecs/rANS_static32x16pr_neon.c#L35

@clausecker
Copy link
Author

Yes, correct. Or rework the code so it only uses intrinsics that also work on AArch32. Or add polyfill for the missing intrinsics when running on AArch32.

@jkbonfield jkbonfield transferred this issue from samtools/htslib Apr 18, 2023
@jkbonfield
Copy link
Collaborator

jkbonfield commented Apr 18, 2023

PR #82 should enable it to compiler, but the difference is trivial. I don't have SIMD for Aarch32 as that's complex and probably also fruitless, but it can obviously run the scalar implementation just fine.

Unfortunately I'm not able to test it as AWS Arm is Aarch64 only.

jkbonfield added a commit to jkbonfield/htscodecs that referenced this issue Apr 18, 2023
NEON alone isn't a sufficient guard as AArch32 also has some limited
Neon capabilities.  While we could no doubt have a 32-bit alternative,
for now this is the simple fix and let aarch32 use the scalar
implementation.

Doing a 32-bit neon is a complex task and without having access to the
hardware it's pretty much impossible.  I also wouldn't have high hopes
for any significant speed gains over scalar with only half the lanes
available.

Fixes samtools#81
jkbonfield added a commit to jkbonfield/htscodecs that referenced this issue Apr 18, 2023
NEON alone isn't a sufficient guard as AArch32 also has some limited
Neon capabilities.  While we could no doubt have a 32-bit alternative,
for now this is the simple fix and let aarch32 use the scalar
implementation.

Doing a 32-bit neon is a complex task and without having access to the
hardware it's pretty much impossible.  I also wouldn't have high hopes
for any significant speed gains over scalar with only half the lanes
available.

Fixes samtools#81
@clausecker
Copy link
Author

This matches the changes I plan to deploy as local patches to the FreeBSD port of this package. I've also patched the NEON detection in configure to ensure it only triggers if AArch64-only intrinsics are available.

@jkbonfield
Copy link
Collaborator

The neon tests in configure were removed in #78 as they conflict with MacOS multi-arch builds. The Mac basically has a compiler that performs 2 compilations, for ARM and Amd64 CPUs, so any tests of compiler capability (like does it support a -mavx2 option) will break. So instead it all needs to be done with on-the-fly CPU detection via ifdefs instead.

@clausecker
Copy link
Author

@jkbonfield Good to hear. This is LGTM then from my side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants