Add runtime detection module #7078

yuhaoth · 2023-02-10T02:48:12Z

Description

This is a demo implementation for #7004 with Arm64 AES implementation.

Preceding-PR : #7384

This is the first step toward runtime detection, assuming #7004 is approved.

This PR only adds the runtime detection module and replaces the related code in the AES modules. Other parts should be replaced in follow-up PRs.

I prefer to follow the rules below.

Compiler checks and CPU modifiers should be put in the accelerator module.
Config option checks should be put in runtime.h.
MBEDTLS_RUNTIME_HAVE_CODE will be enabled in runtime.h:
- Each algorithm module has its own option for runtime detection, such as MBEDTLS_AES_RUNTIME_HAVE_CODE.
- If any algorithm module enables runtime detection, MBEDTLS_RUNTIME_HAVE_CODE will be enabled.
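
The macro rules above could look roughly like the following. This is only a sketch: the condition under which the AES module requests runtime detection is an assumption, and `runtime_module_is_built()` is a hypothetical helper that exists only to make the outcome observable.

```c
/* Sketch of the config checks that would live in runtime.h (assumed layout). */

/* Stand-in for a user config option; normally set in the configuration file. */
#define MBEDTLS_AESCE_C

/* Assumption: the AES module asks for runtime detection whenever one of
 * its hardware accelerators is enabled. */
#if defined(MBEDTLS_AESCE_C) || defined(MBEDTLS_AESNI_C) || \
    defined(MBEDTLS_PADLOCK_C)
#define MBEDTLS_AES_RUNTIME_HAVE_CODE
#endif

/* If any algorithm module enables runtime detection, enable the module.
 * Options of other algorithm modules would be OR-ed in here. */
#if defined(MBEDTLS_AES_RUNTIME_HAVE_CODE)
#define MBEDTLS_RUNTIME_HAVE_CODE
#endif

/* Hypothetical helper, only here so the result can be observed. */
static int runtime_module_is_built(void)
{
#if defined(MBEDTLS_RUNTIME_HAVE_CODE)
    return 1;
#else
    return 0;
#endif
}
```

With this layout, the algorithm modules never test accelerator options directly to decide whether the detection module is needed; they only set their own `*_RUNTIME_HAVE_CODE` option.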

Gatekeeper checklist

changelog provided, or not required
backport done, or not required
tests provided, or not required

Notes for the submitter

Please refer to the contributing guidelines, especially the
checklist for PR contributors.

uncleasm · 2023-02-16T12:50:29Z

library/aesce.c

+}
+
+TARGET_ATTR
+static inline uint8x16_t ghash_mult_rdc(uint8x16x2_t in)


There's nothing particularly wrong here, but reducing from uint8x16x3_t has 2-3x better throughput as follows:

uint8x16x3_t ghash_mult_128(uint8x16_t a, uint8x16_t b)
{
    uint8x16x3_t ret;
    uint8x16_t c = vextq_u8(b, b, 8);                     // swap the 64-bit halves of b
    ret.val[1] = pmull_low(a, c);                         // a0*b1
    ret.val[0] = pmull_high(a, b);                        // a1*b1
    ret.val[2] = pmull_low(a, b);                         // a0*b0
    ret.val[1] = veorq_u8(ret.val[1], pmull_high(a, c));  // a0*b1 + a1*b0
    return ret;
}

uint8x16_t gmul_reduce(uint8x16x3_t a)
{
    uint8x16_t const Z = vdupq_n_u8(0);
    // use 'asm' as an optimisation barrier to prevent loading R from memory
    uint64x2_t r = vreinterpretq_u64_u8(vdupq_n_u8(0x87));
    asm("" : "+w"(r));
    uint8x16_t const R = vreinterpretq_u8_u64(vshrq_n_u64(r, 64 - 8));
    uint8x16_t d = a.val[0];            // d3:d2:00:00
    uint8x16_t j = a.val[1];            // j2:j1
    uint8x16_t g = a.val[2];            // g1:g0
    uint8x16_t h = pmull_high(d, R);    // h2:h1 = reduction of d3
    uint8x16_t i = pmull_low(d, R);     // i1:i0 = reduction of d2
    uint8x16_t k = veorq_u8(j, h);      // k2:k1 = a0*b1 + a1*b0 + h2:h1
    uint8x16_t l = pmull_high(k, R);    // l1:l0 = reduction of k2
    uint8x16_t m = vextq_u8(Z, k, 8);   // m1:00 = k1:00
    uint8x16_t n = veorq_u8(g, i);      // n1:n0 = a0*b0 + reduction of d2
    uint8x16_t o = veorq_u8(n, l);      // o1:o0
    uint8x16_t p = veorq_u8(o, m);      // p1:p0 = final result
    return p;
}

There's no need to shift/combine the middle partial product to the high and low (this is also beneficial with Aggregated Reduction Method (or postponed reduction, multiplying with different powers of H)).
Fewer shifts means hugely reduced dependency chain and more parallelism (I measured about 3x better throughput with this method on M1, and 2x better on a very old Nokia Android One phone).
As a general comment, I'm glad that this library is being optimised.
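
For readers without NEON at hand, the same idea can be sketched in portable C. This is an illustration under stated assumptions, not the code from the PR: it uses plain polynomial bit order (bit i is the coefficient of x^i) rather than GHASH's reflected order, a software carry-less multiply stands in for PMULL, and all names (`u128_t`, `clmul64`, `gf128_mul_deferred`, `gf128_mul_ref`) are hypothetical. The modulus is x^128 + x^7 + x^2 + x + 1, which matches the constant 0x87 above.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128_t;
typedef struct { u128_t d, j, g; } u128x3_t;   /* high, middle, low products */

/* Software stand-in for PMULL: carry-less 64x64 -> 128-bit multiply. */
static u128_t clmul64(uint64_t a, uint64_t b)
{
    u128_t r = { 0, 0 };
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0) {
                r.hi ^= a >> (64 - i);
            }
        }
    }
    return r;
}

/* Three partial products; the middle one is NOT folded into high/low. */
static u128x3_t gf128_mult_128(u128_t a, u128_t b)
{
    u128x3_t p;
    u128_t t0 = clmul64(a.lo, b.hi);           /* a0*b1 */
    u128_t t1 = clmul64(a.hi, b.lo);           /* a1*b0 */
    p.d = clmul64(a.hi, b.hi);                 /* d3:d2, weight x^128 */
    p.j.lo = t0.lo ^ t1.lo;                    /* j1, weight x^64 */
    p.j.hi = t0.hi ^ t1.hi;                    /* j2, weight x^128 */
    p.g = clmul64(a.lo, b.lo);                 /* g1:g0, weight x^0 */
    return p;
}

/* Reduce modulo x^128 + x^7 + x^2 + x + 1, using x^128 = 0x87. */
static u128_t gf128_reduce(u128x3_t p)
{
    const uint64_t R = 0x87;
    u128_t h = clmul64(p.d.hi, R);             /* h2:h1 = reduction of d3 */
    u128_t i = clmul64(p.d.lo, R);             /* i1:i0 = reduction of d2 */
    u128_t k = { p.j.lo ^ h.lo, p.j.hi ^ h.hi }; /* k2:k1 */
    u128_t l = clmul64(k.hi, R);               /* l1:l0 = reduction of k2 */
    u128_t out;
    out.lo = p.g.lo ^ i.lo ^ l.lo;
    out.hi = p.g.hi ^ i.hi ^ l.hi ^ k.lo;      /* k1 lands in the high half */
    return out;
}

static u128_t gf128_mul_deferred(u128_t a, u128_t b)
{
    return gf128_reduce(gf128_mult_128(a, b));
}

/* Bit-by-bit reference (Horner's rule), used only for cross-checking. */
static u128_t gf128_mul_ref(u128_t a, u128_t b)
{
    u128_t r = { 0, 0 };
    for (int i = 127; i >= 0; i--) {
        uint64_t carry = r.hi >> 63;
        r.hi = (r.hi << 1) | (r.lo >> 63);
        r.lo = (r.lo << 1) ^ (carry ? 0x87 : 0);
        uint64_t bit = (i >= 64) ? (b.hi >> (i - 64)) & 1 : (b.lo >> i) & 1;
        if (bit) {
            r.lo ^= a.lo;
            r.hi ^= a.hi;
        }
    }
    return r;
}
```

Because the three partial products are kept separate, several products can be XOR-ed component-wise before a single call to `gf128_reduce`, which is what makes the aggregated reduction mentioned above possible.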

I will change it in #6918. Thanks, @uncleasm.

gilles-peskine-arm · 2023-02-16T22:57:27Z

library/aesce.c

+#include <arm_neon.h>
+
+#if defined(__linux__)
+#include <asm/hwcap.h>


I don't think including asm/hwcap.h is right. And since you're defining and using MBEDTLS_HWCAP_xxx constants manually, I don't think this header is used at all. I think we should use the system HWCAP_xxx if present though.

Will remove `<asm/hwcap.h>`.

And I do not think we should use the system HWCAP_* if present.
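
To make the two positions concrete, here is a minimal sketch of detection with library-owned constants on Linux/AArch64. The names `MBEDTLS_HWCAP_AES`, `MBEDTLS_HWCAP_PMULL` and `cpu_has_aes_pmull()` are assumptions for illustration; the bit positions follow the Linux arm64 HWCAP layout, and `getauxval(AT_HWCAP)` is the glibc/musl interface. On other platforms the sketch simply reports no support.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#endif

/* Hypothetical library-owned constants, defined manually so that the
 * code does not depend on <asm/hwcap.h> being present or up to date.
 * Values follow the Linux arm64 HWCAP bit layout. */
#define MBEDTLS_HWCAP_AES   (1UL << 3)
#define MBEDTLS_HWCAP_PMULL (1UL << 4)

/* Returns 1 if both AES and PMULL instructions are reported, else 0. */
static int cpu_has_aes_pmull(void)
{
#if defined(__linux__) && defined(__aarch64__)
    unsigned long hwcap = getauxval(AT_HWCAP);
    unsigned long want = MBEDTLS_HWCAP_AES | MBEDTLS_HWCAP_PMULL;
    return (hwcap & want) == want;
#else
    /* No detection path in this sketch: report "not available". */
    return 0;
#endif
}
```

Using the system `HWCAP_*` macros when they are defined, as the reviewer suggests, would replace the two `#define` lines with a conditional fallback; the rest of the function would stay the same.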

yuhaoth · 2023-09-27T06:26:49Z

Some things have changed since the last review.

The module has been renamed to cpuid.
MBEDTLS_RUNTIME_HAVE_CODE has been removed, and a conditional build was added in cpuid.c because of a CI failure.

Besides that, some issues should be resolved in the future.

bn_mul.h: 1) replace the architecture detection macros; 2) decide whether the module needs CPU feature detection.
Should we add generic arm64 CPU feature detection, for example via system registers or an illegal-instruction signal?
AESCE is available in the A32 and T32 states, so it should be enabled for those CPU states (see section F2.13.11 of the Arm® Architecture Reference Manual for A-profile architecture).
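
The illegal-instruction approach mentioned above can be sketched as follows on a POSIX system. This is an assumption-laden illustration, not code from the PR: `probe_runs_without_sigill()`, `probe_nop()` and `probe_raises_sigill()` are hypothetical names, and the unsupported instruction is simulated with `raise(SIGILL)`. A real probe would execute the instruction under test (for example an AES instruction in inline assembly) inside the probe function.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf probe_env;

/* SIGILL handler: jump back to the probe instead of terminating. */
static void probe_sigill_handler(int sig)
{
    (void) sig;
    siglongjmp(probe_env, 1);
}

/* Returns 1 if fn() ran to completion, 0 if it triggered SIGILL. */
static int probe_runs_without_sigill(void (*fn)(void))
{
    struct sigaction sa, old;
    volatile int ok = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = probe_sigill_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0) {
        return 0;
    }

    /* Save the signal mask so it is restored after siglongjmp(). */
    if (sigsetjmp(probe_env, 1) == 0) {
        fn();
        ok = 1;
    }

    sigaction(SIGILL, &old, NULL);   /* restore the previous handler */
    return ok;
}

/* Stand-in for a probe whose instruction is supported. */
static void probe_nop(void)
{
}

/* Stand-in for a probe whose instruction is NOT supported. */
static void probe_raises_sigill(void)
{
    raise(SIGILL);
}
```

This approach is not thread-safe as written and changes process-wide signal state, which is one reason a system-register or OS-provided query may be preferable where available.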

yuhaoth mentioned this pull request Feb 10, 2023

Proposal: Build-time and runtime detection for hardware acceleration with advance CPU features #7004

Open

3 tasks

yuhaoth force-pushed the pr/add-aes-compile-time-detection branch 3 times, most recently from d3b22f7 to 217b212 Compare February 15, 2023 12:39

yuhaoth mentioned this pull request Feb 15, 2023

Add AES with armv8 crypto extension #6895

Merged

3 tasks

uncleasm reviewed Feb 16, 2023

gilles-peskine-arm reviewed Feb 16, 2023

yuhaoth mentioned this pull request Feb 20, 2023

Fix: linux aarch64 compile when use llvm-toolset-7.0 on centos:7 #6224

Closed

3 tasks

yuhaoth added needs-work needs-preceding-pr Requires another PR to be merged first labels Feb 21, 2023

yuhaoth force-pushed the pr/add-aes-compile-time-detection branch from a4b9fdc to 55bfe86 Compare February 21, 2023 07:52

gilles-peskine-arm mentioned this pull request Feb 21, 2023

Linux/Aarch64: support SHA acceleration detection with older libc #7121

Closed

3 tasks

yuhaoth marked this pull request as draft March 28, 2023 06:39

yuhaoth removed the needs-preceding-pr Requires another PR to be merged first label Mar 28, 2023

yuhaoth force-pushed the pr/add-aes-compile-time-detection branch 4 times, most recently from fd8c606 to 64c8f73 Compare April 4, 2023 06:07

yuhaoth changed the title ~~[WIP]Enable runtime detection and build-time detection for Arm64 AES hardware accelerations~~ [WIP]Enable runtime detection for Arm64 AES hardware accelerator Apr 4, 2023

yuhaoth force-pushed the pr/add-aes-compile-time-detection branch from 64c8f73 to 9042866 Compare April 4, 2023 08:17

yuhaoth changed the title ~~[WIP]Enable runtime detection for Arm64 AES hardware accelerator~~ Add unify runtime detecion module and arm64 aesce detection. Apr 4, 2023

yuhaoth linked an issue Apr 4, 2023 that may be closed by this pull request

Improve runtime cpu feature detection #6921

Closed

yuhaoth changed the title ~~Add unify runtime detecion module and arm64 aesce detection.~~ Add unify runtime detecion module Apr 4, 2023

yuhaoth marked this pull request as ready for review April 4, 2023 09:42

yuhaoth added 22 commits September 27, 2023 10:13

Add x86/x64 runtime detection

9b2c2e0

Signed-off-by: Jerry Yu <[email protected]>

adjust padlock guard macros

ae2d209

move the guards to `runtime_internal.h` for keeping consistent with AESCE. Signed-off-by: Jerry Yu <[email protected]>

replace runtime detection for padlock

07462c4

Signed-off-by: Jerry Yu <[email protected]>

replace runtime detection for aesni

a9f0c9c

Signed-off-by: Jerry Yu <[email protected]>

Add freebsd runtime detection for arm64

f273ca0

Signed-off-by: Jerry Yu <[email protected]>

add windows arm64 runtime detection

cd14718

Signed-off-by: Jerry Yu <[email protected]>

Add apple runtime detection for Arm64

0a618ed

Signed-off-by: Jerry Yu <[email protected]>

add change log entry

1b2304b

Signed-off-by: Jerry Yu <[email protected]>

improve document

3a3fd64

Signed-off-by: Jerry Yu <[email protected]>

improve documents

621d835

Signed-off-by: Jerry Yu <[email protected]>

fix typo error

f203633

Signed-off-by: Jerry Yu <[email protected]>

change changelog name

52aec49

Signed-off-by: Jerry Yu <[email protected]>

fix various issues

04142b8

Signed-off-by: Jerry Yu <[email protected]>

Replace CPU feature detection interface

848521f

Also, define hwcap variable unconditionally.With/without alternative function, `mbedtls_cpu_hwcaps` is needed now Signed-off-by: Jerry Yu <[email protected]>

Remove detection function in modules

2b8bfe2

Signed-off-by: Jerry Yu <[email protected]>

fix test failure

d179261

Those function has been removed Signed-off-by: Jerry Yu <[email protected]>

Remove redundant macro

42defb2

Signed-off-by: Jerry Yu <[email protected]>

rename runtime module to cpuid module

ff2b9c8

Signed-off-by: Jerry Yu <[email protected]>

Rename macros and add config check

b691fe1

MBEDTLS_AES_RUNTIME_HAVE_CODE -> MBEDTLS_AES_CPUID_HAVE_CODE Signed-off-by: Jerry Yu <[email protected]>

Improve readability

8244547

Remove some temp macros. They are not necessary for checking if cpuid is needed Signed-off-by: Jerry Yu <[email protected]>

disable cpuid module when not needed

431fd69

when AESNI and padlock are disable, compiler reports unused function error. It can be fixed within `cpu_feature_get()`, but it reduces readability. So we disable the module when it is not needed Signed-off-by: Jerry Yu <[email protected]>

fix baremetal build failure

adc42c7

Signed-off-by: Jerry Yu <[email protected]>

yuhaoth force-pushed the pr/add-aes-compile-time-detection branch from 447190d to adc42c7 Compare September 27, 2023 02:13

yuhaoth requested review from tom-cosgrove-arm and daverodgman September 27, 2023 05:42

yuhaoth added needs-review Every commit must be reviewed by at least two team members, and removed needs-work labels Sep 27, 2023

yuhaoth mentioned this pull request Oct 24, 2023

Support SHA256 acceleration on Armv8 thumb2 and arm #8298

Merged

3 tasks

daverodgman marked this pull request as draft January 30, 2024 13:30
