Add runtime detection module #7078
base: development
Conversation
Force-pushed from d3b22f7 to 217b212.
library/aesce.c
Outdated
}

TARGET_ATTR
static inline uint8x16_t ghash_mult_rdc(uint8x16x2_t in)
There's nothing particularly wrong here, but reducing from `uint8x16x3_t` has 2-3x better throughput, as follows:
uint8x16x3_t ghash_mult_128(uint8x16_t a, uint8x16_t b) {
    uint8x16x3_t ret;
    uint8x16_t c = vextq_u8(b, b, 8);                    // swap the 64-bit halves of b
    ret.val[1] = pmull_high(a, c);                       // a1*b0 (middle, part 1)
    ret.val[0] = pmull_high(a, b);                       // a1*b1 (high partial product)
    ret.val[2] = pmull_low(a, b);                        // a0*b0 (low partial product)
    ret.val[1] = veorq_u8(ret.val[1], pmull_low(a, c));  // middle ^= a0*b1
    return ret;
}
uint8x16_t gmul_reduce(uint8x16x3_t a) {
    uint8x16_t const Z = vdupq_n_u8(0);
    // use 'asm' as an optimisation barrier to prevent loading R from memory
    uint64x2_t r = vreinterpretq_u64_u8(vdupq_n_u8(0x87));
    asm("" : "+w"(r));
    uint8x16_t const R = vreinterpretq_u8_u64(vshrq_n_u64(r, 64 - 8));
    uint8x16_t d = a.val[0];          // d3:d2:00:00
    uint8x16_t j = a.val[1];          // j2:j1
    uint8x16_t g = a.val[2];          // g1:g0
    uint8x16_t h = pmull_high(d, R);  // h2:h1 = reduction of d3
    uint8x16_t i = pmull_low(d, R);   // i1:i0 = reduction of d2
    uint8x16_t k = veorq_u8(j, h);    // k2:k1 = a0*b1 + a1*b0 + h2:h1
    uint8x16_t l = pmull_high(k, R);  // l1:l0 = reduction of k2
    uint8x16_t m = vextq_u8(Z, k, 8); // m1:00 = k1:00
    uint8x16_t n = veorq_u8(g, i);    // n1:n0 = a0*b0 + reduction of d2
    uint8x16_t o = veorq_u8(n, l);    // o1:o0
    uint8x16_t p = veorq_u8(o, m);    // p1:p0 = final reduced result
    return p;
}
There's no need to shift/combine the middle partial product into the high and low halves (this is also beneficial with the Aggregated Reduction Method, i.e. postponed reduction, multiplying with different powers of H).
Fewer shifts mean a much shorter dependency chain and more parallelism (I measured about 3x better throughput with this method on M1, and 2x better on a very old Nokia Android One phone).
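To make the "different powers of H" remark concrete, here is a minimal sketch (not code from this PR) of aggregated reduction over two blocks. It assumes the ghash_mult_128()/gmul_reduce() helpers above and hypothetical precomputed key powers h1 = H and h2 = H^2 stored in whatever byte/bit order those helpers expect:

/* Hedged sketch: process two GHASH blocks with a single reduction.
 * Y2 = (Y0 ^ X1)*H^2 ^ X2*H, so the partial products of the two
 * multiplications are XORed lane-wise before one gmul_reduce() call. */
static inline uint8x16_t ghash_2blocks(uint8x16_t acc,
                                       uint8x16_t blk0, uint8x16_t blk1,
                                       uint8x16_t h2, uint8x16_t h1)
{
    uint8x16x3_t p0 = ghash_mult_128(veorq_u8(acc, blk0), h2); /* (Y0^X1)*H^2 */
    uint8x16x3_t p1 = ghash_mult_128(blk1, h1);                /* X2*H        */

    uint8x16x3_t sum;
    sum.val[0] = veorq_u8(p0.val[0], p1.val[0]);  /* high partial products   */
    sum.val[1] = veorq_u8(p0.val[1], p1.val[1]);  /* middle partial products */
    sum.val[2] = veorq_u8(p0.val[2], p1.val[2]);  /* low partial products    */

    return gmul_reduce(sum);  /* one reduction amortised over both blocks */
}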
As a general comment, I'm glad that this library is being optimised.
library/aesce.c
Outdated
#include <arm_neon.h>

#if defined(__linux__)
#include <asm/hwcap.h>
I don't think including `asm/hwcap.h` is right. And since you're defining and using `MBEDTLS_HWCAP_xxx` constants manually, I don't think this header is used at all. I think we should use the system `HWCAP_xxx` if present, though.
Will remove `<asm/hwcap.h>`.
And I do not think we should use the system `HWCAP_*` if present.
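For reference, here is a minimal sketch of the Linux/AArch64 detection path being discussed, defining the constant locally instead of pulling in <asm/hwcap.h>. The macro and function names are illustrative, not the PR's actual identifiers:

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */

/* Bit 3 of AT_HWCAP is the Armv8-A AES feature on AArch64 Linux
 * (numerically the same as the kernel's HWCAP_AES), defined here
 * so that <asm/hwcap.h> is not needed. */
#define MBEDTLS_HWCAP_AES   (1UL << 3)

static int aes_hw_available(void)   /* illustrative helper name */
{
    return (getauxval(AT_HWCAP) & MBEDTLS_HWCAP_AES) != 0;
}
#endif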
Force-pushed from a4b9fdc to 55bfe86.
Force-pushed from fd8c606 to 64c8f73.
Commit retitled from "runtime detection and build-time detection for Arm64 AES hardware accelerations" to "runtime detection for Arm64 AES hardware accelerator".
Force-pushed from 64c8f73 to 9042866.
runtime detection for Arm64 AES hardware accelerator. Signed-off-by: Jerry Yu <[email protected]>
Move the guards to `runtime_internal.h` to keep consistent with AESCE. Signed-off-by: Jerry Yu <[email protected]>
Also, define the hwcap variable unconditionally. With or without the alternative function, `mbedtls_cpu_hwcaps` is needed now. Signed-off-by: Jerry Yu <[email protected]>
Those functions have been removed. Signed-off-by: Jerry Yu <[email protected]>
MBEDTLS_AES_RUNTIME_HAVE_CODE -> MBEDTLS_AES_CPUID_HAVE_CODE Signed-off-by: Jerry Yu <[email protected]>
Remove some temp macros. They are not necessary for checking whether cpuid is needed. Signed-off-by: Jerry Yu <[email protected]>
When AESNI and padlock are disabled, the compiler reports an unused-function error. It could be fixed within `cpu_feature_get()`, but that reduces readability, so we disable the module when it is not needed. Signed-off-by: Jerry Yu <[email protected]>
Force-pushed from 447190d to adc42c7.
Some things have changed since the last review.
Besides that, some issues should be resolved in the future.
Description
This is a demo implementation for #7004, based on the Arm64 AES implementation.
Preceding PR: #7384
This is the first step for runtime detection, assuming #7004 is approved.
In this PR we only add the runtime detection module and replace the related code in the AES modules. Other parts will be replaced in follow-up PRs.
I prefer to follow the rules below (see the sketch after this list):
- The runtime detection interface lives in `runtime.h`.
- `MBEDTLS_RUNTIME_HAVE_CODE` will be enabled in `runtime.h`.
- When `MBEDTLS_AES_RUNTIME_HAVE_CODE` is defined, `MBEDTLS_RUNTIME_HAVE_CODE` will be enabled.
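As one reading of those rules, here is a hedged sketch of the guard structure. The header name follows the commit note about `runtime_internal.h`, and the set of options that trigger the AES-level flag is illustrative, not necessarily the PR's exact code:

/* runtime_internal.h (illustrative layout) */

/* An AES backend that needs CPU feature detection enables the AES-level flag... */
#if defined(MBEDTLS_AESCE_C) || defined(MBEDTLS_AESNI_C) || defined(MBEDTLS_PADLOCK_C)
#define MBEDTLS_AES_RUNTIME_HAVE_CODE
#endif

/* ...which in turn enables the shared runtime detection module. */
#if defined(MBEDTLS_AES_RUNTIME_HAVE_CODE)
#define MBEDTLS_RUNTIME_HAVE_CODE
#endif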
Gatekeeper checklist
Notes for the submitter
Please refer to the contributing guidelines, especially the
checklist for PR contributors.