Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add runtime detection module #7078

Draft
wants to merge 29 commits into
base: development
Choose a base branch
from

Conversation

yuhaoth
Copy link
Contributor

@yuhaoth yuhaoth commented Feb 10, 2023

Description

This is a demo implementation for #7004 with Arm64 AES implementation.

Preceding-PR : #7384

This is the first step for runtime detection if #7004 got approved.

We will only add runtime detection module in this PR and replace relative code in AES modules. Other part should be replace in next PRs.

I prefer follow bellow ruler.

  • Compiler check and CPU modifiers should be put in accelerator module.
  • Config option checks should be put in runtime.h.
  • MBEDTLS_RUNTIME_HAVE_CODE will be enabled in runtime.h
    • Any algorithm module has there own option for runtime detection. like MBEDTLS_AES_RUNTIME_HAVE_CODE
    • if any algorithm module enable runtime detection, MBEDTLS_RUNTIME_HAVE_CODE will be enabled.

Gatekeeper checklist

  • changelog provided, or not required
  • backport done, or not required
  • tests provided, or not required

Notes for the submitter

Please refer to the contributing guidelines, especially the
checklist for PR contributors.

library/aesce.c Outdated
}

TARGET_ATTR
static inline uint8x16_t ghash_mult_rdc(uint8x16x2_t in)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing particularly wrong here, but reducing from uint8x16x3_t has 2-3x better throughput as follows:

uint8x16x3_t ghash_mult_128(uint8x16_t a, uint8x16_t b) {
      uint8x16x3_t ret;
      uint8x16_t c = vextq_u8(b, b);
      ret.val[1] = pmull_low(a,c);
      ret.val[0] = pmull_high(a,b);
      ret.val[2] = pmull_low(a,b);
      ret.val[1] = veorq_u8(ret.val[1], pmull_low(a,c));
      return ret;
}

uint8x16_t gmul_reduce(uint8x16x3_t a) {
    uint8x16_t const Z = vdupq_n_u8(0);
    // use 'asm' as an optimisation barrier to prevent loading R from memory
    uint64x2_t r = vreinterpretq_u64_u8(vdupq_n_u8(0x87));
    asm("" : "+w"(r));
    uint8x16_t const R = vreinterpretq_u8_u64(vshrq_n_u64(r, 64 - 8));
    uint8x16_t d = a.val[0];                //     d3:d2:00:00
    uint8x16_t j = a.val[1];                //        j2:j1
    uint8x16_t g = a.val[2];                //           g1:g0
    uint8x16_t h = pmull_high(d, R);        //        h2:h1     = reduction of d3
    uint8x16_t i = pmull_low(d, R);         //           i1:i0  = reduction of d2
    uint8x16_t k = veorq_u8(j, h);          //        k2:k1     = a0*b1 + a1*b0 + h2:h1
    uint8x16_t l = pmull_high(k, R);        //           l1:l0  = reduction of k2
    uint8x16_t m = vextq_u8(Z, k, 8);       //           m1:00  = k1:00
    uint8x16_t n = veorq_u8(g, i);          //           n1:n0  = a0*b0 + reduction of d2
    uint8x16_t o = veorq_u8(n, l);          //           o1:o0
    uint8x16_t p = veorq_u8(o, m);          //           o1:o0
    return p;
}

There's no need to shift/combine the middle partial product to the high and low (this is also beneficial with Aggregated Reduction Method (or postponed reduction, multiplying with different powers of H)).
Fewer shifts means hugely reduced dependency chain and more parallelism (I measured about 3x better throughput with this method on M1, and 2x better on a very old Nokia Android One phone).
As a general comment, I'm glad that this library is being optimised.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change it in #6918. thanks @uncleasm

library/aesce.c Outdated
#include <arm_neon.h>

#if defined(__linux__)
#include <asm/hwcap.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think including asm/hwcap.h is right. And since you're defining and using MBEDTLS_HWCAP_xxx constants manually, I don't think this header is used at all. I think we should use the system HWCAP_xxx if present though.

Copy link
Contributor Author

@yuhaoth yuhaoth Feb 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove <asm/hwcap.h.

And I do not think we should use the system HWCAP_* if present.

@yuhaoth yuhaoth added needs-work needs-preceding-pr Requires another PR to be merged first labels Feb 21, 2023
@yuhaoth yuhaoth force-pushed the pr/add-aes-compile-time-detection branch from a4b9fdc to 55bfe86 Compare February 21, 2023 07:52
@yuhaoth yuhaoth marked this pull request as draft March 28, 2023 06:39
@yuhaoth yuhaoth removed the needs-preceding-pr Requires another PR to be merged first label Mar 28, 2023
@yuhaoth yuhaoth force-pushed the pr/add-aes-compile-time-detection branch 4 times, most recently from fd8c606 to 64c8f73 Compare April 4, 2023 06:07
@yuhaoth yuhaoth changed the title [WIP]Enable runtime detection and build-time detection for Arm64 AES hardware accelerations [WIP]Enable runtime detection for Arm64 AES hardware accelerator Apr 4, 2023
@yuhaoth yuhaoth force-pushed the pr/add-aes-compile-time-detection branch from 64c8f73 to 9042866 Compare April 4, 2023 08:17
@yuhaoth yuhaoth changed the title [WIP]Enable runtime detection for Arm64 AES hardware accelerator Add unify runtime detecion module and arm64 aesce detection. Apr 4, 2023
@yuhaoth yuhaoth linked an issue Apr 4, 2023 that may be closed by this pull request
@yuhaoth yuhaoth added component-platform Portability layer and build scripts needs-review Every commit must be reviewed by at least two team members, component-crypto Crypto primitives and low-level interfaces needs-preceding-pr Requires another PR to be merged first needs-reviewer This PR needs someone to pick it up for review and removed needs-work labels Apr 4, 2023
@yuhaoth yuhaoth changed the title Add unify runtime detecion module and arm64 aesce detection. Add unify runtime detecion module Apr 4, 2023
@yuhaoth yuhaoth marked this pull request as ready for review April 4, 2023 09:42
move the guards to `runtime_internal.h` for keeping
consistent with AESCE.

Signed-off-by: Jerry Yu <[email protected]>
Signed-off-by: Jerry Yu <[email protected]>
Signed-off-by: Jerry Yu <[email protected]>
Signed-off-by: Jerry Yu <[email protected]>
Signed-off-by: Jerry Yu <[email protected]>
Signed-off-by: Jerry Yu <[email protected]>
Signed-off-by: Jerry Yu <[email protected]>
Also, define hwcap variable unconditionally.With/without
alternative function, `mbedtls_cpu_hwcaps` is needed now

Signed-off-by: Jerry Yu <[email protected]>
Those function has been removed

Signed-off-by: Jerry Yu <[email protected]>
Signed-off-by: Jerry Yu <[email protected]>
MBEDTLS_AES_RUNTIME_HAVE_CODE -> MBEDTLS_AES_CPUID_HAVE_CODE

Signed-off-by: Jerry Yu <[email protected]>
Remove some temp macros. They are not
necessary for checking if cpuid is needed

Signed-off-by: Jerry Yu <[email protected]>
when AESNI and padlock are disable, compiler
reports unused function error. It can be fixed
within `cpu_feature_get()`, but it reduces
readability. So we disable the module when it
is not needed

Signed-off-by: Jerry Yu <[email protected]>
@yuhaoth yuhaoth force-pushed the pr/add-aes-compile-time-detection branch from 447190d to adc42c7 Compare September 27, 2023 02:13
@yuhaoth yuhaoth added needs-review Every commit must be reviewed by at least two team members, and removed needs-work labels Sep 27, 2023
@yuhaoth
Copy link
Contributor Author

yuhaoth commented Sep 27, 2023

Some thing has been changed from last review.

  • The module is renamed to cpuid
  • MBEDTLS_RUNTIME_HAVE_CODE is removed and add conditional build in cpuid.c due to CI failure.

Beside that, some issue should be resolve in future.

  • bn_mul.h : 1) replace architecture detection macros 2) if the module need CPU feature detection.
  • Should we add generic arm64 CPU feature detection ? detect with sys register or illegal instruction signal.
  • AESCE is available on A32 and T32 states, it should be enabled for the CPU states( see section F2.13.11of Arm® Architecture Reference Manual for A-profile architecture)

@daverodgman daverodgman marked this pull request as draft January 30, 2024 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component-crypto Crypto primitives and low-level interfaces component-platform Portability layer and build scripts historical-reviewing Currently reviewing (for legacy PR/issues) needs-review Every commit must be reviewed by at least two team members,
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve runtime cpu feature detection
5 participants