aes: rework backends #442
Conversation
}
}

impl cipher::BlockBackend for &$enc_name {
`BlockBackend` is implemented for references because its methods work with `&mut self`. We probably should introduce two separate traits: `BlockCipherBackend` (with `&self` methods) and `BlockModeBackend` (with `&mut self` methods).
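For reference, a minimal sketch of what such a split could look like (the method names and `InOut` signatures here are assumptions for illustration, not the current `cipher` API):

```rust
use cipher::{inout::InOut, Block, ParBlocks, ParBlocksSizeUser};

/// Hypothetical backend trait for stateless block ciphers: processing a
/// block does not mutate the backend, so `&self` is sufficient and the
/// trait can be implemented for the cipher type itself.
pub trait BlockCipherBackend: ParBlocksSizeUser {
    fn proc_block(&self, block: InOut<'_, '_, Block<Self>>);
    fn proc_par_blocks(&self, blocks: InOut<'_, '_, ParBlocks<Self>>);
}

/// Hypothetical backend trait for block modes (CBC, CFB, ...): processing a
/// block updates internal state such as the IV, so `&mut self` is required.
pub trait BlockModeBackend: ParBlocksSizeUser {
    fn proc_block(&mut self, block: InOut<'_, '_, Block<Self>>);
    fn proc_par_blocks(&mut self, blocks: InOut<'_, '_, ParBlocks<Self>>);
}
```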
aes/src/ni.rs (outdated diff)
dec_name = Aes128BackDec,
key_size = consts::U16,
keys_ty = expand::Aes128RoundKeys,
par_size = consts::U15,
Since x86 has only 16 XMM registers (AVX-512 is out of scope for now), processing 15 blocks in parallel on x86 means that each round key has to be reloaded on every iteration. This maximizes ILP, but introduces additional loads from the L1 cache.
For AES-128, 192, and 256 (which use 11, 13, and 15 round keys respectively) we can process only 5, 3, and 1 block in parallel without reloading any keys. On my laptop the sweet spot seems to be 11 blocks (~5% better than the 15-block baseline according to the crate's ECB benchmarks), but it's likely highly dependent on the CPU model. We will need additional benchmarks, including the CTR mode, to find the optimal numbers. For now, I decided to use 15 blocks for cleaner assembly. I also considered using inline assembly to work around the stack-spilling issue, but it's better to try that in a separate PR.
Generated assembly for AES-128 looks approximately like this: https://rust.godbolt.org/z/or5ccd5da
UPD: After measuring performance a bit more carefully with Criterion, 9 blocks produce the best result, at least on AMD CPUs. For AES-128 and AES-192 similar results are achieved with 11 and 10 blocks respectively, but since 9 blocks result in slightly smaller code, I updated the code to use 9. Surprisingly, 8 blocks result in ~5-10% lower throughput.
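For context on where this knob lives: the parallel block count surfaces through the backend's `ParBlocksSize` associated type. A simplified sketch with illustrative type names (not the exact PR code):

```rust
use cipher::{
    consts::{U16, U9},
    BlockSizeUser, ParBlocksSizeUser,
};

/// Illustrative stand-in for the AES-NI AES-128 encryption backend.
pub struct Aes128EncBackend {
    // expanded round keys would live here
}

impl BlockSizeUser for Aes128EncBackend {
    type BlockSize = U16; // AES block size: 16 bytes
}

impl ParBlocksSizeUser for Aes128EncBackend {
    // How many blocks a single parallel-processing call handles; this is
    // what the `par_size` macro parameter controls (9 after the update).
    type ParBlocksSize = U9;
}
```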
aes/src/armv8.rs (outdated diff)
dec_name = Aes128BackDec,
key_size = consts::U16,
keys_ty = expand::Aes128RoundKeys,
par_size = consts::U15,
ARMv8 NEON has 32 SIMD registers, so technically we can process 21, 19, and 17 blocks in parallel for AES-128, 192, and 256 respectively (32 registers minus the 11, 13, or 15 round keys) while keeping all round keys in registers. But since the code forces inlining, such wide parallelism also balloons the binary size, so additional benchmarks are needed.
Generated assembly for AES-128 looks approximately like this: https://rust.godbolt.org/z/EWzPe47c6
This PR splits the `BlockBackend` trait into 4 specific traits: `BlockCipherEncBackend`, `BlockCipherDecBackend`, `BlockModeEncBackend`, and `BlockModeDecBackend`. Same for `BlockClosure`. This allows cipher backends to drop the awkward `&mut &backend` juggling (see RustCrypto/block-ciphers#442), makes the code a bit easier to read (e.g. `encrypt_blocks` instead of `proc_blocks`), and allows one backend type to be used for both encryption and decryption. The `impl_simple_block_encdec` macro is removed since we can now implement the backend traits directly on cipher types, which should make implementation crates slightly easier to understand. Additionally, the traits are moved to the `block` and `cipher` modules to reduce clutter in the crate root. Later we can add docs to each module describing the traits in detail.
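A rough sketch of the shape this enables, with a toy cipher standing in for a real implementation (the trait is redefined locally here and its signature is only an approximation; see the `cipher` crate for the real definitions):

```rust
use cipher::{
    consts::{U1, U16},
    inout::InOut,
    Block, BlockSizeUser, ParBlocksSizeUser,
};

/// Local approximation of one of the four new traits; the other three
/// (`BlockCipherDecBackend`, `BlockModeEncBackend`, `BlockModeDecBackend`)
/// follow the same pattern, with the mode traits taking `&mut self`.
pub trait BlockCipherEncBackend: ParBlocksSizeUser {
    fn encrypt_block(&self, block: InOut<'_, '_, Block<Self>>);
}

/// Toy cipher used only to show that a backend trait can be implemented
/// directly on the cipher type, without `impl_simple_block_encdec`.
pub struct ToyCipher;

impl BlockSizeUser for ToyCipher {
    type BlockSize = U16;
}

impl ParBlocksSizeUser for ToyCipher {
    type ParBlocksSize = U1; // no SIMD parallelism in this toy example
}

impl BlockCipherEncBackend for ToyCipher {
    fn encrypt_block(&self, mut block: InOut<'_, '_, Block<Self>>) {
        // Placeholder "encryption": XOR every byte with a constant.
        let mut out = block.get_in().clone();
        out.iter_mut().for_each(|b| *b ^= 0xAA);
        *block.get_out() = out;
    }
}
```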
This PR unifies code between the AES-NI and ARMv8 backends and prepares the ground for a future removal of the duplicated definitions of the AES types in the `autodetect`, `soft`, `ni`, and `armv8` modules. Additionally, it allows quickly changing the number of blocks processed in parallel by the different intrinsics-based backends instead of hardcoding it to 8 blocks.
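For illustration, with the unified definitions each backend boils down to a single macro invocation where the parallelism is one field (the field names mirror the diff hunks above, while the macro name and the `enc_name` value here are placeholders, not necessarily the ones used in the PR):

```rust
// Hypothetical invocation of the shared backend-definition macro; only the
// field names are taken from the diff above, the macro name is illustrative.
define_backend_impls! {
    enc_name = Aes128BackEnc,
    dec_name = Aes128BackDec,
    key_size = consts::U16,
    keys_ty = expand::Aes128RoundKeys,
    // Blocks processed in parallel by the intrinsics backend; changing this
    // single line re-tunes parallelism (e.g. from 15 to 9 blocks).
    par_size = consts::U9,
}
```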