
aes: rework backends #442

Merged
merged 9 commits into from
Aug 7, 2024

Conversation

@newpavlov newpavlov commented Jul 31, 2024

This PR unifies code between the AES-NI and ARM backends and prepares the ground for future removal of the duplicated definitions of AES types in the autodetect, soft, ni, and armv8 modules. Additionally, it allows quickly changing the number of blocks processed in parallel by the different intrinsics-based backends, instead of hardcoding it to 8 blocks.

}
}

impl cipher::BlockBackend for &$enc_name {
@newpavlov newpavlov Jul 31, 2024

BlockBackend is implemented for references because its methods take &mut self. We should probably introduce two separate traits: BlockCipherBackend (with &self methods) and BlockModeBackend (with &mut self methods).

@newpavlov newpavlov requested a review from tarcieri July 31, 2024 17:53
@newpavlov newpavlov marked this pull request as ready for review July 31, 2024 17:54
aes/src/ni.rs Outdated
dec_name = Aes128BackDec,
key_size = consts::U16,
keys_ty = expand::Aes128RoundKeys,
par_size = consts::U15,
@newpavlov newpavlov Jul 31, 2024

Since x86 has only 16 XMM registers (AVX-512 is out of scope for now), processing 15 blocks in parallel on x86 means that every round key gets reloaded on each iteration. This maximizes ILP, but introduces additional loads from the L1 cache.

For AES-128, 192, and 256 we can process only 5, 3, and 1 blocks in parallel respectively without reloading some keys. On my laptop the sweet spot seems to be 11 blocks (~5% better than the 15-block baseline according to the crate's ECB benchmarks), but it's likely highly dependent on the CPU model. We will need additional benchmarks, including the CTR mode, to find optimal numbers. For now, I decided to use 15 blocks for cleaner assembly. I also considered using inline assembly to work around the stack spilling issue, but it's better to try that in a separate PR.

Generated assembly for AES-128 looks approximately like this: https://rust.godbolt.org/z/or5ccd5da

UPD: After measuring performance a bit more carefully using Criterion, 9 blocks produce the best result, at least on AMD CPUs. For AES-128 and AES-192, similar results are achieved with 11 and 10 blocks respectively, but since 9 blocks result in slightly smaller code, I updated the code to use it. Surprisingly, 8 blocks result in ~5-10% lower throughput.

aes/src/armv8.rs Outdated
dec_name = Aes128BackDec,
key_size = consts::U16,
keys_ty = expand::Aes128RoundKeys,
par_size = consts::U15,
@newpavlov newpavlov Jul 31, 2024

ARMv8 NEON has 32 SIMD registers, so technically we can process 21, 19, and 17 blocks in parallel for AES-128, 192, and 256 respectively while keeping all round keys in registers. But since the code forces inlining, higher parallelism also balloons binary size, so additional benchmarks are needed.

Generated assembly for AES-128 looks approximately like this: https://rust.godbolt.org/z/EWzPe47c6

@newpavlov newpavlov merged commit daac7ea into master Aug 7, 2024
25 checks passed
@newpavlov newpavlov deleted the aes_back_rework branch August 7, 2024 14:58
newpavlov added a commit to RustCrypto/traits that referenced this pull request Aug 14, 2024
This PR splits `BlockBackend` traits into 4 specific traits:
`BlockCipherEncBackend`, `BlockCipherDecBackend`, `BlockModeEncBackend`,
and `BlockModeDecBackend`, and does the same for `BlockClosure`. This allows
cipher backends to drop the awkward `&mut &backend` juggling (see
RustCrypto/block-ciphers#442), makes the code a bit
easier to read (e.g. `encrypt_blocks` instead of `proc_blocks`), and
allows for one backend type to be used for both encryption and
decryption.

The `impl_simple_block_encdec` macro is removed since we can now
implement the backend traits directly on cipher types, which should make
implementation crates slightly easier to understand.

Additionally, it moves traits to the `block` and `cipher` modules to
reduce clutter in the crate root. Later we can add docs to each module
to describe the traits in detail.