
Add support for using VAES instructions for NI parallel operations. #396

Closed · wants to merge 5 commits

Conversation

silvanshade
Contributor

This PR adds support for using VAES intrinsics in the ni backend for the 8-fold parallel AES operations.

The change shows a nice speedup, at least on Zen4 CPUs.

Benchmarks (Ryzen 7950x):

RUSTFLAGS="-C target-cpu=native" cargo bench:

running 15 tests
test aes128_decrypt_block  ... bench:       1,043 ns/iter (+/- 83) = 15708 MB/s
test aes128_decrypt_blocks ... bench:         944 ns/iter (+/- 4) = 17355 MB/s
test aes128_encrypt_block  ... bench:       1,042 ns/iter (+/- 2) = 15723 MB/s
test aes128_encrypt_blocks ... bench:         944 ns/iter (+/- 23) = 17355 MB/s
test aes128_new            ... bench:           9 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,299 ns/iter (+/- 5) = 12612 MB/s
test aes192_decrypt_blocks ... bench:       1,142 ns/iter (+/- 78) = 14346 MB/s
test aes192_encrypt_block  ... bench:       1,300 ns/iter (+/- 2) = 12603 MB/s
test aes192_encrypt_blocks ... bench:       1,142 ns/iter (+/- 24) = 14346 MB/s
test aes192_new            ... bench:          10 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,622 ns/iter (+/- 18) = 10101 MB/s
test aes256_decrypt_blocks ... bench:       1,330 ns/iter (+/- 84) = 12318 MB/s
test aes256_encrypt_block  ... bench:       1,622 ns/iter (+/- 5) = 10101 MB/s
test aes256_encrypt_blocks ... bench:       1,330 ns/iter (+/- 86) = 12318 MB/s
test aes256_new            ... bench:          12 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 13.07s

RUSTFLAGS="-C target-cpu=native" cargo bench --features vaes:

running 15 tests
test aes128_decrypt_block  ... bench:       1,040 ns/iter (+/- 19) = 15753 MB/s
test aes128_decrypt_blocks ... bench:         464 ns/iter (+/- 7) = 35310 MB/s
test aes128_encrypt_block  ... bench:       1,039 ns/iter (+/- 25) = 15769 MB/s
test aes128_encrypt_blocks ... bench:         464 ns/iter (+/- 7) = 35310 MB/s
test aes128_new            ... bench:           9 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,300 ns/iter (+/- 49) = 12603 MB/s
test aes192_decrypt_blocks ... bench:         556 ns/iter (+/- 7) = 29467 MB/s
test aes192_encrypt_block  ... bench:       1,295 ns/iter (+/- 28) = 12651 MB/s
test aes192_encrypt_blocks ... bench:         557 ns/iter (+/- 8) = 29414 MB/s
test aes192_new            ... bench:          10 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,619 ns/iter (+/- 58) = 10119 MB/s
test aes256_decrypt_blocks ... bench:         650 ns/iter (+/- 7) = 25206 MB/s
test aes256_encrypt_block  ... bench:       1,616 ns/iter (+/- 33) = 10138 MB/s
test aes256_encrypt_blocks ... bench:         649 ns/iter (+/- 6) = 25244 MB/s
test aes256_new            ... bench:          12 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 15.20s

I experimented with changing ParBlocksSize to 32 and unrolling the loop further for the VAES case to see if it made a difference, but at least on Zen4 it didn't seem to matter.

One thing I noticed is that it is quite important that target-cpu is set correctly; otherwise, performance can be bad:

cargo bench --features vaes:

running 15 tests
test aes128_decrypt_block  ... bench:       1,308 ns/iter (+/- 44) = 12525 MB/s
test aes128_decrypt_blocks ... bench:      18,713 ns/iter (+/- 694) = 875 MB/s
test aes128_encrypt_block  ... bench:       1,340 ns/iter (+/- 10) = 12226 MB/s
test aes128_encrypt_blocks ... bench:      18,676 ns/iter (+/- 569) = 877 MB/s
test aes128_new            ... bench:          26 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,530 ns/iter (+/- 16) = 10708 MB/s
test aes192_decrypt_blocks ... bench:      21,871 ns/iter (+/- 754) = 749 MB/s
test aes192_encrypt_block  ... bench:       1,531 ns/iter (+/- 11) = 10701 MB/s
test aes192_encrypt_blocks ... bench:      22,029 ns/iter (+/- 736) = 743 MB/s
test aes192_new            ... bench:          30 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,777 ns/iter (+/- 67) = 9220 MB/s
test aes256_decrypt_blocks ... bench:      25,237 ns/iter (+/- 1,035) = 649 MB/s
test aes256_encrypt_block  ... bench:       1,741 ns/iter (+/- 53) = 9410 MB/s
test aes256_encrypt_blocks ... bench:      25,090 ns/iter (+/- 1,023) = 653 MB/s
test aes256_new            ... bench:          79 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 53.70s

Regarding adding the vaes feature to Cargo.toml rather than using cpufeatures: I couldn't figure out a way to structure the addition of this functionality cleanly otherwise.

This is partly because some of the instructions are gated behind stdsimd.

Also, as noted in another thread on the Rust forums, there isn't really a way to handle the negation of a target_feature case, so it would be difficult to override the selection of the usual ni 8-fold operations with the vaes versions.

But if anyone has suggestions on how to structure this better I'd be happy to make those changes.

@silvanshade changed the title from "Add support for using VAES instructions NI parallel operations." to "Add support for using VAES instructions for NI parallel operations." Jan 7, 2024
@tarcieri
Member

tarcieri commented Jan 8, 2024

Regarding adding the vaes feature to Cargo.toml rather than using cpufeatures: I couldn't figure out a way to structure the addition of this functionality cleanly otherwise.

@silvanshade I'd definitely recommend trying to get cpufeatures working. The latest v0.2.12 release just added support for detecting VAES.

Structurally it'd look pretty much like what you have, but you'd have both the ni and vaes modules linked on x86-ish targets.

You'd need to add detection for VAES, and a branch to use it if available.
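For illustration, a rough sketch of what that detection could look like with cpufeatures (the module and function names here are made up):

// Illustrative: runtime VAES detection with the cpufeatures crate.
cpufeatures::new!(vaes_cpuid, "vaes");

fn vaes_available() -> bool {
    // init() runs the CPUID check once and caches it; get() is a cheap read.
    vaes_cpuid::init().get()
}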

If that's not something you're particularly interested in, we can work with this and @newpavlov or myself can complete it.

Either way, thanks!

@silvanshade
Contributor Author

@silvanshade I'd definitely recommend trying to get cpufeatures working. The latest v0.2.12 release just added support for detecting VAES.

Yeah, I had created the PR that added that, since I was originally going to try to use cpufeatures for this.

You'd need to add detection for VAES, and a branch to use it if available.

I can create a branch like this. That's not the main difficulty, as I understand it.

Rather, the VAES features aren't usable without also enabling stdsimd and avx512_target_feature, which I added to the top of lib.rs.

I could change it so that there is a --cfg vaes rather than the cargo feature, so it works more like aes_armv8 or something, but there still needs to be something to conditionally enable those features; otherwise aes won't even compile on stable anymore.

I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES intrinsics like what was done for the ARM backend.

What do you think?

@tarcieri
Member

tarcieri commented Jan 9, 2024

I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES intrinsics like what was done for the ARM backend.

That sounds great!

@newpavlov
Member

newpavlov left a comment

I think it will be better to write a separate implementation in the vaes module instead of piggybacking on the ni module. It's also probably worth increasing the number of blocks processed in parallel for the VAES backend. Right now you call only two aesdec/aesenc functions per round, thus potentially losing additional ILP-based throughput (the instructions have a latency of 3 cycles and a throughput of 1 cycle). Additionally, with AVX-512 you have 32 ZMM registers, so there is less register pressure.

@silvanshade
Contributor Author

I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES intrinsics like what was done for the ARM backend.

That sounds great!

So it is possible to write local versions of these intrinsics, e.g.,

use core::arch::{asm, x86_64::__m512i};

#[inline]
#[target_feature(enable = "avx512f")]
pub(super) unsafe fn pf_mm512_aesdec_epi128(data: __m512i, round_key: __m512i) -> __m512i {
    let result: __m512i;
    asm!(
        "vaesdec {result}, {data}, {round_key}",
        data = in(zmm_reg) data,
        round_key = in(zmm_reg) round_key,
        result = out(zmm_reg) result,
        options(pure, nomem, nostack, preserves_flags)
    );
    result
}

But what I didn't realize is that it's still necessary to have #[target_feature(enable = "avx512f")], which in turn requires the unstable #![feature(avx512_target_feature)] in order to use the AVX-512 registers. So I think at minimum we would probably need to add a "nightly" feature or something to gate these on; we can't quite get away with just autodetection and local versions of the intrinsics.
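For reference, a sketch of what that gate might look like at the top of lib.rs (the feature name "nightly" is just a placeholder):

// Hypothetical "nightly" cargo feature gating the unstable compiler features:
#![cfg_attr(feature = "nightly", feature(stdsimd, avx512_target_feature))]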

If that seems reasonable, I will add that feature ("nightly", or "unstable", or whatever else you'd prefer to call it) and then implement the autodetection for vaes.

@silvanshade
Contributor Author

silvanshade commented Jan 9, 2024

I think it will be better to write a separate implementation in the vaes module instead of piggybacking on the ni module. It's also probably worth increasing the number of blocks processed in parallel for the VAES backend. Right now you call only two aesdec/aesenc functions per round, thus potentially losing additional ILP-based throughput (the instructions have a latency of 3 cycles and a throughput of 1 cycle). Additionally, with AVX-512 you have 32 ZMM registers, so there is less register pressure.

Initially I was planning to do that, but the reason I opted not to is that, for the single-block case, we still basically want to fall back to the NI implementation, don't we? Maybe for maintenance or structural reasons it would be cleaner to just duplicate that code, though?

I did try increasing the parallel blocks to 32 (calling 8 of the respective instructions) but didn't notice a performance difference in the benchmarks here, although I only have the one system to test on. But I agree it probably makes sense in general, and especially for a separate backend.

@newpavlov
Member

for the single-block case, we still basically want to fall back to the NI implementation, don't we? Maybe for maintenance or structural reasons it would be cleaner to just duplicate that code, though?

Yes, and the key expansion code will be the same. But I think the parallel processing function definitely should live in the vaes module. That will allow us to change the number of blocks processed in parallel, and will remove the unnecessary key broadcasts and the casting between __m512i and __m256i.

So I think we should define separate backends (i.e. structs which implement the BlockBackend trait) which use functions from the ni module for single-block processing and key expansion.

@tarcieri
Member

tarcieri commented Jan 9, 2024

But what I didn't realize is that it's still necessary to have #[target_feature(enable = "avx512f")], which in turn requires the unstable #![feature(avx512_target_feature)] in order to use the AVX-512 registers.

@silvanshade aah yes, that's unfortunate. I've run into similar issues in the past, and the only way I solved it was opening a stabilization PR for the relevant target features (in the past I did manage to get one merged), although offhand I'm not sure what the blockers are.

Not seeing much discussion here either: rust-lang/rust#44839

@silvanshade
Contributor Author

silvanshade commented Feb 6, 2024

@tarcieri @newpavlov I’ve updated the PR and tried to address prior feedback.

There’s also a companion PR for an action here

This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt. We still have to do the broadcasting at least once, but now we can limit that to just the key schedule functions and avoid the additional overhead for parallel encrypt/decrypt.

One thing I did not address is trying to merge the VAES backend into the autodetect framework.

The reason for this is basically that, although we can dynamically select the algorithm at runtime using CPU features, we are still (with the current type structure) limited to types fixed at compile time.

This problem specifically concerns the round-key representation. For instance, if we wanted a backend which dynamically selected between AES-NI and VAES, we would have to compromise on either using __m128i for the round keys (and broadcasting to __m512i frequently), or using __m512i for the round keys (and casting to __m128i).

Both are problematic. Going from __m128i to __m512i is inefficient. Going the other way, from __m512i to __m128i, could potentially be viable, except that __m512i isn't currently available in stable Rust. So going that route would force even the AES-NI backend to require nightly.

Given that, I thought it would be best to just keep the backends separate for the time being.

In order to use the VAES backend, target-feature=+vaes must be specified, and a nightly toolchain must be used. There are more details in the comments I added.
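For example, an invocation along these lines (the exact flags depend on the setup):

RUSTFLAGS="-C target-cpu=native -C target-feature=+vaes" cargo +nightly bench --features vaes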

Also, I increased the block size for VAES to 64. Going from 32 to 64 doesn't seem to make any difference on my system, but then neither did going from 8 to 32. Potentially it could make a difference somewhere, though. I suspect the reason a difference isn't noticeable is that the compiler is probably already doing a decent job of unrolling the loops, at least for these tests.

@silvanshade requested a review from newpavlov February 6, 2024 22:03
@newpavlov
Member

newpavlov commented Feb 7, 2024

This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt.

I don't think it's worth storing the broadcasted keys as part of the Aes* states. It would be better to store them only in the backends, i.e. instead of a reference to an Aes* state they could store a broadcasted copy of the round keys.

although we can dynamically select the algorithm at runtime using CPU features, we are still (with the current type structure) limited to types fixed at compile time.

I think the only way to work around this is, instead of using polyfills, to implement the encrypt/decrypt functions as one asm! block with explicitly named registers (you can use macros to reduce the amount of boilerplate). This way the code will have no mention of the __m512i type, which requires the unstable target features, but it does mean we have to do round-key broadcasts on each encrypt/decrypt call. It's an unfortunate cost, but I think we can live with it until the AVX-512 intrinsics get stabilized. I think (but am not 100% sure) that clobber_abi("C") should properly handle clobbering of the ZMM registers.
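A minimal sketch of that shape (a hypothetical function, unrolled for AES-128 and processing only one ZMM register's worth of blocks; a real backend would process many more blocks in parallel, and the caller must have verified VAES + AVX-512F support at runtime):

use core::arch::asm;

// Encrypt four AES-128 blocks (64 bytes) in place, broadcasting each
// 16-byte round key on the fly. No __m512i appears in the signature, so
// this compiles on stable Rust for x86_64. `keys` must point to the 11
// contiguous 16-byte round keys; `blocks` must point to 64 writable bytes.
#[cfg(target_arch = "x86_64")]
unsafe fn vaes512_encrypt4_aes128(keys: *const u8, blocks: *mut u8) {
    asm!(
        "vmovdqu64 zmm1, [{blocks}]",
        // round 0: whitening
        "vbroadcasti32x4 zmm0, [{keys} + 0 * 16]",
        "vpxorq zmm1, zmm1, zmm0",
        // rounds 1-9
        "vbroadcasti32x4 zmm0, [{keys} + 1 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        "vbroadcasti32x4 zmm0, [{keys} + 2 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        "vbroadcasti32x4 zmm0, [{keys} + 3 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        "vbroadcasti32x4 zmm0, [{keys} + 4 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        "vbroadcasti32x4 zmm0, [{keys} + 5 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        "vbroadcasti32x4 zmm0, [{keys} + 6 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        "vbroadcasti32x4 zmm0, [{keys} + 7 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        "vbroadcasti32x4 zmm0, [{keys} + 8 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        "vbroadcasti32x4 zmm0, [{keys} + 9 * 16]",
        "vaesenc zmm1, zmm1, zmm0",
        // final round
        "vbroadcasti32x4 zmm0, [{keys} + 10 * 16]",
        "vaesenclast zmm1, zmm1, zmm0",
        "vmovdqu64 [{blocks}], zmm1",
        keys = in(reg) keys,
        blocks = in(reg) blocks,
        out("zmm0") _,
        out("zmm1") _,
        options(nostack, preserves_flags),
    );
}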

@silvanshade
Contributor Author

This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt.

I don't think it's worth storing the broadcasted keys as part of the Aes* states. It would be better to store them only in the backends, i.e. instead of a reference to an Aes* state they could store a broadcasted copy of the round keys.

Can you elaborate on this? I'm not entirely sure I understand what this would look like or why this would be beneficial.

Is the idea that the non-broadcasted round keys would still be available for proc_block, and only proc_par_blocks would use the broadcasted round keys?

@newpavlov
Member

newpavlov commented Feb 7, 2024

The main reason is that it would quadruple the size of the Aes* states. Even worse, with autodetection enabled it would affect targets without AVX-512 (remember that we use a union in this case). And since the broadcasted keys are simple copies, it feels quite wasteful.

Is the idea that the non-broadcasted round keys would still be available for proc_block, and only proc_par_blocks would use the broadcasted round keys?

Yes. Instead of this:

struct $name_enc {
    round_keys: [__m512i; $rounds],
}

struct $name_back_enc<'a>(&'a $name_enc);

It would be better to write this:

struct $name_enc {
    round_keys: [__m128i; $rounds],
}

struct $name_back_enc<'a> {
    // Owned copy of broadcasted round keys
    k1: [__m512i; $rounds],
    // References $name_enc
    k2: &'a [__m128i; $rounds],
}

During parallel block processing the broadcasted round keys are likely to stay in registers and may not even be spilled to the stack (assuming you use an appropriate value for ParBlocksSize).

@silvanshade
Contributor Author

silvanshade commented Feb 8, 2024

Okay, I tried refactoring how you suggested.

Initially, the results were a little surprising, because the single block case was suddenly far slower than before the refactoring.

Before splitting the key representation:

running 15 tests
test aes128_decrypt_block  ... bench:       1,333 ns/iter (+/- 28) = 12291 MB/s
test aes128_decrypt_blocks ... bench:         474 ns/iter (+/- 31) = 34565 MB/s
test aes128_encrypt_block  ... bench:       1,310 ns/iter (+/- 32) = 12506 MB/s
test aes128_encrypt_blocks ... bench:         474 ns/iter (+/- 7) = 34565 MB/s
test aes128_new            ... bench:          36 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,514 ns/iter (+/- 38) = 10821 MB/s
test aes192_decrypt_blocks ... bench:         567 ns/iter (+/- 7) = 28895 MB/s
test aes192_encrypt_block  ... bench:       1,510 ns/iter (+/- 53) = 10850 MB/s
test aes192_encrypt_blocks ... bench:         566 ns/iter (+/- 8) = 28946 MB/s
test aes192_new            ... bench:          38 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,713 ns/iter (+/- 44) = 9564 MB/s
test aes256_decrypt_blocks ... bench:         659 ns/iter (+/- 7) = 24861 MB/s
test aes256_encrypt_block  ... bench:       1,724 ns/iter (+/- 25) = 9503 MB/s
test aes256_encrypt_blocks ... bench:         656 ns/iter (+/- 10) = 24975 MB/s
test aes256_new            ... bench:          48 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 20.03s

After splitting the key representation:

running 15 tests
test aes128_decrypt_block  ... bench:       6,409 ns/iter (+/- 71) = 2556 MB/s
test aes128_decrypt_blocks ... bench:         475 ns/iter (+/- 70) = 34492 MB/s
test aes128_encrypt_block  ... bench:       6,379 ns/iter (+/- 59) = 2568 MB/s
test aes128_encrypt_blocks ... bench:         471 ns/iter (+/- 5) = 34785 MB/s
test aes128_new            ... bench:           9 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       8,038 ns/iter (+/- 140) = 2038 MB/s
test aes192_decrypt_blocks ... bench:         567 ns/iter (+/- 34) = 28895 MB/s
test aes192_encrypt_block  ... bench:       7,988 ns/iter (+/- 73) = 2051 MB/s
test aes192_encrypt_blocks ... bench:         564 ns/iter (+/- 8) = 29049 MB/s
test aes192_new            ... bench:          11 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       9,397 ns/iter (+/- 281) = 1743 MB/s
test aes256_decrypt_blocks ... bench:         661 ns/iter (+/- 32) = 24786 MB/s
test aes256_encrypt_block  ... bench:       9,407 ns/iter (+/- 267) = 1741 MB/s
test aes256_encrypt_blocks ... bench:         657 ns/iter (+/- 24) = 24937 MB/s
test aes256_new            ... bench:          12 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 11.98s

The only thing that really changed was that I moved the call to .map on the [__m128i; N] array from the expand_key and inv_expanded_keys functions to the get_enc_backend and get_dec_backend functions. But the original array is also kept around now (as a copy), whereas before it was consumed.

This made me suspect maybe the compiler was able to optimize the previous case better.

I tried switching to a slightly different representation where the parallel keys are lazily initialized, only if proc_par_blocks is called, and indeed that seemed to restore the original performance:

running 15 tests
test aes128_decrypt_block  ... bench:       1,257 ns/iter (+/- 31) = 13034 MB/s
test aes128_decrypt_blocks ... bench:         472 ns/iter (+/- 19) = 34711 MB/s
test aes128_encrypt_block  ... bench:       1,254 ns/iter (+/- 58) = 13065 MB/s
test aes128_encrypt_blocks ... bench:         482 ns/iter (+/- 25) = 33991 MB/s
test aes128_new            ... bench:           9 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,443 ns/iter (+/- 44) = 11354 MB/s
test aes192_decrypt_blocks ... bench:         580 ns/iter (+/- 30) = 28248 MB/s
test aes192_encrypt_block  ... bench:       1,444 ns/iter (+/- 32) = 11346 MB/s
test aes192_encrypt_blocks ... bench:         590 ns/iter (+/- 37) = 27769 MB/s
test aes192_new            ... bench:          11 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,671 ns/iter (+/- 46) = 9804 MB/s
test aes256_decrypt_blocks ... bench:         673 ns/iter (+/- 15) = 24344 MB/s
test aes256_encrypt_block  ... bench:       1,673 ns/iter (+/- 37) = 9793 MB/s
test aes256_encrypt_blocks ... bench:         686 ns/iter (+/- 15) = 23883 MB/s
test aes256_new            ... bench:          13 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 28.50s

aes/src/lib.rs (comment on lines 161 to 164):
any(target_arch = "x86", target_arch = "x86_64"),
not(aes_force_soft)
))] {
mod x86_64;
@tarcieri
Member

tarcieri commented Feb 8, 2024

The module is gated on target_arch = "x86" in addition to x86_64 so perhaps the module name should just be x86?

That said I'm not sure we've actually successfully tested using AES-NI via a 32-bit build

Contributor Author

Yeah, good question, I wasn't sure about that either.

The reason I opted for x86_64, though, is that I was under the impression the AES-NI extension was only available on x86_64-architecture CPUs.

But from a brief Google search it seems it should be possible to get AES-NI to work for 32-bit targets (based on some old Intel sample code).

Member

I believe it should still work when targeting those CPUs with a 32-bit binary, even if they're natively 64-bit

Contributor Author

Right.

I think I’d opt to keep it as x86_64 since that’s the ISA where it’s available (even if running 32-bit target binaries), and also because 32-bit is slowly disappearing pretty much everywhere anyway.

But I don’t feel particularly strongly about it.

Would you like me to change it?

Member

I think it's a bit weird to say something is x86_64 if it still works on a 32-bit target

Member

Could you check whether a 32-bit binary which uses VAES instructions works or not?

Contributor Author

Could you check whether a 32-bit binary which uses VAES instructions works or not?

I just tested and it does work.

I used the following .cargo/config.toml:

[build]
target = "i686-unknown-linux-gnu"

[target.i686-unknown-linux-gnu]
linker = "clang-17"
rustflags = [
  "-C", "link-arg=-fuse-ld=lld-17",
  "-C", "link-arg=--target=i686-unknown-linux-gnu",
  "-C", "target-feature=+aes,+sse3,+vaes",
]

This is with nightly-2024-07-02 on Ubuntu Mantic 23.10 using Debian multiarch configuration (with i386 architecture added).

I also renamed the module to x86.

Contributor Author

I should note that I can't seem to actually test the 32-bit VAES target on CI, though, since I can't quite figure out how to get Intel SDE to run those 32-bit binaries on the 64-bit host. It looks like it should be possible for it to run them, but everything I've tried has resulted in it refusing to run them or just crashing (before getting to the tests).

@newpavlov
Member

newpavlov commented Feb 8, 2024

The only thing that really changed was that I moved the call to .map on the [__m128i; N] array from the expand_key and inv_expanded_keys functions to the get_enc_backend and get_dec_backend functions. But the original array is also kept around now (as a copy), whereas before it was consumed.

The compiler can also sometimes have difficulty optimizing array::map-based code, especially with large arrays, because the method carries additional code to be safe against potential panics in the passed closure.
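A sketch of one possible workaround (the helper name here is made up, and it assumes AVX2 for the broadcast intrinsic): build the array by index with core::array::from_fn instead of a consuming map:

use core::arch::x86_64::{__m128i, __m256i, _mm256_broadcastsi128_si256};

// Broadcast each 128-bit round key to 256 bits without array::map.
#[target_feature(enable = "avx2")]
unsafe fn broadcast_keys<const N: usize>(keys: &[__m128i; N]) -> [__m256i; N] {
    core::array::from_fn(|i| _mm256_broadcastsi128_si256(keys[i]))
}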

@silvanshade
Contributor Author

silvanshade commented Feb 11, 2024

I've added VAES support to the autodetection.

This required refactoring some parts of the autodetection code to handle things in a cleaner way.

In order to handle VAES on stable I resorted to using inline asm!. One downside to this approach is that the asm! macro complains that zmm registers are not usable on i686 targets (despite the fact that it's possible to compile for that with the intrinsics).

To work around that, I just feature-gated the VAES backend to only work on x86_64 targets. I doubt anyone will ever actually want to compile with this feature for 32-bit targets anyway, but in the future it could be made possible by switching back to the intrinsics.

I didn't change the hazmat code to include VAES since I wasn't sure if you want to increase the block size there or not.

I think this addresses basically all of the feedback now?

@silvanshade silvanshade requested a review from tarcieri February 11, 2024 01:34
@silvanshade
Contributor Author

silvanshade commented Feb 13, 2024

Two more small changes:

  1. I also added a VAES backend for AVX (256-bit), since it's possible (on future Intel CPUs) to have a scenario where VAES is available but AVX-512 is not.
  2. I added a --cfg disable_avx512 check which will force the 256-bit VAES backend even if AVX-512 is available. This may be useful in scenarios where downclocking from AVX-512 is an issue. It's also useful for benchmarking (see the example invocation below).
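For example, forcing the 256-bit backend on an AVX-512 machine would presumably look something like:

RUSTFLAGS="--cfg disable_avx512 -C target-cpu=native" cargo bench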

@silvanshade
Contributor Author

@tarcieri @newpavlov Do you intend to merge this?

@tarcieri
Member

I'd generally be in favor but it's definitely a large PR. Sorry it's gone by the wayside. I will hopefully have time to review soon. Also curious to know what @newpavlov thinks.

@silvanshade
Contributor Author

Thanks.

I would like to resume working on the RISC-V and ARMv9 PRs (especially the latter will be relevant soon, since Apple Silicon M4 is ARMv9 with SVE2/SME), but I'd prefer to see how this one lands before putting a lot more effort into those.

@newpavlov
Member

newpavlov left a comment

Sorry for the late review!

aes/src/x86/vaes256/aes128.rs
"vmovdqu ymm14, [{iptr} + 13 * 32]",
"vmovdqu ymm15, [{iptr} + 14 * 32]",
// aes-128 round 0 encrypt
"vmovdqu ymm0 , [{simd_256_keys} + 0 * 32]",
Member

Ideally, we would keep the round keys in registers (and adjust the number of blocks processed in parallel accordingly). We can probably do that in this case, since __m256i is available on stable.

pub(crate) unsafe fn parallelize_keys(keys: &RoundKeys<11>) -> Simd256RoundKeys<11> {
let mut v256: [MaybeUninit<__m256i>; 11] = MaybeUninit::uninit().assume_init();
asm! {
"vbroadcastf128 ymm0 , [{keys} + 0 * 16]",
Member

Have you tried using vbroadcastf128 instead of vmovdqu ymm0, [{simd_256_keys} + i * 32] in the encrypt/decrypt functions? If the performance is comparable, I would prefer to use the former, since it does not spill round keys onto the stack.

Contributor Author

In the original version I was using broadcasts (in Rust IIRC, but essentially the same) in the encrypt/decrypt functions.

But in the discussion (starting back from here) I refactored the code to store the broadcasts since I understood you wanted to eliminate those as unnecessary operations.

Neither approach made a difference to performance in the included benchmarks.

Are you saying you would prefer to go back to the broadcast on demand approach?

@newpavlov
Member

newpavlov commented May 31, 2024

I suggest pre-broadcasting the round keys for vaes256 and doing the broadcasts before each round for vaes512.

With vaes256 you can keep the broadcasted keys in the backend states as [__m256i; N] and pass them into the inline assembly blocks as values. Hopefully the compiler will be able to eliminate stack spilling and the broadcasted round keys will stay strictly in registers (but we would need to inspect the generated assembly to check).

Member

If the above proposal is too complex for your liking, then you can broadcast round keys before each round for vaes256 as well.

@silvanshade
Contributor Author

silvanshade commented Jun 1, 2024

I suggest pre-broadcasting the round keys for vaes256 and doing the broadcasts before each round for vaes512.

With vaes256 you can keep the broadcasted keys in the backend states as [__m256i; N] and pass them into the inline assembly blocks as values. Hopefully the compiler will be able to eliminate stack spilling and the broadcasted round keys will stay strictly in registers (but we would need to inspect the generated assembly to check).

I'm not sure this makes sense, because for the VAES 256 case there aren't enough registers to hold all the keys while also processing the data (which is why I did the interleaving).

Note also that the keys are already stored like this in the backends:

type RoundKeys<const ROUNDS: usize> = [__m128i; ROUNDS];
#[cfg(target_arch = "x86_64")]
type Simd256RoundKeys<const ROUNDS: usize> = [__m256i; ROUNDS];
#[cfg(target_arch = "x86_64")]
type Simd512RoundKeys<const ROUNDS: usize> = [__m512i; ROUNDS];

They are also only populated on-demand for the first VAES parallel block processing call (which eliminates the overhead in case encrypt/decrypt is only used for a single block):

#[cfg(target_arch = "x86_64")]
impl<'a> BlockBackend for $name_backend::Vaes512<'a, self::$name_backend::mode::Encrypt> {
    #[inline]
    fn proc_block(&mut self, block: InOut<'_, '_, Block>) {
        unsafe {
            self::ni::$module::encrypt1(self.keys, block);
        }
    }
    #[inline]
    fn proc_par_blocks(&mut self, blocks: InOut<'_, '_, Block64>) {
        unsafe {
            let simd_512_keys = self.simd_512_keys.get_or_insert_with(|| {
                self::vaes512::$module::parallelize_keys(&self.keys)
            });
            self::vaes512::$module::encrypt64(simd_512_keys, blocks);
        }
    }
}

I believe this latter point is relevant to whether we should pre-broadcast vs. broadcast before each round: the way it is now does eliminate the overhead of creating all this data (as measured here).

I can still check whether or not interleaving the loads for VAES 512 (and increasing block count) makes a difference. I suspect it won't on my system, since most of the modifications I've made to the algorithm seem to have no noticeable impact, but it could be better in theory for other systems perhaps.

Member

I am afraid you misunderstood me regarding the implementation of the backends. I am a bit short on time right now, so I will try to explain myself better in the following days.

I'm not sure this makes sense, because for the VAES 256 case there aren't enough registers to hold all the keys while also processing the data (which is why I did the interleaving).

There are enough registers for AES-128: we have 16 registers and 11 round keys, so we can use the remaining 5 to process 10 blocks in parallel.

For AES-192 (13 round keys) we could keep everything in registers, but at the cost of processing only 6 blocks in parallel. We could also load+broadcast the round keys on each round and process 30 blocks in parallel. And, of course, we could have solutions in between (e.g. process 16 blocks in parallel and keep 7 round keys in registers). Processing fewer blocks in parallel means less exploitation of ILP, but processing more blocks requires extra loads. The round keys are likely to be in L1 cache, but that's still several cycles. Then again, superscalar processors have their own bag of tricks to deal with such situations... To summarize: if there is no measurable performance difference, we should probably prefer the "prettier" code.

Finally, for AES-256 (15 round keys) we have no choice but to load+broadcast the round keys on each round.
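To put the register arithmetic above in one place (16 ymm registers, two 128-bit blocks per ymm, one register reserved for the key in the load+broadcast case):

Variant   Round keys   Keys resident in registers          Load+broadcast per round
AES-128   11           (16 - 11) * 2 = 10 blocks           15 * 2 = 30 blocks
AES-192   13           (16 - 13) * 2 = 6 blocks            15 * 2 = 30 blocks
AES-256   15           (16 - 15) * 2 = 2 blocks (too few)  15 * 2 = 30 blocks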

I can still check whether or not interleaving the loads for VAES 512 (and increasing block count) makes a difference.

It's likely your bottleneck here is memory throughput, which is why such minor changes don't make any difference.

@silvanshade
Contributor Author

silvanshade commented Jun 1, 2024

I am afraid you misunderstood me regarding the implementation of the backends. I am a bit short on time right now, so I will try to explain myself better in the following days.

I just refactored the backends to support separate block sizes, as you requested.

Now they look like this:

// Backend structures
mod $name_backend {
    use super::*;
    pub(crate) mod mode {
        pub(crate) struct Encrypt;
        pub(crate) struct Decrypt;
    }
    #[derive(Clone)]
    pub(crate) struct Ni<'a, Mode> {
        pub(crate) mode: core::marker::PhantomData<Mode>,
        pub(crate) keys: &'a RoundKeys<$rounds>,
    }
    #[derive(Clone)]
    #[cfg(target_arch = "x86_64")]
    pub(crate) struct Vaes256<'a, Mode> {
        pub(crate) mode: core::marker::PhantomData<Mode>,
        pub(crate) keys: &'a RoundKeys<$rounds>,
        pub(crate) simd_256_keys: Option<Simd256RoundKeys<$rounds>>,
    }
    #[cfg(target_arch = "x86_64")]
    pub(crate) struct Vaes512<'a, Mode> {
        pub(crate) mode: core::marker::PhantomData<Mode>,
        pub(crate) keys: &'a RoundKeys<$rounds>,
        pub(crate) simd_512_keys: Option<Simd512RoundKeys<$rounds>>,
    }
}

// For dispatching on the correct backend
#[derive(Clone)]
enum Backend {
    Ni,
    #[cfg(target_arch = "x86_64")]
    Vaes256,
    #[cfg(target_arch = "x86_64")]
    Vaes512,
}

// For detecting which backend to select
#[derive(Clone)]
struct Features {
    #[cfg(target_arch = "x86_64")]
    avx: self::features::avx::InitToken,
    #[cfg(target_arch = "x86_64")]
    avx512f: self::features::avx512f::InitToken,
    #[cfg(target_arch = "x86_64")]
    vaes: self::features::vaes::InitToken,
}

impl Features {
    fn new() -> Self {
        Self {
            #[cfg(target_arch = "x86_64")]
            avx: self::features::avx::init(),
            #[cfg(target_arch = "x86_64")]
            avx512f: self::features::avx512f::init(),
            #[cfg(target_arch = "x86_64")]
            vaes: self::features::vaes::init(),
        }
    }

    fn backend(&self) -> Backend {
        #[allow(unused_mut)]
        let mut backend = Backend::Ni;
        // Check Vaes256 first so that the Vaes512 branch takes
        // precedence when AVX-512 is available (and not disabled).
        #[cfg(target_arch = "x86_64")]
        if !cfg!(disable_avx256) && self.avx.get() && self.vaes.get() {
            backend = self::Backend::Vaes256;
        }
        #[cfg(target_arch = "x86_64")]
        if !cfg!(disable_avx512) && self.avx512f.get() && self.vaes.get() {
            backend = self::Backend::Vaes512;
        }
        backend
    }
}

#[doc=$doc]
#[doc = "block cipher (decrypt-only)"]
#[derive(Clone)]
pub struct $name_dec {
    round_keys: RoundKeys<$rounds>,
    features: Features,
}

impl BlockCipherDecrypt for $name_dec {
    #[inline]
    fn decrypt_with_backend(&self, f: impl BlockClosure<BlockSize = U16>) {
        let mode = core::marker::PhantomData::<self::$name_backend::mode::Decrypt>;
        let keys = &self.round_keys;
        match self.features.backend() {
            self::Backend::Ni => f.call(&mut $name_backend::Ni { mode, keys }),
            #[cfg(target_arch = "x86_64")]
            self::Backend::Vaes256 => f.call(&mut $name_backend::Vaes256 {
                mode,
                keys,
                simd_256_keys: None,
            }),
            #[cfg(target_arch = "x86_64")]
            self::Backend::Vaes512 => f.call(&mut $name_backend::Vaes512 {
                mode,
                keys,
                simd_512_keys: None,
            }),
        }
    }
}

#[cfg(target_arch = "x86_64")]
impl<'a> BlockBackend for $name_backend::Vaes512<'a, self::$name_backend::mode::Decrypt> {
    #[inline]
    fn proc_block(&mut self, block: InOut<'_, '_, Block>) {
        unsafe {
            self::ni::$module::decrypt1(self.keys, block);
        }
    }
    #[inline]
    fn proc_par_blocks(&mut self, blocks: InOut<'_, '_, Block64>) {
        unsafe {
            let simd_512_keys = self.simd_512_keys.get_or_insert_with(|| {
                self::vaes512::$module::parallelize_keys(&self.keys)
            });
            self::vaes512::$module::decrypt64(simd_512_keys, blocks);
        }
    }
}

The reason I didn't do this originally was that I was trying to avoid an explosion of complexity in autodetect, since it's not really organized in a way that works well with fine-grained feature selection within specific architectures (without adding a bunch of noisy conditionals everywhere).

I realized this was going to be an issue once I started working on the RISC-V and ARMv9 backends which also have multiple features to detect.

What I did to work around this was to remove the get_{enc,dec}_backend functions from the architecture modules, since exposing them to autodetect made it impossible to dispatch on different backends within the architecture module (since the type distinctions would escape), at least unless the return type for those methods was changed into a trait object.

This allowed me to refactor the x86 module to handle finer-grained backend dispatching without polluting autodetect further.

If there's further misunderstanding, you'll have to be more specific.

Contributor Author

There are enough registers for AES-128: we have 16 registers and 11 round keys, so we can use the remaining 5 to process 10 blocks in parallel.

For AES-192 (13 round keys) we could keep everything in registers, but at the cost of processing only 6 blocks in parallel. We could also load+broadcast the round keys on each round and process 30 blocks in parallel. And, of course, we could have solutions in between (e.g. process 16 blocks in parallel and keep 7 round keys in registers). Processing fewer blocks in parallel means less exploitation of ILP, but processing more blocks requires extra loads. The round keys are likely to be in L1 cache, but that's still several cycles. Then again, superscalar processors have their own bag of tricks to deal with such situations... To summarize: if there is no measurable performance difference, we should probably prefer the "prettier" code.

Okay I see what you mean.

I will experiment with trying to keep the keys in register.

In agreement with your last point, though, I'm hesitant about making the implementation more complex than it already is, especially since I can't measure most of these differences (it would perhaps help to see the benchmarks run on Intel, since I am only testing on Zen4).

let (iptr, optr) = blocks.into_raw();
asm! {
// load keys
"vmovdqu32 zmm0 , [{keys} + 0 * 64]",
Member

Since we reload the keys on each call either way, it may be worth increasing the number of blocks processed in parallel to 124 and interleaving the loads like you did for vaes256.

pub(crate) unsafe fn parallelize_keys(keys: &RoundKeys<15>) -> Simd512RoundKeys<15> {
let mut v512: [MaybeUninit<__m512i>; 15] = MaybeUninit::uninit().assume_init();
asm! {
"vbroadcasti32x4 zmm0 , [{keys} + 0 * 16]",
Member

Ditto.

@silvanshade
Contributor Author

I've made several changes since the recent feedback:

  • I completely redesigned the backend code to use distinct types rather than coalescing them for simplicity. As noted in some discussion above, this required modifying part of the interface between autodetect and the architecture modules.
  • I changed the clobbers to be explicit.
  • For VAES 256, I moved the loads/stores out of the inline asm!, which allowed adding pure, readonly for those blocks (see the sketch below).
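To illustrate that last point, a minimal sketch (a hypothetical function, not the PR's exact code; the PR's version reads round keys from memory, hence readonly, whereas this register-only variant can even use nomem):

use core::arch::{asm, x86_64::__m256i};

// One AES round on two blocks packed into a ymm register. The loads and
// stores happen in Rust, so the asm! block is a pure register transform.
// Caller must ensure the CPU supports VAES.
unsafe fn aesenc2(blocks: __m256i, round_key: __m256i) -> __m256i {
    let out: __m256i;
    asm!(
        "vaesenc {out}, {blocks}, {round_key}",
        blocks = in(ymm_reg) blocks,
        round_key = in(ymm_reg) round_key,
        out = out(ymm_reg) out,
        options(pure, nomem, nostack, preserves_flags),
    );
    out
}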

At this point I would actually prefer not to focus much more on refactoring the algorithms (re: experimenting with block counts, broadcasting, etc).

I've put quite a lot of time into this PR already and the performance gains are pretty reasonable I think. There's always room in the future for more fine-tuning.

I'm still willing to address remaining design issues though.

@newpavlov
Member

@silvanshade
Yes, I think it's better to experiment with minor modifications in separate PRs. I will try to fully review the code this week (likely during the weekend) and will probably merge it after that (I can fix minor nits myself if needed).

@silvanshade
Contributor Author

@tarcieri @newpavlov Any updates on this?

@silvanshade requested a review from newpavlov June 10, 2024 23:58
@tarcieri
Member

tarcieri commented Jun 13, 2024

@silvanshade why did you close this? It seemed pretty close to complete.

@silvanshade
Contributor Author

I closed it because I still haven't gotten a thorough review and discussion about the implementation, even though I've repeatedly addressed all of the smaller feedback to the best of my ability.

From my perspective, there is no real evidence that this PR is "close to complete".

I thought it was basically complete months ago and asked for feedback then, and waited, and nothing happened.

I realize that maintainers are often very busy with other things, but I think it should have been possible by now to get a more concrete idea about whether this is ever likely to be merged and, if not, what the blockers are.

The last substantive exchange with @newpavlov suggested I fundamentally misunderstood something about the implementation, and that was never clarified.

So I just don't think it's a good use of time to continue.

If you think otherwise, what would you suggest?

@tarcieri
Member

tarcieri commented Jun 13, 2024

@newpavlov's last comment, as of two weeks ago, was:

I will try to fully review the code this week (likely during the weekend) and will probably merge it after that (I can fix minor nits myself if needed).

It sounds like he wanted to just do one final pass before merging.

@silvanshade can you please reopen and we can get this merged?

@silvanshade
Contributor Author

@silvanshade can you please reopen and we can get this merged?

I think it would be more productive to re-open it if or when there's a final review.

@newpavlov
Member

Sorry for the delay! I couldn't find enough time during the previous weekend, so I will try again on this one.

Closing the PR makes it less visible and increases the chances of forgetting about it, so I will reopen it.

@newpavlov reopened this Jun 14, 2024
@silvanshade deleted the features/vaes branch June 16, 2024 07:29
@silvanshade
Contributor Author

@newpavlov Thanks for the update. Unfortunately I've deleted the branch and no longer wish to contribute to this project.
