Add support for AVX2-VAES #4287

randombit · 2024-08-05T02:23:37Z

On an AMD Zen3 system, this results in a 50% performance improvement for bulk AES.

coveralls · 2024-08-05T03:09:45Z

coverage: 91.347% (-0.4%) from 91.722%
when pulling e42d659 on jack/vaes
into 954a758 on master.

reneme · 2024-08-05T06:54:53Z

src/lib/block/aes/aes_vaes/aes_vaes.cpp

+   const SIMD_8x32 K9 = SIMD_8x32::load_le128(&m_EK[4 * 9]);
+   const SIMD_8x32 K10 = SIMD_8x32::load_le128(&m_EK[4 * 10]);
+
+   while(blocks >= 8) {


Side note: Perhaps this is a good opportunity to promote my suggestion for the BufferTransformer, which would allow formulating this multi-block-wise transformation like so:

   transformer.process_blocks_of<BS*8, BS*2, BS>(overloaded{
      [](std::span<const uint8_t, BS*8> in, std::span<uint8_t, BS*8> out) {
         // ...
      }
      /* ... */
   });

👼

Sorry, I still need to review that! VAES (and AES-NI, etc.) will actually be a very good test case for this, since the code should compile down to just straight-line SIMD instructions:

   vaesenc %ymm13, %ymm3, %ymm3    # _413, tmp337, tmp347
   vaesenc %ymm13, %ymm2, %ymm2    # _413, tmp340, tmp348
   vaesenc %ymm13, %ymm1, %ymm1    # _413, tmp343, tmp349
   vaesenc %ymm13, %ymm0, %ymm0    # _413, tmp346, tmp350
   ...

I just tried the BufferTransformer for AES-NI-128 and here are the inner loops (first for 4 blocks and second for a single block). Looks good to me.

   a49b0:  f3 44 0f 6f 1e      movdqu (%rsi),%xmm11
   a49b5:  f3 44 0f 6f 66 10   movdqu 0x10(%rsi),%xmm12
   a49bb:  f3 44 0f 6f 6e 20   movdqu 0x20(%rsi),%xmm13
   a49c1:  f3 44 0f 6f 76 30   movdqu 0x30(%rsi),%xmm14
   a49c7:  48 83 c6 40         add $0x40,%rsi
   a49cb:  48 83 c1 c0         add $0xffffffffffffffc0,%rcx
   a49cf:  66 44 0f ef d8      pxor %xmm0,%xmm11
   a49d4:  66 44 0f ef e0      pxor %xmm0,%xmm12
   a49d9:  66 44 0f ef e8      pxor %xmm0,%xmm13
   a49de:  66 44 0f ef f0      pxor %xmm0,%xmm14
   a49e3:  66 44 0f 38 dc d9   aesenc %xmm1,%xmm11
   a49e9:  66 44 0f 38 dc e1   aesenc %xmm1,%xmm12
   a49ef:  66 44 0f 38 dc e9   aesenc %xmm1,%xmm13
   a49f5:  66 44 0f 38 dc f1   aesenc %xmm1,%xmm14
   a49fb:  66 44 0f 38 dc da   aesenc %xmm2,%xmm11
   a4a01:  66 44 0f 38 dc e2   aesenc %xmm2,%xmm12
   a4a07:  66 44 0f 38 dc ea   aesenc %xmm2,%xmm13
   a4a0d:  66 44 0f 38 dc f2   aesenc %xmm2,%xmm14
   a4a13:  66 44 0f 38 dc db   aesenc %xmm3,%xmm11
   a4a19:  66 44 0f 38 dc e3   aesenc %xmm3,%xmm12
   a4a1f:  66 44 0f 38 dc eb   aesenc %xmm3,%xmm13
   a4a25:  66 44 0f 38 dc f3   aesenc %xmm3,%xmm14
   a4a2b:  66 44 0f 38 dc dc   aesenc %xmm4,%xmm11
   a4a31:  66 44 0f 38 dc e4   aesenc %xmm4,%xmm12
   a4a37:  66 44 0f 38 dc ec   aesenc %xmm4,%xmm13
   a4a3d:  66 44 0f 38 dc f4   aesenc %xmm4,%xmm14
   a4a43:  66 44 0f 38 dc dd   aesenc %xmm5,%xmm11
   a4a49:  66 44 0f 38 dc e5   aesenc %xmm5,%xmm12
   a4a4f:  66 44 0f 38 dc ed   aesenc %xmm5,%xmm13
   a4a55:  66 44 0f 38 dc f5   aesenc %xmm5,%xmm14
   a4a5b:  66 44 0f 38 dc de   aesenc %xmm6,%xmm11
   a4a61:  66 44 0f 38 dc e6   aesenc %xmm6,%xmm12
   a4a67:  66 44 0f 38 dc ee   aesenc %xmm6,%xmm13
   a4a6d:  66 44 0f 38 dc f6   aesenc %xmm6,%xmm14
   a4a73:  66 44 0f 38 dc df   aesenc %xmm7,%xmm11
   a4a79:  66 44 0f 38 dc e7   aesenc %xmm7,%xmm12
   a4a7f:  66 44 0f 38 dc ef   aesenc %xmm7,%xmm13
   a4a85:  66 44 0f 38 dc f7   aesenc %xmm7,%xmm14
   a4a8b:  66 45 0f 38 dc d8   aesenc %xmm8,%xmm11
   a4a91:  66 45 0f 38 dc e0   aesenc %xmm8,%xmm12
   a4a97:  66 45 0f 38 dc e8   aesenc %xmm8,%xmm13
   a4a9d:  66 45 0f 38 dc f0   aesenc %xmm8,%xmm14
   a4aa3:  66 45 0f 38 dc d9   aesenc %xmm9,%xmm11
   a4aa9:  66 45 0f 38 dc e1   aesenc %xmm9,%xmm12
   a4aaf:  66 45 0f 38 dc e9   aesenc %xmm9,%xmm13
   a4ab5:  66 45 0f 38 dc f1   aesenc %xmm9,%xmm14
   a4abb:  66 45 0f 38 dd da   aesenclast %xmm10,%xmm11
   a4ac1:  66 45 0f 38 dd e2   aesenclast %xmm10,%xmm12
   a4ac7:  66 45 0f 38 dd ea   aesenclast %xmm10,%xmm13
   a4acd:  66 45 0f 38 dd f2   aesenclast %xmm10,%xmm14
   a4ad3:  f3 44 0f 7f 1a      movdqu %xmm11,(%rdx)
   a4ad8:  f3 44 0f 7f 62 10   movdqu %xmm12,0x10(%rdx)
   a4ade:  f3 44 0f 7f 6a 20   movdqu %xmm13,0x20(%rdx)
   a4ae4:  f3 44 0f 7f 72 30   movdqu %xmm14,0x30(%rdx)
   a4aea:  48 83 c2 40         add $0x40,%rdx
   a4aee:  48 83 f9 3f         cmp $0x3f,%rcx
   a4af2:  0f 87 b8 fe ff ff   ja a49b0 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x60>

   a4af8:  48 83 f9 10         cmp $0x10,%rcx
   a4afc:  72 63               jb a4b61 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x211>
   a4afe:  31 c0               xor %eax,%eax
   a4b00:  48 83 f9 0f         cmp $0xf,%rcx
   a4b04:  76 5d               jbe a4b63 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x213>
   a4b06:  f3 44 0f 6f 1c 06   movdqu (%rsi,%rax,1),%xmm11
   a4b0c:  66 44 0f ef d8      pxor %xmm0,%xmm11
   a4b11:  66 44 0f 38 dc d9   aesenc %xmm1,%xmm11
   a4b17:  66 44 0f 38 dc da   aesenc %xmm2,%xmm11
   a4b1d:  66 44 0f 38 dc db   aesenc %xmm3,%xmm11
   a4b23:  66 44 0f 38 dc dc   aesenc %xmm4,%xmm11
   a4b29:  66 44 0f 38 dc dd   aesenc %xmm5,%xmm11
   a4b2f:  66 44 0f 38 dc de   aesenc %xmm6,%xmm11
   a4b35:  66 44 0f 38 dc df   aesenc %xmm7,%xmm11
   a4b3b:  66 45 0f 38 dc d8   aesenc %xmm8,%xmm11
   a4b41:  66 45 0f 38 dc d9   aesenc %xmm9,%xmm11
   a4b47:  66 45 0f 38 dd da   aesenclast %xmm10,%xmm11
   a4b4d:  48 83 c1 f0         add $0xfffffffffffffff0,%rcx
   a4b51:  f3 44 0f 7f 1c 02   movdqu %xmm11,(%rdx,%rax,1)
   a4b57:  48 83 c0 10         add $0x10,%rax
   a4b5b:  48 83 f9 0f         cmp $0xf,%rcx
   a4b5f:  77 9f               ja a4b00 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x1b0>

reneme · 2024-08-05T07:01:21Z

src/lib/block/aes/aes_vaes/aes_vaes.cpp

+   const SIMD_8x32 K0 = SIMD_8x32::load_le128(&m_DK[4 * 0]);
+   const SIMD_8x32 K1 = SIMD_8x32::load_le128(&m_DK[4 * 1]);
+   const SIMD_8x32 K2 = SIMD_8x32::load_le128(&m_DK[4 * 2]);
+   const SIMD_8x32 K3 = SIMD_8x32::load_le128(&m_DK[4 * 3]);
+   const SIMD_8x32 K4 = SIMD_8x32::load_le128(&m_DK[4 * 4]);
+   const SIMD_8x32 K5 = SIMD_8x32::load_le128(&m_DK[4 * 5]);
+   const SIMD_8x32 K6 = SIMD_8x32::load_le128(&m_DK[4 * 6]);
+   const SIMD_8x32 K7 = SIMD_8x32::load_le128(&m_DK[4 * 7]);
+   const SIMD_8x32 K8 = SIMD_8x32::load_le128(&m_DK[4 * 8]);
+   const SIMD_8x32 K9 = SIMD_8x32::load_le128(&m_DK[4 * 9]);
+   const SIMD_8x32 K10 = SIMD_8x32::load_le128(&m_DK[4 * 10]);


I kind of have an itch to evaluate the possibility of std::array<SIMD_8x32, 11> K = load_le<SIMD_8x32>(...) (and friends). Shouldn't be too complicated to pull that off, I think.

This particular case is complicated because we're doing a 128-bit load then broadcasting to the high part. It's not clear whether this is generally useful; in any case, it hasn't come up before now.

> This particular case is complicated because we're doing a 128-bit load then broadcasting to the high part.

Fair, that indeed complicates things a bit. However, in the general case this turns out to be quite easy to integrate into the global load/store logic, with the advantage that all the convenience features then just work: array-based load/store, loading into existing variables, and even transparent strong-type unwrapping.


randombit force-pushed the jack/vaes branch 3 times, most recently from d71aa58 to 88a150b Compare August 5, 2024 02:32

reneme approved these changes Aug 5, 2024


randombit force-pushed the jack/vaes branch 2 times, most recently from 97d931d to e42d659 Compare August 5, 2024 11:35

Add support for AVX2-VAES

e42d659

On an AMD Zen3 system, results in 50% performance improvement for bulk AES.

randombit merged commit 13edb92 into master Aug 5, 2024
40 checks passed

randombit deleted the jack/vaes branch August 5, 2024 12:21

reneme mentioned this pull request Aug 5, 2024

Chore: BufferTransformer and Blowfish Refactor #4151

Draft
