Add support for AVX2-VAES #4287

Merged: randombit merged 1 commit into master from jack/vaes on Aug 5, 2024

Conversation

randombit
Owner

On an AMD Zen3 system, this results in a 50% performance improvement for bulk AES.

@randombit randombit force-pushed the jack/vaes branch 3 times, most recently from d71aa58 to 88a150b Compare August 5, 2024 02:32
@coveralls

coveralls commented Aug 5, 2024

Coverage Status

coverage: 91.347% (-0.4%) from 91.722%
when pulling e42d659 on jack/vaes
into 954a758 on master.

const SIMD_8x32 K9 = SIMD_8x32::load_le128(&m_EK[4 * 9]);
const SIMD_8x32 K10 = SIMD_8x32::load_le128(&m_EK[4 * 10]);

while(blocks >= 8) {
Collaborator

Side note: perhaps this is a good opportunity to promote my suggestion for the BufferTransformer, which would allow formulating this multi-block-wise transformation like so:

transformer.process_blocks_of<BS*8, BS*2, BS>(overloaded{
  // one overload per chunk size; `out` must be writable, so it is a span of non-const bytes
  [](std::span<const uint8_t, BS*8> in, std::span<uint8_t, BS*8> out) {
    // ...
  },
  /* ... */
});

👼

Owner Author

Sorry, I still need to review that! VAES (and AES-NI, etc.) will actually be a very good test case for this, since the code should compile down to just straight-line SIMD instructions:

        vaesenc %ymm13, %ymm3, %ymm3    # _413, tmp337, tmp347
        vaesenc %ymm13, %ymm2, %ymm2    # _413, tmp340, tmp348
        vaesenc %ymm13, %ymm1, %ymm1    # _413, tmp343, tmp349
        vaesenc %ymm13, %ymm0, %ymm0    # _413, tmp346, tmp350
...

Collaborator

I just tried the BufferTransformer for AES-NI-128 and here are the inner loops (first for 4 blocks and second for a single block). Looks good to me.

   a49b0:	f3 44 0f 6f 1e       	movdqu (%rsi),%xmm11
   a49b5:	f3 44 0f 6f 66 10    	movdqu 0x10(%rsi),%xmm12
   a49bb:	f3 44 0f 6f 6e 20    	movdqu 0x20(%rsi),%xmm13
   a49c1:	f3 44 0f 6f 76 30    	movdqu 0x30(%rsi),%xmm14
   a49c7:	48 83 c6 40          	add    $0x40,%rsi
   a49cb:	48 83 c1 c0          	add    $0xffffffffffffffc0,%rcx
   a49cf:	66 44 0f ef d8       	pxor   %xmm0,%xmm11
   a49d4:	66 44 0f ef e0       	pxor   %xmm0,%xmm12
   a49d9:	66 44 0f ef e8       	pxor   %xmm0,%xmm13
   a49de:	66 44 0f ef f0       	pxor   %xmm0,%xmm14
   a49e3:	66 44 0f 38 dc d9    	aesenc %xmm1,%xmm11
   a49e9:	66 44 0f 38 dc e1    	aesenc %xmm1,%xmm12
   a49ef:	66 44 0f 38 dc e9    	aesenc %xmm1,%xmm13
   a49f5:	66 44 0f 38 dc f1    	aesenc %xmm1,%xmm14
   a49fb:	66 44 0f 38 dc da    	aesenc %xmm2,%xmm11
   a4a01:	66 44 0f 38 dc e2    	aesenc %xmm2,%xmm12
   a4a07:	66 44 0f 38 dc ea    	aesenc %xmm2,%xmm13
   a4a0d:	66 44 0f 38 dc f2    	aesenc %xmm2,%xmm14
   a4a13:	66 44 0f 38 dc db    	aesenc %xmm3,%xmm11
   a4a19:	66 44 0f 38 dc e3    	aesenc %xmm3,%xmm12
   a4a1f:	66 44 0f 38 dc eb    	aesenc %xmm3,%xmm13
   a4a25:	66 44 0f 38 dc f3    	aesenc %xmm3,%xmm14
   a4a2b:	66 44 0f 38 dc dc    	aesenc %xmm4,%xmm11
   a4a31:	66 44 0f 38 dc e4    	aesenc %xmm4,%xmm12
   a4a37:	66 44 0f 38 dc ec    	aesenc %xmm4,%xmm13
   a4a3d:	66 44 0f 38 dc f4    	aesenc %xmm4,%xmm14
   a4a43:	66 44 0f 38 dc dd    	aesenc %xmm5,%xmm11
   a4a49:	66 44 0f 38 dc e5    	aesenc %xmm5,%xmm12
   a4a4f:	66 44 0f 38 dc ed    	aesenc %xmm5,%xmm13
   a4a55:	66 44 0f 38 dc f5    	aesenc %xmm5,%xmm14
   a4a5b:	66 44 0f 38 dc de    	aesenc %xmm6,%xmm11
   a4a61:	66 44 0f 38 dc e6    	aesenc %xmm6,%xmm12
   a4a67:	66 44 0f 38 dc ee    	aesenc %xmm6,%xmm13
   a4a6d:	66 44 0f 38 dc f6    	aesenc %xmm6,%xmm14
   a4a73:	66 44 0f 38 dc df    	aesenc %xmm7,%xmm11
   a4a79:	66 44 0f 38 dc e7    	aesenc %xmm7,%xmm12
   a4a7f:	66 44 0f 38 dc ef    	aesenc %xmm7,%xmm13
   a4a85:	66 44 0f 38 dc f7    	aesenc %xmm7,%xmm14
   a4a8b:	66 45 0f 38 dc d8    	aesenc %xmm8,%xmm11
   a4a91:	66 45 0f 38 dc e0    	aesenc %xmm8,%xmm12
   a4a97:	66 45 0f 38 dc e8    	aesenc %xmm8,%xmm13
   a4a9d:	66 45 0f 38 dc f0    	aesenc %xmm8,%xmm14
   a4aa3:	66 45 0f 38 dc d9    	aesenc %xmm9,%xmm11
   a4aa9:	66 45 0f 38 dc e1    	aesenc %xmm9,%xmm12
   a4aaf:	66 45 0f 38 dc e9    	aesenc %xmm9,%xmm13
   a4ab5:	66 45 0f 38 dc f1    	aesenc %xmm9,%xmm14
   a4abb:	66 45 0f 38 dd da    	aesenclast %xmm10,%xmm11
   a4ac1:	66 45 0f 38 dd e2    	aesenclast %xmm10,%xmm12
   a4ac7:	66 45 0f 38 dd ea    	aesenclast %xmm10,%xmm13
   a4acd:	66 45 0f 38 dd f2    	aesenclast %xmm10,%xmm14
   a4ad3:	f3 44 0f 7f 1a       	movdqu %xmm11,(%rdx)
   a4ad8:	f3 44 0f 7f 62 10    	movdqu %xmm12,0x10(%rdx)
   a4ade:	f3 44 0f 7f 6a 20    	movdqu %xmm13,0x20(%rdx)
   a4ae4:	f3 44 0f 7f 72 30    	movdqu %xmm14,0x30(%rdx)
   a4aea:	48 83 c2 40          	add    $0x40,%rdx
   a4aee:	48 83 f9 3f          	cmp    $0x3f,%rcx
   a4af2:	0f 87 b8 fe ff ff    	ja     a49b0 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x60>
   a4af8:	48 83 f9 10          	cmp    $0x10,%rcx
   a4afc:	72 63                	jb     a4b61 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x211>
   a4afe:	31 c0                	xor    %eax,%eax
   a4b00:	48 83 f9 0f          	cmp    $0xf,%rcx
   a4b04:	76 5d                	jbe    a4b63 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x213>
   a4b06:	f3 44 0f 6f 1c 06    	movdqu (%rsi,%rax,1),%xmm11
   a4b0c:	66 44 0f ef d8       	pxor   %xmm0,%xmm11
   a4b11:	66 44 0f 38 dc d9    	aesenc %xmm1,%xmm11
   a4b17:	66 44 0f 38 dc da    	aesenc %xmm2,%xmm11
   a4b1d:	66 44 0f 38 dc db    	aesenc %xmm3,%xmm11
   a4b23:	66 44 0f 38 dc dc    	aesenc %xmm4,%xmm11
   a4b29:	66 44 0f 38 dc dd    	aesenc %xmm5,%xmm11
   a4b2f:	66 44 0f 38 dc de    	aesenc %xmm6,%xmm11
   a4b35:	66 44 0f 38 dc df    	aesenc %xmm7,%xmm11
   a4b3b:	66 45 0f 38 dc d8    	aesenc %xmm8,%xmm11
   a4b41:	66 45 0f 38 dc d9    	aesenc %xmm9,%xmm11
   a4b47:	66 45 0f 38 dd da    	aesenclast %xmm10,%xmm11
   a4b4d:	48 83 c1 f0          	add    $0xfffffffffffffff0,%rcx
   a4b51:	f3 44 0f 7f 1c 02    	movdqu %xmm11,(%rdx,%rax,1)
   a4b57:	48 83 c0 10          	add    $0x10,%rax
   a4b5b:	48 83 f9 0f          	cmp    $0xf,%rcx
   a4b5f:	77 9f                	ja     a4b00 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x1b0>

Comment on lines +159 to +173
const SIMD_8x32 K0 = SIMD_8x32::load_le128(&m_DK[4 * 0]);
const SIMD_8x32 K1 = SIMD_8x32::load_le128(&m_DK[4 * 1]);
const SIMD_8x32 K2 = SIMD_8x32::load_le128(&m_DK[4 * 2]);
const SIMD_8x32 K3 = SIMD_8x32::load_le128(&m_DK[4 * 3]);
const SIMD_8x32 K4 = SIMD_8x32::load_le128(&m_DK[4 * 4]);
const SIMD_8x32 K5 = SIMD_8x32::load_le128(&m_DK[4 * 5]);
const SIMD_8x32 K6 = SIMD_8x32::load_le128(&m_DK[4 * 6]);
const SIMD_8x32 K7 = SIMD_8x32::load_le128(&m_DK[4 * 7]);
const SIMD_8x32 K8 = SIMD_8x32::load_le128(&m_DK[4 * 8]);
const SIMD_8x32 K9 = SIMD_8x32::load_le128(&m_DK[4 * 9]);
const SIMD_8x32 K10 = SIMD_8x32::load_le128(&m_DK[4 * 10]);
Collaborator

I kind of have an itch to evaluate the possibility of std::array<SIMD_8x32, 11> K = load_le<SIMD_8x32>(...) (and friends). Shouldn't be too complicated to pull that off, I think.

Owner Author

This particular case is complicated because we're doing a 128-bit load and then broadcasting it to the high part. It's not clear whether this is generally useful; in any case, it hasn't come up before now.
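
To make the "load 128 bits, then broadcast" step concrete, here is a portable scalar model of it. This is an assumption-laden sketch, not Botan's `SIMD_8x32::load_le128`: the `Vec8x32` type and the function name are invented for illustration, and byte-order handling is left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for a 256-bit register holding eight 32-bit lanes.
// Hypothetical type for illustration; this is not Botan's SIMD_8x32.
struct Vec8x32 {
   std::array<uint32_t, 8> lanes;
};

// Load four 32-bit words (one 128-bit AES round key) and duplicate them into
// both 128-bit halves of the 256-bit value, so the same round key applies to
// the two AES blocks packed side by side in one register.
// With AVX2 intrinsics this would typically correspond to
// _mm256_broadcastsi128_si256(_mm_loadu_si128(...)).
inline Vec8x32 load_128_and_broadcast(const uint32_t* words) {
   Vec8x32 v{};
   for(size_t i = 0; i != 4; ++i) {
      v.lanes[i] = words[i];      // low 128 bits
      v.lanes[i + 4] = words[i];  // high 128 bits: copy of the low half
   }
   return v;
}
```

The point of the model is that the source is only half as wide as the destination, which is why this load does not fit a generic "read `sizeof(V)` bytes per element" scheme without special handling.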

Collaborator

> This particular case is complicated because we're doing a 128 bit load then broadcasting to the high part.

Fair, that indeed complicates things a bit. However, in the general case this turns out to be quite easy to integrate into the global load/store logic, with the advantage that all the convenience features then just work: array-based load/store, loading into existing variables, and even transparent strong-type unwrapping.

@randombit randombit force-pushed the jack/vaes branch 2 times, most recently from 97d931d to e42d659 Compare August 5, 2024 11:35
@randombit randombit merged commit 13edb92 into master Aug 5, 2024
40 checks passed
@randombit randombit deleted the jack/vaes branch August 5, 2024 12:21
3 participants