Add support for AVX2-VAES #4287
Conversation
Force-pushed from d71aa58 to 88a150b
const SIMD_8x32 K9 = SIMD_8x32::load_le128(&m_EK[4 * 9]);
const SIMD_8x32 K10 = SIMD_8x32::load_le128(&m_EK[4 * 10]);

while(blocks >= 8) {
Side note: perhaps that's a good opportunity to promote my suggestion for the BufferTransformer, which would allow formulating this multi-block-wise transformation like so:
transformer.process_blocks_of<BS*8, BS*2, BS>(overloaded{
[](std::span<const uint8_t, BS*8> in, std::span<uint8_t, BS*8> out) {
// ...
}
/* ... */
});
👼
Sorry, I still need to review that! VAES (and AES-NI, etc.) will actually be a very good test case for this, since the code should compile down to just straight-line SIMD instructions:
vaesenc %ymm13, %ymm3, %ymm3 # _413, tmp337, tmp347
vaesenc %ymm13, %ymm2, %ymm2 # _413, tmp340, tmp348
vaesenc %ymm13, %ymm1, %ymm1 # _413, tmp343, tmp349
vaesenc %ymm13, %ymm0, %ymm0 # _413, tmp346, tmp350
...
I just tried the BufferTransformer for AES-NI-128, and here are the inner loops (first for four blocks, then for a single block). Looks good to me.
a49b0: f3 44 0f 6f 1e movdqu (%rsi),%xmm11
a49b5: f3 44 0f 6f 66 10 movdqu 0x10(%rsi),%xmm12
a49bb: f3 44 0f 6f 6e 20 movdqu 0x20(%rsi),%xmm13
a49c1: f3 44 0f 6f 76 30 movdqu 0x30(%rsi),%xmm14
a49c7: 48 83 c6 40 add $0x40,%rsi
a49cb: 48 83 c1 c0 add $0xffffffffffffffc0,%rcx
a49cf: 66 44 0f ef d8 pxor %xmm0,%xmm11
a49d4: 66 44 0f ef e0 pxor %xmm0,%xmm12
a49d9: 66 44 0f ef e8 pxor %xmm0,%xmm13
a49de: 66 44 0f ef f0 pxor %xmm0,%xmm14
a49e3: 66 44 0f 38 dc d9 aesenc %xmm1,%xmm11
a49e9: 66 44 0f 38 dc e1 aesenc %xmm1,%xmm12
a49ef: 66 44 0f 38 dc e9 aesenc %xmm1,%xmm13
a49f5: 66 44 0f 38 dc f1 aesenc %xmm1,%xmm14
a49fb: 66 44 0f 38 dc da aesenc %xmm2,%xmm11
a4a01: 66 44 0f 38 dc e2 aesenc %xmm2,%xmm12
a4a07: 66 44 0f 38 dc ea aesenc %xmm2,%xmm13
a4a0d: 66 44 0f 38 dc f2 aesenc %xmm2,%xmm14
a4a13: 66 44 0f 38 dc db aesenc %xmm3,%xmm11
a4a19: 66 44 0f 38 dc e3 aesenc %xmm3,%xmm12
a4a1f: 66 44 0f 38 dc eb aesenc %xmm3,%xmm13
a4a25: 66 44 0f 38 dc f3 aesenc %xmm3,%xmm14
a4a2b: 66 44 0f 38 dc dc aesenc %xmm4,%xmm11
a4a31: 66 44 0f 38 dc e4 aesenc %xmm4,%xmm12
a4a37: 66 44 0f 38 dc ec aesenc %xmm4,%xmm13
a4a3d: 66 44 0f 38 dc f4 aesenc %xmm4,%xmm14
a4a43: 66 44 0f 38 dc dd aesenc %xmm5,%xmm11
a4a49: 66 44 0f 38 dc e5 aesenc %xmm5,%xmm12
a4a4f: 66 44 0f 38 dc ed aesenc %xmm5,%xmm13
a4a55: 66 44 0f 38 dc f5 aesenc %xmm5,%xmm14
a4a5b: 66 44 0f 38 dc de aesenc %xmm6,%xmm11
a4a61: 66 44 0f 38 dc e6 aesenc %xmm6,%xmm12
a4a67: 66 44 0f 38 dc ee aesenc %xmm6,%xmm13
a4a6d: 66 44 0f 38 dc f6 aesenc %xmm6,%xmm14
a4a73: 66 44 0f 38 dc df aesenc %xmm7,%xmm11
a4a79: 66 44 0f 38 dc e7 aesenc %xmm7,%xmm12
a4a7f: 66 44 0f 38 dc ef aesenc %xmm7,%xmm13
a4a85: 66 44 0f 38 dc f7 aesenc %xmm7,%xmm14
a4a8b: 66 45 0f 38 dc d8 aesenc %xmm8,%xmm11
a4a91: 66 45 0f 38 dc e0 aesenc %xmm8,%xmm12
a4a97: 66 45 0f 38 dc e8 aesenc %xmm8,%xmm13
a4a9d: 66 45 0f 38 dc f0 aesenc %xmm8,%xmm14
a4aa3: 66 45 0f 38 dc d9 aesenc %xmm9,%xmm11
a4aa9: 66 45 0f 38 dc e1 aesenc %xmm9,%xmm12
a4aaf: 66 45 0f 38 dc e9 aesenc %xmm9,%xmm13
a4ab5: 66 45 0f 38 dc f1 aesenc %xmm9,%xmm14
a4abb: 66 45 0f 38 dd da aesenclast %xmm10,%xmm11
a4ac1: 66 45 0f 38 dd e2 aesenclast %xmm10,%xmm12
a4ac7: 66 45 0f 38 dd ea aesenclast %xmm10,%xmm13
a4acd: 66 45 0f 38 dd f2 aesenclast %xmm10,%xmm14
a4ad3: f3 44 0f 7f 1a movdqu %xmm11,(%rdx)
a4ad8: f3 44 0f 7f 62 10 movdqu %xmm12,0x10(%rdx)
a4ade: f3 44 0f 7f 6a 20 movdqu %xmm13,0x20(%rdx)
a4ae4: f3 44 0f 7f 72 30 movdqu %xmm14,0x30(%rdx)
a4aea: 48 83 c2 40 add $0x40,%rdx
a4aee: 48 83 f9 3f cmp $0x3f,%rcx
a4af2: 0f 87 b8 fe ff ff ja a49b0 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x60>
a4af8: 48 83 f9 10 cmp $0x10,%rcx
a4afc: 72 63 jb a4b61 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x211>
a4afe: 31 c0 xor %eax,%eax
a4b00: 48 83 f9 0f cmp $0xf,%rcx
a4b04: 76 5d jbe a4b63 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x213>
a4b06: f3 44 0f 6f 1c 06 movdqu (%rsi,%rax,1),%xmm11
a4b0c: 66 44 0f ef d8 pxor %xmm0,%xmm11
a4b11: 66 44 0f 38 dc d9 aesenc %xmm1,%xmm11
a4b17: 66 44 0f 38 dc da aesenc %xmm2,%xmm11
a4b1d: 66 44 0f 38 dc db aesenc %xmm3,%xmm11
a4b23: 66 44 0f 38 dc dc aesenc %xmm4,%xmm11
a4b29: 66 44 0f 38 dc dd aesenc %xmm5,%xmm11
a4b2f: 66 44 0f 38 dc de aesenc %xmm6,%xmm11
a4b35: 66 44 0f 38 dc df aesenc %xmm7,%xmm11
a4b3b: 66 45 0f 38 dc d8 aesenc %xmm8,%xmm11
a4b41: 66 45 0f 38 dc d9 aesenc %xmm9,%xmm11
a4b47: 66 45 0f 38 dd da aesenclast %xmm10,%xmm11
a4b4d: 48 83 c1 f0 add $0xfffffffffffffff0,%rcx
a4b51: f3 44 0f 7f 1c 02 movdqu %xmm11,(%rdx,%rax,1)
a4b57: 48 83 c0 10 add $0x10,%rax
a4b5b: 48 83 f9 0f cmp $0xf,%rcx
a4b5f: 77 9f ja a4b00 <_ZNK5Botan7AES_12816hw_aes_encrypt_nEPKhPhm+0x1b0>
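The disassembly above shows the usual loop-peeling shape: a main loop that handles four independent blocks per iteration (so the CPU can pipeline the `aesenc` chains), followed by a tail loop for single blocks. A hardware-free sketch of that control flow, with a toy XOR transform standing in for the AES-NI round sequence (all names are illustrative, not Botan API):

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t BS = 16;  // AES block size

// Toy per-block transform standing in for the AES-NI round sequence.
inline void toy_block(const std::uint8_t* in, std::uint8_t* out, std::uint8_t key) {
   for(std::size_t j = 0; j != BS; ++j) {
      out[j] = static_cast<std::uint8_t>(in[j] ^ key);
   }
}

// Same control flow as the generated code: four blocks per iteration
// (independent work the CPU can overlap), then the single-block tail.
inline std::size_t encrypt_n(const std::uint8_t* in, std::uint8_t* out, std::size_t blocks, std::uint8_t key) {
   std::size_t four_block_iters = 0;
   while(blocks >= 4) {
      for(std::size_t b = 0; b != 4; ++b) {
         toy_block(in + b * BS, out + b * BS, key);
      }
      in += 4 * BS;
      out += 4 * BS;
      blocks -= 4;
      ++four_block_iters;
   }
   while(blocks > 0) {
      toy_block(in, out, key);
      in += BS;
      out += BS;
      --blocks;
   }
   return four_block_iters;  // exposed only so the sketch is checkable
}
```

For seven blocks, this runs the wide loop once (blocks 0..3) and the tail loop three times, matching the two loops visible in the objdump output.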
const SIMD_8x32 K0 = SIMD_8x32::load_le128(&m_DK[4 * 0]);
const SIMD_8x32 K1 = SIMD_8x32::load_le128(&m_DK[4 * 1]);
const SIMD_8x32 K2 = SIMD_8x32::load_le128(&m_DK[4 * 2]);
const SIMD_8x32 K3 = SIMD_8x32::load_le128(&m_DK[4 * 3]);
const SIMD_8x32 K4 = SIMD_8x32::load_le128(&m_DK[4 * 4]);
const SIMD_8x32 K5 = SIMD_8x32::load_le128(&m_DK[4 * 5]);
const SIMD_8x32 K6 = SIMD_8x32::load_le128(&m_DK[4 * 6]);
const SIMD_8x32 K7 = SIMD_8x32::load_le128(&m_DK[4 * 7]);
const SIMD_8x32 K8 = SIMD_8x32::load_le128(&m_DK[4 * 8]);
const SIMD_8x32 K9 = SIMD_8x32::load_le128(&m_DK[4 * 9]);
const SIMD_8x32 K10 = SIMD_8x32::load_le128(&m_DK[4 * 10]);
I kind of have an itch to evaluate the possibility of std::array<SIMD_8x32, 11> K = load_le<SIMD_8x32>(...) (and friends). Shouldn't be too complicated to pull off, I think.
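A portable sketch of what such an array-returning loader could look like. To stay self-contained, `uint32_t` stands in for the SIMD type, and `load_le_array` / `load_le32` are hypothetical names for illustration, not Botan's actual API:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Little-endian load of one 32-bit word (stand-in for a SIMD lane load).
inline std::uint32_t load_le32(const std::uint8_t* in) {
   return static_cast<std::uint32_t>(in[0]) | (static_cast<std::uint32_t>(in[1]) << 8) |
          (static_cast<std::uint32_t>(in[2]) << 16) | (static_cast<std::uint32_t>(in[3]) << 24);
}

// Hypothetical array-returning loader: one call fills all N elements,
// mirroring the proposed std::array<SIMD_8x32, 11> K = load_le<SIMD_8x32>(...).
template <std::size_t N>
std::array<std::uint32_t, N> load_le_array(const std::uint8_t* in) {
   std::array<std::uint32_t, N> out{};
   for(std::size_t i = 0; i != N; ++i) {
      out[i] = load_le32(in + 4 * i);
   }
   return out;
}
```

With the element count as a template parameter, the K0..K10 ladder above would collapse to a single declaration, and the compiler can still fully unroll the loads.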
This particular case is complicated because we're doing a 128-bit load and then broadcasting to the high half. It's not clear whether this is generally useful; in any case, it hasn't come up before now.
> This particular case is complicated because we're doing a 128 bit load then broadcasting to the high part.

Fair, that indeed complicates things a bit. However, in the general case this turns out to be quite easy to integrate into the global load/store logic, with the advantage that all the convenience features then just work: array-based load/store, loading into existing variables, etc., even transparent strong-type unwrapping.
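For readers unfamiliar with the operation being discussed: the load-and-broadcast corresponds to the AVX2 `vbroadcasti128` instruction (intrinsic `_mm256_broadcastsi128_si256`), which lets one 128-bit AES round key be applied to two blocks per 256-bit register. A byte-level sketch of its semantics, with no intrinsics so it runs anywhere (`load_broadcast_128` is an illustrative name):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Semantics of the round-key load used here: read 16 bytes and duplicate
// them into both 128-bit halves of a 256-bit value, so the same AES round
// key sits in each lane of the YMM register.
inline std::array<std::uint8_t, 32> load_broadcast_128(const std::uint8_t* in) {
   std::array<std::uint8_t, 32> out{};
   for(std::size_t i = 0; i != 16; ++i) {
      out[i] = in[i];       // low 128-bit lane
      out[i + 16] = in[i];  // high 128-bit lane (the broadcast)
   }
   return out;
}
```

This is why the case is awkward for a generic array loader: the element load is not a plain contiguous read but a read-plus-duplicate.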
Force-pushed from 97d931d to e42d659
On an AMD Zen 3 system, this results in a 50% performance improvement for bulk AES.