
AVX512 support, new PR to replace old branch from contributor. We need CI to run so moving it here. #72

Draft · wants to merge 21 commits into base: main

Conversation

@JonathanHenson (Contributor) commented Jul 19, 2023

scratch space for this:

#68

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* length must be at least 256, and a multiple of 64. Based on:
*
* "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
* V. Gopal, E. Ozturk, et al., 2009, http://intel.ly/2ySEwL0
Contributor:

This citation is a broken link; do we have an updated one? There are a few copies floating around, but I think we'd prefer an authoritative source if possible.


Comment on lines 38 to 45
static zalign_8 k1k2[8] =
{0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};

static zalign_8 k3k4[8] =
{0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8};
static zalign_2 k5k6[2] = {0xf20c0dfe, 0x14cd00bd6};
static zalign_2 k7k8[2] = {0xdd45aab8, 0x000000000};
static zalign_2 poly[2] = {0x105ec76f1, 0xdea713f1};
Contributor:

These should probably be static const.
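For example (keeping the file's existing alignment typedefs):

static const zalign_8 k1k2[8] =
    {0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};
static const zalign_2 k5k6[2] = {0xf20c0dfe, 0x14cd00bd6};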

Comment on lines +79 to +82
y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
@bdonlan (Contributor) commented Jul 19, 2023:

Would it be better to move the loads up to the start of the loop, since they have potentially longer latency than clmul? (depending on whether the data is cached)

Edit: I see now this is effectively front-loading the loads from L53-L56, never mind

Contributor:

Also, was prefetching considered?
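For illustration, a minimal sketch of what software prefetch in the 256-byte loop could look like (the prefetch distance of one iteration ahead is a guess that would need tuning; input is assumed to be the byte pointer advanced by 256 per iteration):

#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* At the top of the 256-byte loop, ahead of the four 64-byte loads: */
_mm_prefetch((const char *)(input + 0x100), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x140), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x180), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x1C0), _MM_HINT_T0);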

Comment on lines 133 to 177
/*
* Fold 512-bits to 384-bits.
*/
a0 = _mm_load_si128((__m128i *)k5k6);

a1 = _mm512_extracti32x4_epi32(x1, 0);
a2 = _mm512_extracti32x4_epi32(x1, 1);

a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);

a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
* Fold 384-bits to 256-bits.
*/
a2 = _mm512_extracti32x4_epi32(x1, 2);
a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);
a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
* Fold 256-bits to 128-bits.
*/
a2 = _mm512_extracti32x4_epi32(x1, 3);
a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);
a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
* Fold 128-bits to 64-bits.
*/
a2 = _mm_clmulepi64_si128(a1, a0, 0x10);
a3 = _mm_setr_epi32(~0, 0, ~0, 0);
a1 = _mm_srli_si128(a1, 8);
a1 = _mm_xor_si128(a1, a2);

a0 = _mm_loadl_epi64((__m128i *)k7k8);
a2 = _mm_srli_si128(a1, 4);
a1 = _mm_and_si128(a1, a3);
a1 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_xor_si128(a1, a2);
@bdonlan (Contributor) commented Jul 19, 2023:

I wonder if this sequence can be improved. This sequence contends a lot on port 5 on Alder Lake architectures (which handles CLMUL as well as EXTRACTI32X4) and also has a lot of interdependent instructions. In particular, clmul and extract operations both run on port 5 with latency 3 (on Alder Lake), so avoiding serial sequences of these instructions would be preferable.

One way to reduce this latency would be to shift all of the 64-bit lanes into their final position in one go, before doing any XORs to merge them together. This means that we'd have one phase of CLMULs and XORs to shift into a series of 128-bit lanes within our 512-bit register, then a phase of further EXTRACT/XOR reduction to 128 bits before going into Barret reduction.

Sketch would look a bit like:

x0 = _mm512_load_si512((__m512i *)ksplat); // computing ksplat is left as an exercise for the
                                           // reader
// Shift all 128-bit lanes into their final positions
x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // p5, lat=3, 1/tp=1
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);

// Now merge down to 128 bits. Here, again, EXTRACT (p5) and XOR (p05) should run in parallel
// as much as possible.
a0 = _mm512_extracti32x4_epi32(x1, 0); // p5, lat=3, 1/tp=1
a1 = _mm512_extracti32x4_epi32(x1, 1);
a2 = _mm512_extracti32x4_epi32(x1, 2);
a3 = _mm512_extracti32x4_epi32(x1, 3);
a0 = _mm_xor_si128(a0, a1); // p015, lat=1, 1/tp=0.33
a0 = _mm_xor_si128(a0, a2);
a0 = _mm_xor_si128(a0, a3);

// continue with Barrett reduction

While this would need to be benchmarked, I'd expect it to have significantly lower latency due to a reduction of dependencies between high-latency instructions. The main downside is ksplat will be larger, resulting in larger cache footprint (but, we save on I-cache size, so maybe it's a wash?)

@tkanteck:

It is definitely possible to fold from 512 to 256 bits, then from 256 to 128 bits (fewer steps than above) - an example fold from 256 to 128 is here: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx512_t2/crc32_by16_vclmul_avx512.asm#L136

Here is an example of folding 8x128 bits into 1x128 bits using different constants in one step:
https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx512_t2/crc32_by16_vclmul_avx512.asm#L184C1-L184C1

Comment on lines 98 to 116
/*
* Fold into 512-bits.
*/
x0 = _mm512_load_si512((__m512i *)k3k4);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
x1 = _mm512_xor_si512(x1, x5);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x3);
x1 = _mm512_xor_si512(x1, x5);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x4);
x1 = _mm512_xor_si512(x1, x5);
@bdonlan (Contributor) commented Jul 19, 2023:

We might want to have a path that skips forward to the loop at L121 without doing this reduction for inputs smaller than 256 bytes. This should be free as we have to compare the input length anyway before the loop at L68

@tkanteck:

yep, note that initial crc value needs to be handled separately then


/* For small input, forget about alignment checks - simply compute the CRC32c one byte at a time */
if (length < (int)sizeof(slice_ptr_int_type)) {
while (length-- > 0) {
Contributor:

There are 16/32/64-bit variants of the CRC32 op that we should probably take advantage of.
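For reference, a sketch of a tail loop using all the widths (not the PR's code, just an illustration of the intrinsic variants from nmmintrin.h):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h> /* SSE4.2: _mm_crc32_u8/u16/u32/u64 */

static uint32_t crc32c_tail(uint32_t crc, const uint8_t *p, size_t len) {
    while (len >= 8) { uint64_t v; memcpy(&v, p, 8); crc = (uint32_t)_mm_crc32_u64(crc, v); p += 8; len -= 8; }
    if (len >= 4) { uint32_t v; memcpy(&v, p, 4); crc = _mm_crc32_u32(crc, v); p += 4; len -= 4; }
    if (len >= 2) { uint16_t v; memcpy(&v, p, 2); crc = _mm_crc32_u16(crc, v); p += 2; len -= 2; }
    if (len) { crc = _mm_crc32_u8(crc, *p); }
    return crc;
}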

Contributor Author:

This will use the 64-bit version on x64 and the 32-bit version on x86. That's what those typedefs and defines in the private header are for.
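For readers following along, the width selection is along these lines (a sketch of the idea only, not the actual private header; slice_ptr_int_type and crc_intrin_fn are names that appear elsewhere in this codebase):

#include <stdint.h>
#include <nmmintrin.h>

#if INTPTR_MAX == INT64_MAX /* x64: consume 8 bytes per CRC32 instruction */
typedef uint64_t slice_ptr_int_type;
#define crc_intrin_fn _mm_crc32_u64
#else /* x86: consume 4 bytes per CRC32 instruction */
typedef uint32_t slice_ptr_int_type;
#define crc_intrin_fn _mm_crc32_u32
#endif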

Contributor Author:

> This will use the 64-bit version on x64 and the 32-bit version on x86. That's what those typedefs and defines in the private header are for.

Nevermind, I was at the wrong place in the file. This branch is for tiny inputs, and it doesn’t seem worth optimizing to me.

Contributor:

I would be interested in performance for small inputs as well - but I'd agree that's more likely to be dwarfed by surrounding code for sure.

Comment on lines 121 to 131
while (length >= 64) {
x2 = _mm512_loadu_si512((__m512i *)input);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
x1 = _mm512_xor_si512(x1, x5);

input += 64;
length -= 64;
}
Contributor:

I wonder if there's any opportunity to use mask registers to perform a sub-64-byte CRC operation? This would remove the need for a byte-by-byte loop, replacing it with a _mm512_maskz_loadu_epi8 operation, an extra round of CLMUL, and some ancillary scalar logic to compute the right mask value and select the right set of shift constants from a static table.

For this to work, we'd need to prepare a table of multiplication constants for various shift lengths, which would increase our cache footprint. There's probably a tradeoff here - maybe have a table for each 64-bit increment and use the CRC32 instruction for the remaining bytes?
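A minimal sketch of the masked-load half of that idea (the per-length fold constants would still have to be derived; _mm512_maskz_loadu_epi8 is the AVX512BW zero-masking unaligned byte load):

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Load length (< 64) bytes into a zmm register, zeroing the rest. */
static __m512i load_partial_64(const uint8_t *input, size_t length) {
    __mmask64 k = (length >= 64) ? ~(__mmask64)0 : (((__mmask64)1 << length) - 1);
    return _mm512_maskz_loadu_epi8(k, input);
}
/* One more CLMUL fold, with a shift constant selected by length, would then
 * merge this partial block into the running state. */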

@tkanteck:

This solution may be possible; the loaded message has to be matched with the right constants as well. The ISA-L and ipsec-mb CRC implementations have a special code path for inputs of less than 256 bytes (with multiple cases considered there) - essentially it processes the message in 16-byte chunks.

tests/crc_test.c Outdated
Comment on lines 105 to 115

struct aws_byte_buf avx_buf;
/* enough for two avx512 runs */
aws_byte_buf_init(&avx_buf, allocator, 512);
aws_device_random_buffer(&avx_buf);

uint32_t crc = aws_checksums_crc32c_sw(avx_buf.buffer, (int)avx_buf.len, 0);
uint32_t hw_crc = aws_checksums_crc32c_hw(avx_buf.buffer, (int)avx_buf.len, 0);

aws_byte_buf_clean_up(&avx_buf);
ASSERT_UINT_EQUALS(hw_crc, crc);
Contributor:

This should test various sizes of unaligned buffers as well. I'd prefer to test every buffer size from 0 to 512+1 bytes, and have some tests that deliberately unalign the buffer, and/or put it just before a page boundary where the next page is inaccessible.
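Something like the following sweep would cover most of that (illustrative only; it assumes avx_buf is allocated a little larger than 512 + 1 bytes so every length fits at every tested offset):

for (size_t offset = 0; offset < 16; ++offset) {
    for (size_t len = 0; len <= 512 + 1; ++len) {
        uint32_t sw = aws_checksums_crc32c_sw(avx_buf.buffer + offset, (int)len, 0);
        uint32_t hw = aws_checksums_crc32c_hw(avx_buf.buffer + offset, (int)len, 0);
        ASSERT_UINT_EQUALS(sw, hw);
    }
}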

Comment on lines 101 to 106
x0 = _mm512_load_si512((__m512i *)k3k4);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
x1 = _mm512_xor_si512(x1, x5);
Contributor:

The values that start out in x1 are effectively being multiplied three times by k3k4. (Similarly, x2 is multiplied twice, x3 once). Would we be able to reduce the dependency chain here by having three premultiplied constants instead of one, and XORing at the end?

Contributor:

Note that having three premultiplied constants does increase cache usage - we could avoid this by multiplying k3k4 multiple times into each lane instead. That is:

x4 = MUL128(k3k4, k3k4) // CLMUL * 2, XOR; lat=5, 1/tp=2 (contends on port 5 for CLMUL)
x1 = MUL128(x1, k3k4)
x3 = MUL128(x3, k3k4)
x2 = MUL128(x2, x4) // waits 2 cycles for CLMUL latency
x1 = MUL128(x1, x4) // waits 2 cycles for CLMUL latency

x1 = (x2 ^ x3) ^ x1 // the first XOR can execute in parallel with the last MUL128

While this requires more multiplications (10 CLMULs vs 6), because CLMULs have a high latency, this approach might end up better utilizing the CLMUL capacity of the CPU.
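For concreteness, one reading of MUL128 in the pseudocode above (two CLMULs plus an XOR, as the inline comment says), written as a sketch against the fold pattern already used in this file:

/* Fold x by the 128-bit constant pair k: two carry-less multiplies plus an XOR. */
static inline __m512i fold_mul128(__m512i x, __m512i k) {
    __m512i lo = _mm512_clmulepi64_epi128(x, k, 0x00); /* low 64-bit halves  */
    __m512i hi = _mm512_clmulepi64_epi128(x, k, 0x11); /* high 64-bit halves */
    return _mm512_xor_si512(lo, hi);
}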

@bdonlan (Contributor) commented Jul 19, 2023:

On further thought, the premultiplication trick might not work, since we're not doing a real 128-bit multiplication operation. It's probably simpler to just load multiple constants... though, it would be interesting to see what it would cost to compute this constant using a round of Barrett reduction, and whether this could run in parallel with the memory loads going on in the first part of this function.

Comment on lines +74 to +76
if (detected_sse42 && detected_clmul) {
return aws_checksums_crc32c_sse42(input, length, crc);
}
Contributor:

Will the sse42 impl be worth invoking after we've finished processing the lead portion using the avx512 implementation?

Contributor:

Thank you for the comments. I will review them closely.

For full disclosure - I re-used the avx512 intrinsic implementation from https://chromium.googlesource.com/chromium/src/third_party/zlib/+/b890619bc2b193b8fbe9c1c053f4cd19a9791d92/crc32_simd.c

but recomputed constants for crc32c polynomial :)

Contributor Author:

@pbadari I've got the build surgery done and tests passing if you'd like to work on the avx512 comments from bdonlan@

Contributor Author:

you'll need the AVX512 branch from aws-c-common until we merge it

Contributor:

Yes. I am reviewing the avx512 comments from bdonlan

Contributor:

Jonathan, most of the review comments (from Donlan) for the AVX512 code are further performance improvements, which require a careful rewrite/prototype and performance analysis. I reached out to our area expert for his input. I am wondering if we can merge the current patch and update it further when we have new code ready? Please let me know.

@JonathanHenson (Contributor Author) commented Jul 24, 2023:

Unfortunately, AVX512 has a history of being an incredibly risky addition to an already functioning (possibly already running really hot) system. Various chipset versions introduce timing issues for side-channel attacks as well as side effects on other processes sharing the CPU. So we're going to have to run a lot of tests before we can just run this in production anyway, and we'd like the code to be structured close to the actual final algorithm before running that analysis.

This particular code runs for a lot of S3 PUT and GET operations across multiple SDKs, so any side-effects would most likely be felt across a large blast radius.

Contributor:

We are submitting an AVX-512 based implementation of crc32 to zlib-chromium as well and want to make sure that both code contributions are licensed appropriately. For now, can we withdraw the patch/submission? I will resolve the issue and re-submit.

@tkanteck left a comment:

Great work on crc performance improvements using new instructions. I am adding a few pointers and comments based on my previous experience with CRC implementations.

Comment on lines 89 to 97
x1 = _mm512_xor_si512(x1, x5);
x2 = _mm512_xor_si512(x2, x6);
x3 = _mm512_xor_si512(x3, x7);
x4 = _mm512_xor_si512(x4, x8);

x1 = _mm512_xor_si512(x1, y5);
x2 = _mm512_xor_si512(x2, y6);
x3 = _mm512_xor_si512(x3, y7);
x4 = _mm512_xor_si512(x4, y8);


The ternary logic operation code 0x96 (3-way XOR) can be used to reduce the number of XORs here (and in other places below).
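For example, the two XOR chains above could collapse to the following (a sketch; _mm512_ternarylogic_epi64 with immediate 0x96 computes a ^ b ^ c per 64-bit lane and is available with AVX512F):

/* x1 = x1 ^ x5 ^ y5, etc.: one instruction per destination register */
x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96);
x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);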


Recent compiler versions would seem to come up with some ternary logic instructions on their own. For what it is worth, today I came up with an intuitive way of defining the 0x96:

class ternary
{
  static constexpr uint8_t A= 0b11110000;
  static constexpr uint8_t B= 0b11001100;
  static constexpr uint8_t C= 0b10101010;
public:
  static constexpr uint8_t XOR3= A ^ B ^ C;
  static constexpr uint8_t XNOR3= uint8_t(~(A ^ B ^ C));
  static constexpr uint8_t XOR2_AND= (A ^ B) & C;
};

/** @return a^b^c */
static inline __m128i xor3_128(__m128i a, __m128i b, __m128i c)
{
  return _mm_ternarylogic_epi64(a, b, c, ternary::XOR3);
}

In last-millennium C, you might write something like the following:

#define TERNARY_A 0xf0
#define TERNARY_B 0xcc
#define TERNARY_C 0xaa
#define TERNARY_XOR3 (TERNARY_A ^ TERNARY_B ^ TERNARY_C)
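A matching C helper built on those macros might then look like this (a sketch; _mm_ternarylogic_epi64 needs AVX512F plus AVX512VL for the 128-bit form):

#include <immintrin.h>

/* a ^ b ^ c in one instruction, mirroring xor3_128() above */
static __m128i xor3_128_c(__m128i a, __m128i b, __m128i c)
{
    return _mm_ternarylogic_epi64(a, b, c, TERNARY_XOR3);
}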


#if defined(AWS_HAVE_AVX512_INTRINSICS) && (INTPTR_MAX == INT64_MAX)
int chunk_size = length & ~63;

if (detected_avx512 && detected_vpclmulqdq && detected_clmul) {


I'd say detected_clmul is not needed, as it will always be true if detected_vpclmulqdq is true.
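In other words, the guard could presumably shrink to something like:

#if defined(AWS_HAVE_AVX512_INTRINSICS) && (INTPTR_MAX == INT64_MAX)
    int chunk_size = length & ~63;

    /* VPCLMULQDQ implies carry-less multiply support, so detected_clmul can be dropped. */
    if (detected_avx512 && detected_vpclmulqdq) {
        /* ... existing avx512 path ... */
    }
#endif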


/*
* crc32c_avx512(): compute the crc32c of the buffer, where the buffer
* length must be at least 256, and a multiple of 64. Based on:


Forcing the length to be a multiple of 64 bytes looks too restrictive. K-masks can be used to deal with buffer lengths that are not a multiple of 64 bytes. Any reason for this condition?

@pbadari (Contributor) commented Aug 8, 2023

Addressed most of the above comments and created a new PR to merge/review changes:

#73

@JonathanHenson (Contributor Author) commented Aug 8, 2023

> Addressed most of the above comments and created a new PR to merge/review changes:
>
> #73

Internally, this PR inspired lots of conversation and resulted in some more code being contributed here for avx512 crc32 as well. If you created the PR from your own fork it won't work, because the CI needs creds from someone authorized for our AWS account to run. If so, I'll move it to a branch and apply the new additional changes on top as appropriate. Thanks for this contribution!

@pbadari (Contributor) commented Aug 8, 2023

> Addressed most of the above comments and created a new PR to merge/review changes:
> #73
>
> Internally, this PR inspired lots of conversation and resulted in some more code being contributed here for avx512 crc32 as well. If you created the PR from your own fork it won't work, because the CI needs creds from someone authorized for our AWS account to run. If so, I'll move it to a branch and apply the new additional changes on top as appropriate. Thanks for this contribution!

Changes in #73 are for the AVX512 branch.

Also, looking at the code, the crc32 version is NOT (Intel) optimized. Is it not used heavily? Do we need to provide an optimized version for it too? If so, we may want to have a common implementation for both (crc32 and crc32c), extracting the constants out. Please let me know.

@pbadari (Contributor) commented Aug 11, 2023

Jonathan, any update on the new branch with all the changes for us to review/analyze?

@JonathanHenson (Contributor Author)

> Jonathan, any update on the new branch with all the changes for us to review/analyze?

It's the next item on my list. Hopefully I'll be able to pick it back up later today

@BadariP commented Aug 28, 2023

I now have common code with avx512 support for crc32 and crc32c. I will wait for the previous changes to be merged to the branch before submitting the new code.

@BadariP commented Sep 5, 2023

Jonathan, any update?

@JonathanHenson (Contributor Author)

> Jonathan, any update?

Yes, sorry, I'm still finishing up some other high-priority work and will switch back to this as soon as I possibly can.

@BadariP commented Oct 21, 2023

Jonathan, any update?

@BadariP commented Dec 1, 2023

Hi Jonathan,

Can you please let me know next steps? Anything I can do to help?

@JonathanHenson (Contributor Author)

Sorry, just now getting back to this to try and finish up my work for the year, and I noticed you just wanted PR #73 merged. I'm sorry for the delay - I got pulled off into other work. I've merged it over to this branch.

@raviagiri commented Jan 1, 2024 via email

@pbadari (Contributor) commented Apr 16, 2024

Jonathan, please let me know what I can do to help to push this work along.

@dr-m commented May 3, 2024

This week, I finished translating some hand-written NASM code from https://github.com/intel/intel-ipsec-mb/ into C++11 and created https://github.com/dr-m/crc32_simd/. There are no alignment or buffer size restrictions in that implementation. The main loop processes 256 bytes at a time. Any leftovers from the smaller special cases (multiples of 128, 64, 32, 64, 32, 16 bytes) will be handled by a masked vmovdqu that will load 1 to 15 bytes. I made some optimizations in my translation, such as making a negation of the crc value part of the final vternlogq operation. When compiled with GCC 13.2, the "reflected" polynomial variant crc32_avx512<true> is more than 200 bytes shorter than the hand-written crc32_refl_by16_vclmul_avx512. The compiler seems to be doing a good job at fusing machine instructions.

On a system where I tested the crc32_avx512<true> today, it computed a CRC-32C on a 1-gigabyte buffer in 52 milliseconds of user CPU time. Our previous fastest x86-64 implementation of my_crc32c in https://github.com/MariaDB/server/ would use 160 milliseconds. That code is using AVX xmm registers, 64×64→128 pclmul, and SSE 4.2 crc32 instructions.

Should I create a separate pull request for this?

@pbadari (Contributor) commented May 3, 2024

> This week, I finished translating some hand-written NASM code from https://github.com/intel/intel-ipsec-mb/ into C++11 and created https://github.com/dr-m/crc32_simd/. There are no alignment or buffer size restrictions in that implementation. The main loop processes 256 bytes at a time. Any leftovers from the smaller special cases (multiples of 128, 64, 32, 64, 32, 16 bytes) will be handled by a masked vmovdqu that will load 1 to 15 bytes. I made some optimizations in my translation, such as making a negation of the crc value part of the final vternlogq operation. When compiled with GCC 13.2, the "reflected" polynomial variant crc32_avx512<true> is more than 200 bytes shorter than the hand-written crc32_refl_by16_vclmul_avx512. The compiler seems to be doing a good job at fusing machine instructions.
>
> On a system where I tested the crc32_avx512<true> today, it computed a CRC-32C on a 1-gigabyte buffer in 52 milliseconds of user CPU time. Our previous fastest x86-64 implementation of my_crc32c in https://github.com/MariaDB/server/ would use 160 milliseconds. That code is using AVX xmm registers, 64×64→128 pclmul, and SSE 4.2 crc32 instructions.
>
> Should I create a separate pull request for this?

I have not reviewed your changes, but did a quick test and ran into this.

# ./test/test_crc32
Testing AVX512+VPCLMULQDQ: CRC-32CSegmentation fault (core dumped)

@dr-m commented May 9, 2024

> I have not reviewed your changes, but did a quick test and ran into this.
>
> # ./test/test_crc32
> Testing AVX512+VPCLMULQDQ: CRC-32CSegmentation fault (core dumped)

Thank you for taking a look at it, @pbadari. Can you provide some more details, such as the CMAKE_CXX_COMPILER and CMAKE_CXX_FLAGS from the CMakeCache.txt, and a disassembly of the function around the crash? I had tested the code generated with GCC 11, 13, and a few versions of clang between 8 and 18, using two different systems. I only had time limited access to those systems. Intel SDE could be an option for further debugging.

@pbadari (Contributor) commented May 9, 2024

> I have not reviewed your changes, but did a quick test and ran into this.
>
> # ./test/test_crc32
> Testing AVX512+VPCLMULQDQ: CRC-32CSegmentation fault (core dumped)
>
> Thank you for taking a look at it, @pbadari. Can you provide some more details, such as the CMAKE_CXX_COMPILER and CMAKE_CXX_FLAGS from the CMakeCache.txt, and a disassembly of the function around the crash? I had tested the code generated with GCC 11, 13, and a few versions of clang between 8 and 18, using two different systems. I only had time limited access to those systems. Intel SDE could be an option for further debugging.

I was able to get the test working with this change:

     for (size_t i = sizeof funcs / sizeof(*funcs); i--; ) {
       fputs(funcs[i].name, stderr);
-      if (size_t s = test_buf(buf, sizeof buf, funcs[i].c[1], funcs[i].c[2])) {
+      if (size_t s = test_buf(buf, sizeof buf, funcs[i].c[0], funcs[i].c[1])) {
         fprintf(stderr, "(failed at %zu)", s);
         status = EXIT_FAILURE;
       }

I also did a quick performance test comparing your implementation against the one I am working on. Looks like perf is pretty close. The good thing about your code is that it's common code for crc32 and crc32c. Are you planning to clean it up and submit it to the aws-checksums project?

@DmitriyMusatkin (Contributor)

Unfortunately, Jonathan is no longer with the CRT team. We'll sync up internally and see if we can find someone else to push those changes along. I spoke with Jonathan briefly and he mentioned that all the latest changes are in the following 2 PRs and this PR can be closed:
#81
#79

@dr-m commented May 10, 2024

> I was able to get the test working with this change:

Thank you. I made this off-by-one mistake in dr-m/crc32_simd@ac55c31 when I implemented an option to invoke the CRC functions on input files. Apparently I forgot to rerun the test program without any arguments, on an AVX512 capable machine where that part of the code would run. I remember that I did run it with some arguments.

> I also did a quick performance test comparing your implementation against the one I am working on. Looks like perf is pretty close. The good thing about your code is that it's common code for crc32 and crc32c. Are you planning to clean it up and submit it to the aws-checksums project?

Yes, I would love to. I looked around and did not find any other SIMD optimized CRC library for C or C++ that would target multiple ISA (other than as part of a bigger project, such as the mysys/crc32 subdirectory of MariaDB Server). For example, boost::crc is SISD only. I see that this library is C, so some minor refactoring would be needed.

Do you have any use for the non-reflected polynomials? If not, the conversion should be rather trivial.

@pbadari (Contributor) commented May 10, 2024

> Unfortunately, Jonathan is no longer with the CRT team. We'll sync up internally and see if we can find someone else to push those changes along. I spoke with Jonathan briefly and he mentioned that all the latest changes are in the following 2 PRs and this PR can be closed: #81 #79

Along with those PRs, we need the following to be merged:

#88: incorrect checksum due to extra bit flip.
#89: add support for crc32 also

@pbadari (Contributor) commented May 10, 2024

> I was able to get the test working with this change:
>
> Thank you. I made this off-by-one mistake in dr-m/crc32_simd@ac55c31 when I implemented an option to invoke the CRC functions on input files. Apparently I forgot to rerun the test program without any arguments, on an AVX512 capable machine where that part of the code would run. I remember that I did run it with some arguments.
>
> I also did a quick performance test comparing your implementation against the one I am working on. Looks like perf is pretty close. The good thing about your code is that it's common code for crc32 and crc32c. Are you planning to clean it up and submit it to the aws-checksums project?
>
> Yes, I would love to. I looked around and did not find any other SIMD optimized CRC library for C or C++ that would target multiple ISA (other than as part of a bigger project, such as the mysys/crc32 subdirectory of MariaDB Server). For example, boost::crc is SISD only. I see that this library is C, so some minor refactoring would be needed.

Yes. Refactoring to C would be nice.

> Do you have any use for the non-reflected polynomials? If not, the conversion should be rather trivial.

aws-checksums polynomials are reflected. So, if you can provide only the reflected version, that would be nice.

dr-m added a commit to dr-m/aws-checksums that referenced this pull request May 11, 2024
This implementation is based on crc32_refl_by16_vclmul_avx512
in https://github.com/intel/intel-ipsec-mb/ with some optimizations.

Some of the code is based on awslabs#72.
@dr-m commented May 11, 2024

I filed #90 for my version.

dr-m added a commit to dr-m/aws-checksums that referenced this pull request May 25, 2024
This implementation is based on crc32_refl_by16_vclmul_avx512
in https://github.com/intel/intel-ipsec-mb/ with some optimizations.

Changes to CMakeLists.txt and source/intel/asm/crc32c_sse42_asm.c
are based on awslabs#72.

This also fixes a bug in aws_checksums_crc32c_hw() when 128-bit pclmul
is not available. crc_intrin_fn was being invoked on bytes instead
of 32-bit or 64-bit words. The aws-checksums-tests was extended to cover
all SIMD implementations.

Note: The availability of the Intel CRC-32C instructions is checked
as part of testing AWS_CPU_FEATURE_SSE_4_2. Both ISA extensions were
introduced in the Intel Nehalem microarchitecture.

For compiling this, https://github.com/awslabs/aws-c-common must be
installed and CMAKE_MODULE_PATH must point to it, e.g.:
cmake -DCMAKE_MODULE_PATH=/usr/local/lib/cmake.

The AWS_CPU_FEATURE_AVX512 currently only checks for AVX512F and not
other features that this implementation depends on:
AVX512VL, AVX512BW, AVX512DQ. According to
https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
there currently exist no CPUs that would support VPCLMULQDQ without
supporting all those AVX512 features.

The architecture target evex512 is something that was introduced as
mandatory in GCC 14 and clang 18 as part of introducing the AVX10.1-512
target, which basically is a new name for a number of AVX512 features.
Older compilers do not recognize this target, but they do emit EVEX
encoded instructions.
dr-m added a commit to dr-m/aws-checksums that referenced this pull request Jul 8, 2024

Successfully merging this pull request may close these issues.

s_has_vpclmulqdq() is not checking correct bit to detect VPCLMULQDQ