AVX512 support, new PR to replace old branch from contributor. We need CI to run so moving it here. #72
base: main
Conversation
Avx512 support
Add sse42 avx512_intrinsics support
…, turn it off once we know what's supported where.
 * length must be at least 256, and a multiple of 64. Based on:
 *
 * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
 * V. Gopal, E. Ozturk, et al., 2009, http://intel.ly/2ySEwL0
This citation is a broken link, do we have an updated one? There are a few copies floating around, but I think we'd prefer an authoritative source if possible.
Authoritative source for the paper: http://download.intel.com/design/intarch/papers/323102.pdf
static zalign_8 k1k2[8] =
    {0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};

static zalign_8 k3k4[8] =
    {0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8, 0x740eef02, 0x9e4addf8};
static zalign_2 k5k6[2] = {0xf20c0dfe, 0x14cd00bd6};
static zalign_2 k7k8[2] = {0xdd45aab8, 0x000000000};
static zalign_2 poly[2] = {0x105ec76f1, 0xdea713f1};
These should probably be static const.
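A minimal sketch of that suggestion, applied to the first constant from the quoted diff:
/* const-qualified so the folding constants can live in read-only storage */
static const zalign_8 k1k2[8] =
    {0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86, 0xdcb17aa4, 0xb9e02b86};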
y5 = _mm512_loadu_si512((__m512i *)(input + 0x00));
y6 = _mm512_loadu_si512((__m512i *)(input + 0x40));
y7 = _mm512_loadu_si512((__m512i *)(input + 0x80));
y8 = _mm512_loadu_si512((__m512i *)(input + 0xC0));
Would it be better to move the loads up to the start of the loop, since they have potentially longer latency than clmul? (depending on whether the data is cached)
Edit: I see now this is effectively front-loading the loads from L53-L56, never mind
Also, was prefetching considered?
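For reference, a hedged sketch of what software prefetching ahead of those loads could look like (the 0x200 prefetch distance is a guess that would need tuning, not a measured value):
/* Prefetch a couple of 256-byte iterations ahead of the loads above. */
_mm_prefetch((const char *)(input + 0x200), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x240), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x280), _MM_HINT_T0);
_mm_prefetch((const char *)(input + 0x2C0), _MM_HINT_T0);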
/*
 * Fold 512-bits to 384-bits.
 */
a0 = _mm_load_si128((__m128i *)k5k6);

a1 = _mm512_extracti32x4_epi32(x1, 0);
a2 = _mm512_extracti32x4_epi32(x1, 1);

a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);

a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
 * Fold 384-bits to 256-bits.
 */
a2 = _mm512_extracti32x4_epi32(x1, 2);
a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);
a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
 * Fold 256-bits to 128-bits.
 */
a2 = _mm512_extracti32x4_epi32(x1, 3);
a3 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_clmulepi64_si128(a1, a0, 0x11);
a1 = _mm_xor_si128(a1, a3);
a1 = _mm_xor_si128(a1, a2);

/*
 * Fold 128-bits to 64-bits.
 */
a2 = _mm_clmulepi64_si128(a1, a0, 0x10);
a3 = _mm_setr_epi32(~0, 0, ~0, 0);
a1 = _mm_srli_si128(a1, 8);
a1 = _mm_xor_si128(a1, a2);

a0 = _mm_loadl_epi64((__m128i *)k7k8);
a2 = _mm_srli_si128(a1, 4);
a1 = _mm_and_si128(a1, a3);
a1 = _mm_clmulepi64_si128(a1, a0, 0x00);
a1 = _mm_xor_si128(a1, a2);
I wonder if this sequence can be improved. It contends a lot on port 5 on Alder Lake architectures (which handles CLMUL as well as EXTRACTI32X4) and also has a lot of interdependent instructions. In particular, CLMUL and extract operations both run on port 5 with latency 3 (on Alder Lake), so avoiding serial sequences of these instructions would be preferable.
One way to reduce this latency would be to shift all of the 64-bit lanes into their final position in one go, before doing any XORs to merge them together. This means that we'd have one phase of CLMULs and XORs to shift into a series of 128-bit lanes within our 512-bit register, then a phase of further EXTRACT/XOR reduction to 128 bits before going into Barrett reduction.
Sketch would look a bit like:
x0 = _mm512_load_si512((__m512i *)ksplat); // computing ksplat is left as an exercise for the
                                           // reader
// Shift all 128-bit lanes into their final positions
x2 = _mm512_clmulepi64_epi128(x1, x0, 0x00); // p5, lat=3, 1/tp=1
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
// Now merge down to 128 bits. Here, again, EXTRACT (p5) and XOR (p05) should run in parallel
// as much as possible.
a0 = _mm512_extracti32x4_epi32(x1, 0); // p5, lat=3, 1/tp=1
a1 = _mm512_extracti32x4_epi32(x1, 1);
a2 = _mm512_extracti32x4_epi32(x1, 2);
a3 = _mm512_extracti32x4_epi32(x1, 3);
a0 = _mm_xor_si128(a0, a1); // p015, lat=1, 1/tp=0.33
a0 = _mm_xor_si128(a0, a2);
a0 = _mm_xor_si128(a0, a3);
// continue with Barrett reduction
While this would need to be benchmarked, I'd expect it to have significantly lower latency due to a reduction of dependencies between high-latency instructions. The main downside is that ksplat will be larger, resulting in a larger cache footprint (but we save on I-cache size, so maybe it's a wash?)
It is definitely possible to fold from 512 to 256 bits, then from 256 to 128 bits (fewer steps than above) - an example fold from 256 to 128 is here: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx512_t2/crc32_by16_vclmul_avx512.asm#L136
Here is an example of folding 8x128 bits into 1x128 bits using different constants in one step:
https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx512_t2/crc32_by16_vclmul_avx512.asm#L184C1-L184C1
/*
 * Fold into 512-bits.
 */
x0 = _mm512_load_si512((__m512i *)k3k4);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
x1 = _mm512_xor_si512(x1, x5);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x3);
x1 = _mm512_xor_si512(x1, x5);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x4);
x1 = _mm512_xor_si512(x1, x5);
We might want to have a path that skips forward to the loop at L121 without doing this reduction for inputs smaller than 256 bytes. This should be free as we have to compare the input length anyway before the loop at L68
Yep, note that the initial CRC value needs to be handled separately then.
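A hedged sketch of one common way to handle that, reusing the variable names from the quoted diff (this is not the PR's code): XOR the incoming crc into the low 32 bits of the first data block before the first fold.
/* Hypothetical: seed the first 64-byte block with the incoming (reflected) CRC value. */
x1 = _mm512_loadu_si512((const __m512i *)input);
x1 = _mm512_xor_si512(
    x1, _mm512_inserti32x4(_mm512_setzero_si512(), _mm_cvtsi32_si128((int)crc), 0));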
/* For small input, forget about alignment checks - simply compute the CRC32c one byte at a time */
if (length < (int)sizeof(slice_ptr_int_type)) {
    while (length-- > 0) {
There are 16/32/64-bit variants of the CRC32 op that we should probably take advantage of.
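For illustration, a hedged sketch of consuming a tail with the wider forms (the helper name and the memcpy-based loads are made up for this example, and the 64-bit form needs an x86-64 build):
#include <nmmintrin.h> /* _mm_crc32_u8 / _mm_crc32_u32 / _mm_crc32_u64 */
#include <stdint.h>
#include <string.h>

static uint32_t crc32c_tail(uint32_t crc, const uint8_t *input, int length) {
    while (length >= 8) { /* widest form first */
        uint64_t word;
        memcpy(&word, input, sizeof(word));
        crc = (uint32_t)_mm_crc32_u64(crc, word);
        input += 8;
        length -= 8;
    }
    if (length >= 4) {
        uint32_t word;
        memcpy(&word, input, sizeof(word));
        crc = _mm_crc32_u32(crc, word);
        input += 4;
        length -= 4;
    }
    while (length-- > 0) {
        crc = _mm_crc32_u8(crc, *input++);
    }
    return crc;
}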
This will use the 64-bit version on x64 and the 32-bit version on x86. That’s what those typedefs and defines in the private header are for.
This will use the 64-bit version on x64 and the 32-bit version on x86. That’s what those typedefs and defines in the private header are for.
Nevermind, I was at the wrong place in the file. This branch is for tiny inputs, and it doesn’t seem worth optimizing to me.
I would be interested in performance for small inputs as well - but I'd agree that's more likely to be dwarfed by surrounding code for sure.
while (length >= 64) {
    x2 = _mm512_loadu_si512((__m512i *)input);

    x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
    x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
    x1 = _mm512_xor_si512(x1, x2);
    x1 = _mm512_xor_si512(x1, x5);

    input += 64;
    length -= 64;
}
I wonder if there's any opportunity to use mask registers to perform a sub-64-byte CRC operation? This would remove the need for a byte-by-byte loop, replacing it with a _mm512_maskz_loadu_epi8 operation, an extra round of CLMUL, and some ancillary scalar logic to compute the right mask value and select the right set of shift constants from a static table.
For this to work, we'd need to prepare a table of multiplication constants for various shift lengths, which would increase our cache footprint. There's probably a tradeoff here - maybe have a table for each 64-bit increment and use the CRC32 instruction for the remaining bytes?
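As a rough, hedged sketch of just the masked-load part (AVX512BW intrinsics; the selection of fold constants it would need is omitted):
/* Assuming 0 < length < 64 here: build a mask with `length` low bits set.
 * AVX-512 masked loads suppress faults on masked-out elements, so this never
 * touches memory past the end of the buffer. */
__mmask64 tail_mask = ((__mmask64)1 << length) - 1;
__m512i tail = _mm512_maskz_loadu_epi8(tail_mask, input);
/* ...then pick shift/fold constants for this length and run one more CLMUL round... */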
This solution may be possible; the loaded message has to match the right constants as well. The ISA-L and ipsec-mb CRC implementations have a special code path for inputs of less than 256 bytes (with multiple cases considered there) - essentially it processes the message in 16-byte chunks.
tests/crc_test.c
struct aws_byte_buf avx_buf;
/* enough for two avx512 runs */
aws_byte_buf_init(&avx_buf, allocator, 512);
aws_device_random_buffer(&avx_buf);

uint32_t crc = aws_checksums_crc32c_sw(avx_buf.buffer, (int)avx_buf.len, 0);
uint32_t hw_crc = aws_checksums_crc32c_hw(avx_buf.buffer, (int)avx_buf.len, 0);

aws_byte_buf_clean_up(&avx_buf);
ASSERT_UINT_EQUALS(hw_crc, crc);
This should test various sizes of unaligned buffers as well. I'd prefer to test every buffer size from 0 to 512+1 bytes, and have some tests that deliberately unalign the buffer, and/or put it just before a page boundary where the next page is inaccessible.
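A hedged sketch of what that could look like in the style of this test (the 1024-byte buffer size and the 16-byte misalignment sweep are arbitrary choices; the page-boundary case is not shown):
struct aws_byte_buf avx_buf;
/* large enough for the biggest length + offset combination below */
aws_byte_buf_init(&avx_buf, allocator, 1024);
aws_device_random_buffer(&avx_buf);

for (size_t len = 0; len <= 513; ++len) {
    for (size_t offset = 0; offset < 16; ++offset) {
        uint32_t sw_crc = aws_checksums_crc32c_sw(avx_buf.buffer + offset, (int)len, 0);
        uint32_t hw_crc = aws_checksums_crc32c_hw(avx_buf.buffer + offset, (int)len, 0);
        ASSERT_UINT_EQUALS(sw_crc, hw_crc);
    }
}

aws_byte_buf_clean_up(&avx_buf);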
x0 = _mm512_load_si512((__m512i *)k3k4);

x5 = _mm512_clmulepi64_epi128(x1, x0, 0x00);
x1 = _mm512_clmulepi64_epi128(x1, x0, 0x11);
x1 = _mm512_xor_si512(x1, x2);
x1 = _mm512_xor_si512(x1, x5);
The values that start out in x1 are effectively being multiplied three times by k3k4. (Similarly, x2 is multiplied twice, x3 once). Would we be able to reduce the dependency chain here by having three premultiplied constants instead of one, and XORing at the end?
Note that having three premultiplied constants does increase cache usage - we could avoid this by multiplying k3k4 multiple times into each lane instead. That is:
x4 = MUL128(k3k4, k3k4) // CLMUL * 2, XOR; lat=5, 1/tp=2 (contends on port 5 for CLMUL)
x1 = MUL128(x1, k3k4)
x3 = MUL128(x3, k3k4)
x2 = MUL128(x2, x4) // waits 2 cycles for CLMUL latency
x1 = MUL128(x1, x4) // waits 2 cycles for CLMUL latency
x1 = (x2 ^ x3) ^ x1 // the first XOR can execute in parallel with the last MUL128
While this requires more multiplications (10 CLMULs vs 6), because CLMULs have a high latency, this approach might end up better utilizing the CLMUL capacity of the CPU.
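For reference, MUL128 in the sketch above appears to stand for one fold step - two CLMULs plus an XOR, per the inline annotations; a hedged intrinsics equivalent (the helper name is made up):
/* One fold step per 128-bit lane: clmul(lo(a), lo(b)) ^ clmul(hi(a), hi(b)). */
static inline __m512i fold_mul128(__m512i a, __m512i b) {
    __m512i lo = _mm512_clmulepi64_epi128(a, b, 0x00);
    __m512i hi = _mm512_clmulepi64_epi128(a, b, 0x11);
    return _mm512_xor_si512(lo, hi);
}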
On further thought, the premultiplication trick might not work, since we're not doing a real 128-bit multiplication operation. It's probably simpler to just load multiple constants... though it would be interesting to see what it would cost to compute this constant using a round of Barrett reduction, and whether this could run in parallel with the memory loads going on in the first part of this function.
if (detected_sse42 && detected_clmul) {
    return aws_checksums_crc32c_sse42(input, length, crc);
}
Will the sse42 impl be worth invoking after we've finished processing the lead portion using the avx512 implementation?
Thank you for the comments. I will review them closely.
For full disclosure - I re-used the avx512 intrinsic implementation from https://chromium.googlesource.com/chromium/src/third_party/zlib/+/b890619bc2b193b8fbe9c1c053f4cd19a9791d92/crc32_simd.c
but recomputed the constants for the crc32c polynomial :)
@pbadari I've got the build surgery done and tests passing if you'd like to work on the avx512 comments from bdonlan@
you'll need the AVX512 branch from aws-c-common until we merge it
Yes. I am reviewing the avx512 comments from bdonlan
Jonathan, most of the review comments (from Donlan) for the AVX512 code are further performance improvements, which require a careful rewrite/prototype and performance analysis. I reached out to our area expert for his input. I am wondering if we can merge the current patch and update it further when we have new code ready? Please let me know.
Unfortunately, AVX512 has a history of being an incredibly risky addition to an already functioning (possibly already running really hot) system. Various chipset versions introduce timing issues for side-channel attacks as well as side-effects to other processes sharing the CPU. So we're going to have to run a lot of tests before we can just run this in production anyway, and we'd like the code to be close to its final structure and algorithm before running that analysis.
This particular code runs for a lot of S3 PUT and GET operations across multiple SDKs, so any side-effects would most likely be felt across a large blast radius.
We are submitting an AVX-512 based implementation of crc32 to zlib-chromium as well and want to make sure that both code contributions are licensed appropriately. For now, can we withdraw the patch/submission? I will resolve the issue and re-submit.
Great work on crc performance improvements using new instructions. I am adding a few pointers and comments based on my previous experience with CRC implementations.
x1 = _mm512_xor_si512(x1, x5);
x2 = _mm512_xor_si512(x2, x6);
x3 = _mm512_xor_si512(x3, x7);
x4 = _mm512_xor_si512(x4, x8);

x1 = _mm512_xor_si512(x1, y5);
x2 = _mm512_xor_si512(x2, y6);
x3 = _mm512_xor_si512(x3, y7);
x4 = _mm512_xor_si512(x4, y8);
The ternary logic operation code 0x96 (3-way XOR) can be used to reduce the number of XORs here (and in other places below).
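As a minimal sketch, applying that suggestion to the quoted lines above (vpternlogq via _mm512_ternarylogic_epi64, AVX512F; 0x96 is the truth table for a ^ b ^ c):
x1 = _mm512_ternarylogic_epi64(x1, x5, y5, 0x96); /* x1 ^ x5 ^ y5 in one instruction */
x2 = _mm512_ternarylogic_epi64(x2, x6, y6, 0x96);
x3 = _mm512_ternarylogic_epi64(x3, x7, y7, 0x96);
x4 = _mm512_ternarylogic_epi64(x4, x8, y8, 0x96);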
Recent compiler versions would seem to come up with some ternary logic instructions on their own. For what it is worth, today I came up with an intuitive way of defining the 0x96:
class ternary
{
static constexpr uint8_t A= 0b11110000;
static constexpr uint8_t B= 0b11001100;
static constexpr uint8_t C= 0b10101010;
public:
static constexpr uint8_t XOR3= A ^ B ^ C;
static constexpr uint8_t XNOR3= uint8_t(~(A ^ B ^ C));
static constexpr uint8_t XOR2_AND= (A ^ B) & C;
};
/** @return a^b^c */
static inline __m128i xor3_128(__m128i a, __m128i b, __m128i c)
{
return _mm_ternarylogic_epi64(a, b, c, ternary::XOR3);
}
In last-millennium C, you might write something like the following:
#define TERNARY_A 0xf0
#define TERNARY_B 0xcc
#define TERNARY_C 0xaa
#define TERNARY_XOR3 (TERNARY_A ^ TERNARY_B ^ TERNARY_C)
#if defined(AWS_HAVE_AVX512_INTRINSICS) && (INTPTR_MAX == INT64_MAX)
    int chunk_size = length & ~63;

    if (detected_avx512 && detected_vpclmulqdq && detected_clmul) {
I'd say detected_clmul is not needed, as it will always be true if detected_vpclmulqdq is true.
/*
 * crc32c_avx512(): compute the crc32c of the buffer, where the buffer
 * length must be at least 256, and a multiple of 64. Based on:
Forcing the length to be a multiple of 64 bytes looks too restrictive. K-masks can be used to deal with buffer lengths that are not a multiple of 64 bytes. Any reason for this condition?
Addressed most of the above comments and created a new PR to merge/review changes:
Internally, this PR inspired lots of conversation and resulted in some more code being contributed here for avx512 crc32 as well. If you created the PR from your own fork it won’t work, because the CI needs creds from someone authorized for our AWS account to run. If so, I’ll move it to a branch and apply the new additional changes on top as appropriate. Thanks for this contribution!
Changes in #73 are for the AVX512 branch. Also, looking at the code, the crc32 version is NOT (Intel) optimized. Is it not used heavily? Do we need to provide an optimized version for it too? If so, we may want to have a common implementation for both (crc32 and crc32c), extracting the constants out. Please let me know.
Jonathan, any update on the new branch with all the changes for us to review/analyze?
It's the next item on my list. Hopefully I'll be able to pick it back up later today.
I now have common code with avx512 support for crc32 and crc32c. I will wait for the previous changes to be merged to the branch before submitting the new code.
Jonathan, any update?
Yes, sorry, I'm still finishing up some other high-priority work and will switch back to this as soon as I possibly can.
Jonathan, any update?
Hi Jonathan, can you please let me know the next steps? Anything I can do to help?
Sorry, just now getting back to this to try and finish up my work for the year, and I noticed you just wanted PR #73 merged. I'm sorry for the delay; I got pulled off into other work. I've merged it over to this branch.
Thanks Jonathan, wish you a happy new year.
We would like to set up time with you sometime in Jan to discuss priorities for our planned contributions to AWS CRT.
Regards,
--
Ravi
Jonathan, please let me know what I can do to help to push this work along.
This week, I finished translating some hand-written NASM code from https://github.com/intel/intel-ipsec-mb/ into C++11 and created https://github.com/dr-m/crc32_simd/. There are no alignment or buffer size restrictions in that implementation. The main loop processes 256 bytes at a time. Any leftovers from the smaller special cases (multiples of 128, 64, 32, 16 bytes) will be handled by a masked load.
On a system where I tested the
Should I create a separate pull request for this?
I have not reviewed your changes, but did a quick test and ran into this.
Thank you for taking a look at it, @pbadari. Can you provide some more details?
I was able to get the test working with this change:
I also did a quick performance test comparing your implementation against the one I am working on. Looks like perf is pretty close. The good thing about your code is that it's common code for crc32 and crc32c. Are you planning to clean it up and submit it to the aws-checksums project?
Thank you. I made this off-by-one mistake in dr-m/crc32_simd@ac55c31 when I implemented an option to invoke the CRC functions on input files. Apparently I forgot to rerun the test program without any arguments, on an AVX512-capable machine where that part of the code would run. I remember that I did run it with some arguments.
Yes, I would love to. I looked around and did not find any other SIMD-optimized CRC library for C or C++ that would target multiple ISAs (other than as part of a bigger project).
Do you have any use for the non-reflected polynomials? If not, the conversion should be rather trivial.
Along with those PRs, we need the following to be merged: #88: incorrect checksum due to extra bit flip.
Yes. Refactoring to C would be nice.
The aws-checksums polynomials are reflected. So, if you can provide only the reflected version, that would be nice.
This implementation is based on crc32_refl_by16_vclmul_avx512 in https://github.com/intel/intel-ipsec-mb/ with some optimizations. Some of the code is based on awslabs#72.
I filed #90 for my version.
This implementation is based on crc32_refl_by16_vclmul_avx512 in https://github.com/intel/intel-ipsec-mb/ with some optimizations. Changes to CMakeLists.txt and source/intel/asm/crc32c_sse42_asm.c are based on awslabs#72. This also fixes a bug in aws_checksums_crc32c_hw() when 128-bit pclmul is not available: crc_intrin_fn was being invoked on bytes instead of 32-bit or 64-bit words. The aws-checksums-tests was extended to cover all SIMD implementations.

Note: The availability of the Intel CRC-32C instructions is checked as part of testing AWS_CPU_FEATURE_SSE_4_2. Both ISA extensions were introduced in the Intel Nehalem microarchitecture.

For compiling this, https://github.com/awslabs/aws-c-common must be installed and CMAKE_MODULE_PATH must point to it, e.g.: cmake -DCMAKE_MODULE_PATH=/usr/local/lib/cmake.

The AWS_CPU_FEATURE_AVX512 currently only checks for AVX512F and not other features that this implementation depends on: AVX512VL, AVX512BW, AVX512DQ. According to https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 there currently exist no CPUs that would support VPCLMULQDQ without supporting all those AVX512 features.

The architecture target evex512 is something that was introduced as mandatory in GCC 14 and clang 18 as part of introducing the AVX10.1-512 target, which basically is a new name for a number of AVX512 features. Older compilers do not recognize this target, but they do emit EVEX encoded instructions.
scratch space for this:
#68
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.