Fletcher4: Incremental updates and ctx calculation #5164
Conversation
Force-pushed from f41a9d3 to 9e5b0a6.
@@ -404,7 +386,7 @@ fletcher_4_native_impl(const fletcher_4_ops_t *ops, const void *buf,
 	ops->fini_native(zcp);
 }
 
-void
+inline void
 fletcher_4_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
 {
Please change line 407 to scalar; this would cause unnecessary recursion. The same goes for the byteswap version.
Updated. There's no ctx handling overhead whenever we explicitly want the scalar version.
fletcher_4_incremental_impl(boolean_t native, const void *buf, uint64_t size,
    zio_cksum_t *zcp)
{
	static const uint64_t FLETCHER_4_INC_MAX = 8ULL << 20;
Please add a comment about the reason.
done
It's misleading for FLETCHER_4_INC_MAX to be a variable. Why not make this a #define with a clear comment... which I see you've done.
Force-pushed from 9e5b0a6 to c513ab8.
VERIFY0(fletcher_4_impl_set("cycle"));

while (run_count-- > 0) {
Extra whitespace
fixed
Force-pushed from c513ab8 to dbb1ef8.
#if defined(__x86_64) && defined(HAVE_AVX512F)
	zfs_fletcher_avx512_t avx512[4];
#endif
} fletcher_4_ctx_t;
When all implementations are available this context ends up being fairly large, 480 bytes by my quick math. This isn't huge, but I noticed it is stored on the stack, which might lead to problems. We may want to make this a union to avoid wasting the space.
@behlendorf ctx is already a union, so on x86 it's going to be 256 B (avx512).
All users of fletcher4 methods must call `fletcher_4_init()/_fini()`. There's no benchmarking overhead when called from user-space.

Signed-off-by: Gvozden Neskovic <[email protected]>
Aside from the stack usage comment, which I don't think should be changed until/unless it becomes an issue, this LGTM.
Combine incrementally computed fletcher4 checksums. Checksums are combined a posteriori, allowing for parallel computation on chunks to be implemented if required. The algorithm is general and does not require changes in each SIMD implementation. A new test in ztest verifies incremental fletcher computations.

The checksum combining matrix for two buffers `a` and `b`, where `Ca` and `Cb` are the respective fletcher4 checksums, `Cab` is the combined checksum, and `s` is the size of buffer `b` (divided by sizeof(uint32_t)), is:

    Cab[A] = Cb[A] + Ca[A]
    Cab[B] = Cb[B] + Ca[B] + s * Ca[A]
    Cab[C] = Cb[C] + Ca[C] + s * Ca[B] + s(s+1)/2 * Ca[A]
    Cab[D] = Cb[D] + Ca[D] + s * Ca[C] + s(s+1)/2 * Ca[B] + s(s+1)(s+2)/6 * Ca[A]

NOTE: this calculation overflows for larger buffers. Thus, internally, the calculation is performed on 8 MiB chunks.

Signed-off-by: Gvozden Neskovic <[email protected]>
Init, compute, and fini methods are changed to work on an internal context object. This is necessary because the ABI does not guarantee that SIMD registers will be preserved across function calls. This is technically the case in the Linux kernel between `kfpu_begin()/kfpu_end()`, but it breaks user-space tests and some kernels that don't require disabling preemption for using SIMD (osx).

Use scalar compute methods in place for small buffers, and when the buffer size does not meet SIMD size alignment.

Signed-off-by: Gvozden Neskovic <[email protected]>
Force-pushed from dbb1ef8 to 5bf703b.
@behlendorf Rebased. I think this is the best I can do for now. 5cc78dc changes the cksum interface to add context initialization, but that is not helpful for fletcher (we need a zeroed ctx). Also, that patch has various cksum contexts on the stack.
The patch fixes ABI issues with the fletcher4 code, adds support for incremental updates, and adds a ztest method for testing.
Partly discussed in #5093