
Fletcher4: Incremental updates and ctx calculation #5164

Merged (3 commits into openzfs:master, Oct 7, 2016)

Conversation

ironMann (Contributor)

This patch fixes ABI issues in the fletcher4 code, adds support for incremental updates, and adds a ztest method for testing.
Partly discussed in #5093.

@ironMann force-pushed the simd_incr_fletcher branch 3 times, most recently from f41a9d3 to 9e5b0a6, on September 28, 2016 at 21:53
behlendorf (Contributor)

@tuxoko could you review and comment on this alternate approach to the one you proposed in #5093?

```diff
@@ -404,7 +386,7 @@ fletcher_4_native_impl(const fletcher_4_ops_t *ops, const void *buf,
 	ops->fini_native(zcp);
 }
 
-void
+inline void
 fletcher_4_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
 {
```
tuxoko (Contributor), Sep 30, 2016

Please change line 407 to scalar; as written, this causes unnecessary recursion. The same goes for the byteswap version.

ironMann (Contributor, Author)

Updated. There's no ctx handling overhead whenever we explicitly want the scalar version.
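A minimal sketch of the resulting dispatch shape (names and scaffolding simplified for illustration; `fletcher_4_simd_native` is a hypothetical stand-in for the runtime-selected SIMD path): whole small buffers and the sub-64-byte tail go straight to the scalar routine, so there is no re-entry into `fletcher_4_native()` and no context setup for the scalar case.

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the ZFS checksum type (illustration only). */
typedef struct { uint64_t zc_word[4]; } zio_cksum_t;

/* Continue the running sums A, B, C, D over native 32-bit words. */
static void
fletcher_4_scalar_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
	const uint32_t *ip = buf;
	const uint32_t *ipend = ip + (size / sizeof (uint32_t));
	uint64_t a = zcp->zc_word[0], b = zcp->zc_word[1];
	uint64_t c = zcp->zc_word[2], d = zcp->zc_word[3];

	for (; ip < ipend; ip++) {
		a += *ip;
		b += a;
		c += b;
		d += c;
	}
	zcp->zc_word[0] = a;
	zcp->zc_word[1] = b;
	zcp->zc_word[2] = c;
	zcp->zc_word[3] = d;
}

/*
 * Hypothetical stand-in for the runtime-selected SIMD bulk path; after
 * its fini step, zcp holds the equivalent scalar running sums.
 */
void fletcher_4_simd_native(const void *buf, uint64_t size, zio_cksum_t *zcp);

void
fletcher_4_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
	uint64_t p2size = size & ~(uint64_t)63;	/* 64-byte-aligned bulk */

	memset(zcp, 0, sizeof (*zcp));

	if (p2size > 0)
		fletcher_4_simd_native(buf, p2size, zcp);
	/*
	 * Hand the remainder directly to the scalar routine: re-entering
	 * fletcher_4_native() here would recurse, and the scalar path
	 * needs no SIMD context handling.
	 */
	if (p2size < size)
		fletcher_4_scalar_native((const uint8_t *)buf + p2size,
		    size - p2size, zcp);
}
```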

```c
fletcher_4_incremental_impl(boolean_t native, const void *buf, uint64_t size,
    zio_cksum_t *zcp)
{
	static const uint64_t FLETCHER_4_INC_MAX = 8ULL << 20;
```
Contributor

Please add a comment about the reason.

ironMann (Contributor, Author)

done

behlendorf (Contributor), Sep 30, 2016

It's misleading for FLETCHER_4_INC_MAX to be a variable. Why not make this a #define with a clear comment... which I see you've done.
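The resolution, roughly (a sketch rather than the verbatim patch, assuming the surrounding ZFS definitions of `boolean_t`, `MIN`, `zio_cksum_t`, and `fletcher_4_native()/_byteswap()` are in scope; the combine step is described in the merge commit message further down):

```c
/*
 * Incremental updates are split into chunks of at most 8 MiB: the
 * combine step's cubic coefficient s(s+1)(s+2)/6 overflows 64-bit
 * intermediates for chunk sizes approaching 16 MiB.
 */
#define	FLETCHER_4_INC_MAX	(8ULL << 20)

static inline void
fletcher_4_incremental_impl(boolean_t native, const void *buf, uint64_t size,
    zio_cksum_t *zcp)
{
	while (size > 0) {
		zio_cksum_t nzc;
		uint64_t len = MIN(size, FLETCHER_4_INC_MAX);

		if (native)
			fletcher_4_native(buf, len, &nzc);
		else
			fletcher_4_byteswap(buf, len, &nzc);

		/* Fold the chunk's checksum into the running checksum. */
		fletcher_4_incremental_combine(zcp, len, &nzc);

		size -= len;
		buf = (const uint8_t *)buf + len;
	}
}
```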

```c
	VERIFY0(fletcher_4_impl_set("cycle"));

	while (run_count-- > 0) {

```
Contributor

Extra whitespace

ironMann (Contributor, Author)

fixed


behlendorf (Contributor) left a comment

I like how you were able to abstract away the context so none of the consumers need to be aware of this implementation detail.

@tuxoko any objection to adopting this patch stack in favor of #5093?

```c
#if defined(__x86_64) && defined(HAVE_AVX512F)
	zfs_fletcher_avx512_t avx512[4];
#endif
} fletcher_4_ctx_t;
```
behlendorf (Contributor), Sep 30, 2016

When all implementations are available this context ends up being fairly large, 480 bytes by my quick math. This isn't huge, but I noticed it is stored on the stack, which might lead to problems. We may want to make this a union to avoid wasting the space.

ironMann (Contributor, Author)

@behlendorf ctx is already a union, so on x86 it's going to be 256 B (avx512).
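For reference, the shape of the union (a sketch matching the hunk above; the per-architecture member list is abbreviated and the non-AVX-512 type names are assumptions): the context occupies only its largest member, 4 × 64 B = 256 B for the AVX-512 variant, rather than the sum of all variants.

```c
typedef union fletcher_4_ctx {
	zio_cksum_t scalar;			/* 32 B: A, B, C, D */
#if defined(HAVE_SSE2)
	zfs_fletcher_sse_t sse[4];		/* 4 x 16 B lanes */
#endif
#if defined(HAVE_AVX) && defined(HAVE_AVX2)
	zfs_fletcher_avx_t avx[4];		/* 4 x 32 B lanes */
#endif
#if defined(__x86_64) && defined(HAVE_AVX512F)
	zfs_fletcher_avx512_t avx512[4];	/* 4 x 64 B, the largest */
#endif
} fletcher_4_ctx_t;
```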

tuxoko (Contributor) commented Oct 1, 2016

@behlendorf
Please go ahead.

All users of fletcher4 methods must call `fletcher_4_init()/_fini()`.
There's no benchmarking overhead when called from user-space.

Signed-off-by: Gvozden Neskovic <[email protected]>
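A hedged user-space sketch of the contract this commit establishes (the test scaffolding is invented for illustration; `ZIO_SET_CHECKSUM`, `ZIO_CHECKSUM_EQUAL`, and `VERIFY` are existing ZFS macros, and the headers a real consumer pulls in may differ):

```c
#include <zfs_fletcher.h>	/* fletcher_4_* declarations */
#include <string.h>

int
main(void)
{
	static uint32_t data[1024];
	zio_cksum_t one_shot, incr;

	memset(data, 0xa5, sizeof (data));

	fletcher_4_init();	/* required before any fletcher4 call */

	/* One-shot checksum of the whole buffer. */
	fletcher_4_native(data, sizeof (data), &one_shot);

	/* The same data fed in two pieces must give the same result. */
	ZIO_SET_CHECKSUM(&incr, 0, 0, 0, 0);
	fletcher_4_incremental_native(data, 1024, &incr);
	fletcher_4_incremental_native((uint8_t *)data + 1024,
	    sizeof (data) - 1024, &incr);

	VERIFY(ZIO_CHECKSUM_EQUAL(one_shot, incr));

	fletcher_4_fini();	/* matching teardown */
	return (0);
}
```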
behlendorf (Contributor)

@ironMann are you happy with this change, and do you consider it ready to merge? If so, can you rebase it on master to resolve the new zfs_fletcher.c conflict?

@tuxoko if you're happy with these changes, can you approve it?

behlendorf (Contributor) left a comment

Aside from the stack usage comment, which I don't think should be changed until/if it becomes an issue, this LGTM.

Combine incrementally computed fletcher4 checksums. Checksums are combined a posteriori, which allows parallel computation over chunks to be implemented later if required. The algorithm is general and requires no changes to the individual SIMD implementations.
A new test in ztest verifies incremental fletcher computations.

The checksum combining matrix for two adjacent buffers `a` and `b`, where `Ca` and `Cb` are the respective fletcher4 checksums, `Cab` is the combined checksum, and `s` is the size of buffer `b` in 32-bit words (bytes divided by sizeof(uint32_t)), is:

Cab[A] = Cb[A] + Ca[A]
Cab[B] = Cb[B] + Ca[B] + s * Ca[A]
Cab[C] = Cb[C] + Ca[C] + s * Ca[B] + s(s+1)/2 * Ca[A]
Cab[D] = Cb[D] + Ca[D] + s * Ca[C] + s(s+1)/2 * Ca[B] + s(s+1)(s+2)/6 * Ca[A]

NOTE: this calculation overflows 64-bit arithmetic for larger buffers. Thus, internally, the calculation is performed on chunks of at most 8 MiB.

Signed-off-by: Gvozden Neskovic <[email protected]>
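Expressed in code, the matrix collapses into one combine step per chunk. A sketch, assuming the usual `zio_cksum_t` layout where `zc_word[0..3]` hold A..D, with `zcp` holding the running checksum `Ca` and `nzcp` holding `Cb` for the `size`-byte chunk just checksummed:

```c
static inline void
fletcher_4_incremental_combine(zio_cksum_t *zcp, const uint64_t size,
    const zio_cksum_t *nzcp)
{
	const uint64_t c1 = size / sizeof (uint32_t);	/* s */
	const uint64_t c2 = c1 * (c1 + 1) / 2;		/* s(s+1)/2 */
	const uint64_t c3 = c2 * (c1 + 2) / 3;		/* s(s+1)(s+2)/6 */

	/*
	 * Update D down to A so every row still reads the pre-update
	 * Ca values it depends on.
	 */
	zcp->zc_word[3] += nzcp->zc_word[3] + c1 * zcp->zc_word[2] +
	    c2 * zcp->zc_word[1] + c3 * zcp->zc_word[0];
	zcp->zc_word[2] += nzcp->zc_word[2] + c1 * zcp->zc_word[1] +
	    c2 * zcp->zc_word[0];
	zcp->zc_word[1] += nzcp->zc_word[1] + c1 * zcp->zc_word[0];
	zcp->zc_word[0] += nzcp->zc_word[0];
}
```

The divisions are exact: c1(c1+1) is always even, and c1(c1+1)(c1+2) always carries a factor of 3, so no precision is lost before the intermediate product c2*(c1+2) starts overflowing near 16 MiB; hence the 8 MiB chunk cap.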
Init, compute, and fini methods are changed to work on an internal context object. This is necessary because the ABI does not guarantee that SIMD registers are preserved across function calls. They technically are preserved in the Linux kernel between `kfpu_begin()/kfpu_end()`, but relying on that breaks user-space tests and kernels that don't require disabling preemption to use SIMD (e.g. OS X).

Use scalar compute methods in-place for small buffers, and when the buffer size does not meet the SIMD size alignment.

Signed-off-by: Gvozden Neskovic <[email protected]>
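Concretely, the per-implementation entry points now operate on a caller-owned context. A sketch of the ops-table shape this implies (typedef and field names are assumptions based on the hunks above):

```c
typedef void fletcher_4_init_f(fletcher_4_ctx_t *);
typedef void fletcher_4_fini_f(fletcher_4_ctx_t *, zio_cksum_t *);
typedef void fletcher_4_compute_f(fletcher_4_ctx_t *, const void *, uint64_t);

typedef struct fletcher_4_ops {
	fletcher_4_init_f *init_native;		/* zero/seed the context */
	fletcher_4_fini_f *fini_native;		/* fold lanes into zio_cksum_t */
	fletcher_4_compute_f *compute_native;
	fletcher_4_init_f *init_byteswap;
	fletcher_4_fini_f *fini_byteswap;
	fletcher_4_compute_f *compute_byteswap;
	boolean_t (*valid)(void);		/* is this impl usable here? */
	const char *name;
} fletcher_4_ops_t;
```

Because all SIMD state lives in the context between compute calls, nothing has to survive in vector registers across function boundaries, which is exactly what the ABI refuses to promise.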
@ironMann force-pushed the simd_incr_fletcher branch from dbb1ef8 to 5bf703b on October 5, 2016 at 14:53
ironMann (Contributor, Author) commented Oct 5, 2016

@behlendorf Rebased. I think this is the best I can do for now. 5cc78dc changes the cksum interface to add context initialization, but that is not helpful for fletcher (we need a zeroed ctx). Also, that patch has various cksum contexts on the stack, like Skein_256_Ctxt_t, SHA1_CTX, SHA2_CTX... If that becomes an issue, it should be solved across the board.

behlendorf merged commit 482cd9e into openzfs:master on Oct 7, 2016