-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect SHA extensions #12549
Detect SHA extensions #12549
Conversation
Hi - I see some of CI is failing. I don't think these failures are related to my changes, since this code is not even used at runtime. Please tell me if I should look into the failures. |
@cybojanek Awesome, really looking forward for the referenced improvements. I just want to link some old but never finished PRs related to SHA (not sure if they still apply to current OpenZFS though): Multi-buffer sha256 support in SPL to ZFS (openzfs/spl#646) |
Thanks for the links to the previous issues. Just posting here that I'm still working on this issue. (No ETA - busy with family/work) |
Thanks for working on this, The test failures are known failures, i.e. unrelated to your changes. I'll look into the build failure some more. |
I labeled the PR as "Work in Progress" and will update status once you give the go. |
5f165a6
to
948739b
Compare
Some performance numbers using an EC2 m6i.xlarge instance echo x86_64 > /sys/module/icp/parameters/icp_sha256_impl
modprobe brd rd_nr=1 rd_size=$((12288 * 1024))
zpool create -f -o ashift=12 \
-O acltype=posixacl \
-O relatime=on \
-O xattr=sa \
-O dnodesize=legacy \
-O normalization=formD \
-O devices=off \
-O compression=off \
-O checksum=sha256 \
zscratch /dev/ram0
dd if=/dev/urandom of=/zscratch/data.bin bs=1M count=12000 status=progress conv=fdatasync
zpool export zscratch
for X in generic x86_64 sha-avx sha-ssse3 sha-ni; do
echo $X > /sys/module/icp/parameters/icp_sha256_impl
sleep 1
cat /sys/module/icp/parameters/icp_sha256_impl
zpool import zscratch
echo ""
dd if=/zscratch/data.bin of=/dev/null bs=1M status=progress
zpool export zscratch
done
I also did a similar thing with scrub: zpool export, change algorithm, zpool import, zpool scrub:
|
@tonynguien Hi! I think this is ready for review. @rincebrain Helped me fix a few things in the initial review here cybojanek#1 |
- Add HAVE_SHA compiler define - Add zfs_sha_available function - Detect SHA in cpu feature bits Signed-off-by: Jan Kasiak <[email protected]>
Signed-off-by: Jan Kasiak <[email protected]>
Signed-off-by: Jan Kasiak <[email protected]>
Signed-off-by: Jan Kasiak <[email protected]>
948739b
to
733860f
Compare
Hey,
Please view: |
How did you see those warning messages - do they just show up when you load the module? |
Boot process, dmesg info. I`m build the Linux kernel with Clang + LTO + CFI. Debug from Control-Flow Integrity (CFI).
|
@AndyLavr @cybojanek, that's because SHA-{256,512} implementations are getting casted:
Function/callback casts are not allowed with ClangCFI as they are considered as attacks. I'd say they should be avoided in general, there's always a way to get strict matches. (just BTW, I have a commit here in ZFS repo where I fixed all function casts found with ZTS: 23c13c7) |
I gave this branch a spin on my i7-3770 ... I guess it coulda gone better. 🥲 Benchmark numbers look fine, this CPU doesn't have NI but sse/avx are a pretty good win:
However, stability is a real problem; while heavily accessing my datasets which use SHA512, I get:
|
FWIW, I'm using a PREEMPT kernel:
My amateur guess as to the cause of the problem is that it's the same issue that affected the zfs x64 crypto code; I vaguely recall it's something like, Linux doesn't let non-GPL modules save and restore FPU/spicy regs across context switches, and ZFS' workaround for the crypto code was to explicitly forbid pre-emption for the duration of the crypto code... |
Your guess seems right. Since gcc uses SIMD instructions for optimizing plain C code, failing to preserve the FPU state affects all kind of software, not just the ones doing float calculations. So your modprobe and python fails are lining up with your guess.
Close but not quite. GPL modules can tell the kernel to save and restore FPU state on context switches, meaning the overhead only takes place there. Bering CDDL we can't. On any FPU use we have to disable preemption and save the FPU state and do the reverse when we're done. This incurs quite a big overhead, making the use of SIMD instruction considerably less effective, especially if there are only a few. |
* The non-indented lines are instructions related to the message schedule. | ||
* | ||
* void sha256_ni_transform(uint32_t *digest, const void *data, | ||
uint32_t numBlocks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing continuation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean by this.
What do you want it to look like?
This is a comment from the original Intel source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised cstyle isn't up in arms about this (okay, less so since it's in a .S):
* The non-indented
*
* void sha256_ni_transform(uint32_t *digest, const void *data,
* uint32_t numBlocks);
L90 is missing the block comment continuation and has a double-tab continuation instead (as opposed to four spaces after the comment *
)
module/icp/algs/impl/impl.c
Outdated
uint64_t run_count = 0; | ||
uint64_t start, run_time_ns; | ||
|
||
kpreempt_disable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this (and calls to them in general) need a kfpu_begin()
/kfpu_end()
wrapper if you're calling out into SIMD/FPU instructions, or you can wind up with problems like @adamdmoss is having. (if you have one and it somehow didn't show up in my grep of the patch, my apologies.)
See, for example, how the GCM code calls it before calling accelerated instructions, or the fletcher or raidz parity code likewise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K - I added the wrapping in dfa1f42
Should I remove the calls to kpreempt_disable
- if I use kfpu_begin
instead will that be ok?
I got the kpreempt_disable
from module/zcommon/zfs_fletcher.c
where they do benchmarking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you check, the definition of kfpu_begin is basically "kpreempt_disable();
[save FPU state]
", so you should be good doing that.
(Yes I know there's multiple this is just one example.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - now I see it.
I removed the redundant preempt calls from the benchmark portion.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few notes on (a) const correctness and (b) aggressively consting actually constant data (cf. #12899). Beside the function argument ones, these repeat for the other (512) impls.
dfa1f42
to
5db5df2
Compare
Stability appears good now, thanks. I'll keep an eye on it. |
On a 5+ kernel that doesn't have the required functions patched back in (like the Liquorix kernel builds), right? |
Is this ready to go or is it still a WIP? |
When BLAKE3 is in, I would like to add some additional sha256 and sha512 SIMD code. |
Github seems to have blackholed my email reply, but: |
Yes I know these sources, but I searched for public domain code in the first place ... and will implement SSE2 CC0 code ... which can then be reused on PPC, AARCH64 and so on via SIMDE ... |
5db5df2
to
8969a46
Compare
Glad to see some activity here! I pushed out a one line change I forgot to push out a while ago. |
@cybojanek Can you may rebase this and give an status update? Or is this PR put behind because of mcmilk mentioned work here? Much thanks anyway. |
@mcmilk How does your branch compare to this one? At a quick glance, it looks like you forked off of master, but also have some new code for generalizing the Is it only for freebsd? Or also for Linux? Does it do benchmarking? I don't want to duplicate work nor effort. |
My branch does the same for bsd and linux... it's not ready, but I will also try to include the hardware specific impls. of freebsd as well... Edit: the first commit of my branch removes ALL old SHA2 stuff... and re-implements the generic function with public domain code. The old Sun/Solaris impl. is history then. The generic code is faster and smaller. |
If you'd like to go poke around some more sources of implementations,
Intel has their implementations for all sorts of subsets of x86 and i think
aarch64 in a few places under BSD-3 and Apache-2 over at
https://github.com/intel/intel-ipsec-mb and
https://github.com/intel/ipp-crypto
…On Sun, Apr 24, 2022 at 1:03 PM Tino Reichardt ***@***.***> wrote:
When BLAKE3 <#12918> is in, I would
like to add some additional sha256 and sha512 SIMD code.
I am currently working <https://github.com/mcmilk/sha2-testing> on public
domain code for choosing the implementation for Intel x86-64, PPC64 and
aarch64 architectures....
—
Reply to this email directly, view it on GitHub
<#12549 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABUI7PQ4GR7FMBWTRU7H3DVGV5GVANCNFSM5DV5WUXA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Are these not okay? They are all ready seem to work nice - #13741. |
@cybojanek - we can maybe close this pull request? You can see here: https://github.com/mcmilk/sha2-testing - why I have preferred the openssl variants over the Intel ones. When you try out the current master branch, you can re-check the benchmarks via |
I'm glad something got merged in. Looking forward to seeing it propagate to my distro :D |
Detect SHA CPU extensions
Motivation and Context
Detect and use SHA / vector CPU extensions in order to optimize checksum calculations.
Description
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by
.