Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SHA256 fast implementations #1

Closed
wants to merge 4 commits into from
Closed

Conversation

cybojanek
Copy link
Owner

@cybojanek cybojanek commented Sep 6, 2021

Motivation and Context

Improves sha256 hash performance by using avx, avx2, ssse3, or sha CPU extensions.

Description

  • Add SHA extension detection
  • Add icp algorithm selector
  • Use selector with existing sha2 code
  • Add fast implementations

How Has This Been Tested?

NOT YET TESTED

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

TODO

  • Fix sha256_avx2 R_X86_64_32S relocation errors
  • Run existing tests
  • Add new tests for different implementations
  • Update commit messages

module/icp/algs/impl/impl.c Show resolved Hide resolved
module/icp/algs/impl/impl.c Outdated Show resolved Hide resolved
module/icp/algs/impl/impl.c Show resolved Hide resolved
module/icp/asm-x86_64/sha2/sha256_avx2.S Outdated Show resolved Hide resolved
module/icp/asm-x86_64/sha2/sha256_avx2.S Outdated Show resolved Hide resolved
tests/zfs-tests/tests/functional/checksum/Makefile.am Outdated Show resolved Hide resolved
module/icp/include/sha2/sha2_impl.h Show resolved Hide resolved
module/icp/include/impl/impl.h Show resolved Hide resolved
module/icp/algs/sha2/sha2.c Outdated Show resolved Hide resolved
module/icp/algs/sha2/sha2.c Outdated Show resolved Hide resolved
@rincebrain
Copy link

A sha256_bench in /proc/spl/kstat/zfs/ for these, like with fletcher_4_bench or vdev_raidz_bench, to compare the performance of the different options (in case, say, you want to see if using only ssse3 saves you on power usage, but you want to know if the performance hit is likely to make it a nonstarter), would be neat.

Thanks for doing this!

@cybojanek
Copy link
Owner Author

A sha256_bench in /proc/spl/kstat/zfs/ for these, like with fletcher_4_bench or vdev_raidz_bench, to compare the performance of the different options (in case, say, you want to see if using only ssse3 saves you on power usage, but you want to know if the performance hit is likely to make it a nonstarter), would be neat.

Thanks for doing this!

I added the code for this, but still need to:

  • test manually
  • add an automated zfs test
  • decide what to do with the avx2 implementation issue...

@cybojanek
Copy link
Owner Author

I decided to drop the sha256 avx2 implementation. I somewhat understand what is happening, but I don't know how to fix it:

/usr/bin/ld: /home/cybojanek/zfs/lib/libicp/.libs/libicp.a(sha256_avx2.o): relocation R_X86_64_32S against `.rodata.cst512.K256' can not be used when making a shared object; recompile with -fPIC

These are the problematic instructions:

diff --git a/module/icp/asm-x86_64/sha2/sha256_avx2.S b/module/icp/asm-x86_64/sha2/sha256_avx2.S
index 3d74096dd..946173f8e 100644
--- a/module/icp/asm-x86_64/sha2/sha256_avx2.S
+++ b/module/icp/asm-x86_64/sha2/sha256_avx2.S
@@ -603,19 +603,19 @@ last_block_enter:
 
 .align 16
 loop1:
-       vpaddd  K256+0*32(SRND), X0, XFER
+       #vpaddd K256+0*32(SRND), X0, XFER
        vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
        FOUR_ROUNDS_AND_SCHED   _XFER + 0*32
 
-       vpaddd  K256+1*32(SRND), X0, XFER
+       #vpaddd K256+1*32(SRND), X0, XFER
        vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
        FOUR_ROUNDS_AND_SCHED   _XFER + 1*32
 
-       vpaddd  K256+2*32(SRND), X0, XFER
+       #vpaddd K256+2*32(SRND), X0, XFER
        vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
        FOUR_ROUNDS_AND_SCHED   _XFER + 2*32
 
-       vpaddd  K256+3*32(SRND), X0, XFER
+       #vpaddd K256+3*32(SRND), X0, XFER
        vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
        FOUR_ROUNDS_AND_SCHED   _XFER + 3*32
 
@@ -624,12 +624,12 @@ loop1:
        jb      loop1
 
 loop2:
-       ## Do last 16 rounds with no scheduling
-       vpaddd  K256+0*32(SRND), X0, XFER
+       # Do last 16 rounds with no scheduling
+       #vpaddd K256+0*32(SRND), X0, XFER
        vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
        DO_4ROUNDS      _XFER + 0*32
 
-       vpaddd  K256+1*32(SRND), X1, XFER
+       #vpaddd K256+1*32(SRND), X1, XFER
        vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
        DO_4ROUNDS      _XFER + 1*32
        add     $2*32, SRND

loop1 is very different for the other implementations - I don't know why they use SRND here.

@jumbi77
Copy link

jumbi77 commented Nov 13, 2021

Just pinging nicely @tcaputi and/or @AttilaFueloep if they are willing/interested to take a look on this PR. The integration to OpenZFS would be nice.

@cybojanek
Copy link
Owner Author

Just pinging nicely @tcaputi and/or @AttilaFueloep if they are willing/interested to take a look on this PR. The integration to OpenZFS would be nice.

@tcaputi @AttilaFueloep @jumbi77

I could really use some help here:

uint8_t *buffer = vmem_alloc(buffer_size, KM_SLEEP);

The vmem_alloc call fails, and the zfs module fails to load.

I stopped after I couldn't figure out how to printk something in that location

@rincebrain
Copy link

I haven't tried building the branch, but when I do printk debugging in OpenZFS, usually it ends up being something like

#ifdef _KERNEL
printk(KERN_INFO "Meowmeowmeowmeow");
#endif

@cybojanek
Copy link
Owner Author

Thanks!

I was able to fix the segfault and some other minor issues.

In the VM on my laptop, I was able to get the following:
NOTE: aes-ni is missing from the results, because virtualbox doesn't support passing it through...I have to still test in EC2

[vagrant@archlinux zfs]$ cat /proc/spl/kstat/zfs/sha256_bench
4 0 0x01 -1 0 6999258145906 7001015547267
implementation   bytes/second   
fastest          383045519      
generic          286355735      
x86_64           373738479      
sha-avx          383045519      
sha-ssse3        337590387      
[vagrant@archlinux zfs]$ cat /proc/spl/kstat/zfs/sha512_bench
5 0 0x01 -1 0 6999274221510 7008276525174
implementation   bytes/second   
fastest          707075025      
generic          445629113      
x86_64           576709922      
sha-avx          635035490      
sha-avx2         707075025      
sha-ssse3        558050472 

Observations:

@cybojanek
Copy link
Owner Author

cybojanek commented Nov 14, 2021

Observations:

  • sha256-avx is the fastest 365 MiB/s vs 356 for x86_64 (and 390 for openssl)
  • sha512-avx2 is the fastest 674 MiB/s vs 549 for x86_64 (and 634 for openssl)

maybe sha256-avx2 would be a little faster, but I can't get it to work due to assembly/compiler error)

@rincebrain
Copy link

rincebrain commented Nov 14, 2021 via email

@cybojanek
Copy link
Owner Author

6.1.28-3 in archlinux

@rincebrain
Copy link

rincebrain commented Nov 14, 2021

Something seems marginally sad on my Virtualbox VM...

$ cat /proc/spl/kstat/zfs/_HMAC_GENERALsha256_benchsha512_bench
5 0 0x01 -1 0 61521213120966 61537713543510
implementation   bytes/second
fastest          871256558
generic          471557921
x86_64           662196349
sha-avx          443100472
sha-avx2         871256558
sha-ssse3        596165769
$ cat /proc/spl/kstat/zfs/KM_SHA_1_HMAC_GENERALsha256_bench
4 0 0x01 -1 0 61521199413772 61584580376598
implementation   bytes/second
fastest          448432886
generic          340025805
x86_64           421138030
sha-avx          272833337
sha-ssse3        448432886
$

(I'm mostly pointing to the /proc path)

Ah, you meant SHA, not AES, which makes sense. I can't find any documentation one way or another about VB passing them through, though the fact that I don't see them in my guest rather indicates ...

(e: Debian 11/bullseye x86_64, if you wanted to try)

edit again:
When I tried a quick change to fix this, and unloaded the (unmodified from this patch) module, I got:

[63598.786790] VERIFY(list_empty(&kstat_module_list)) failed
[63598.786803] PANIC at spl-kstat.c:713:spl_kstat_fini()
[63598.786811] Showing stack for process 1276126
[63598.786813] CPU: 3 PID: 1276126 Comm: rmmod Kdump: loaded Tainted: P           OE     5.10.0-9-amd64 #1 Debian 5.10.70-1
[63598.786813] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[63598.786814] Call Trace:
[63598.786819]  dump_stack+0x6b/0x83
[63598.786823]  spl_panic+0xd4/0xfc [spl]
[63598.786826]  ? pcpu_free_area+0x228/0x380
[63598.786827]  ? kfree+0x438/0x480
[63598.786829]  ? spl_kmem_cache_destroy+0x1a9/0x480 [spl]
[63598.786831]  ? spl_kmem_cache_destroy+0x1a9/0x480 [spl]
[63598.786833]  ? __check_object_size+0x46/0x150
[63598.786836]  spl_kstat_fini+0x47/0x80 [spl]
[63598.786838]  spl_fini+0xa/0xbb [spl]
[63598.786840]  __do_sys_delete_module+0x190/0x300
[63598.786842]  ? exit_to_user_mode_prepare+0x32/0x120
[63598.786844]  do_syscall_64+0x33/0x80
[63598.786846]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[63598.786847] RIP: 0033:0x7f4ffa5797d7
[63598.786848] Code: 73 01 c3 48 8b 0d b9 f6 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 f6 0b 00 f7 d8 64 89 01 48
[63598.786849] RSP: 002b:00007ffe72383d58 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[63598.786850] RAX: ffffffffffffffda RBX: 00005583bddcc760 RCX: 00007f4ffa5797d7
[63598.786850] RDX: 000000000000000a RSI: 0000000000000800 RDI: 00005583bddcc7c8
[63598.786851] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[63598.786851] R10: 00007f4ffa5ecac0 R11: 0000000000000206 R12: 00007ffe72383f80
[63598.786851] R13: 00007ffe7238515c R14: 00005583bddcc2a0 R15: 00005583bddcc760

@rincebrain
Copy link

You'll enjoy this, I think.

4 0 0x01 -1 0 785958048400 894782453400
implementation   bytes/second
fastest          2148105051
generic          320832236
x86_64           427431925
sha-avx          289878085
sha-ssse3        477276285
sha-ni           2148105051

As the youths say, "vroom".

(From my Ryzen 7 PRO 5850U and an Ubuntu 20.04 VM running under Hyper-V, which does believe in sha_ni .)

I also added these to my local copy to allow removing the module without a VERIFY failure:

diff --git a/module/icp/algs/sha2/sha2.c b/module/icp/algs/sha2/sha2.c
index 31bbb2d2d..5a3a69e32 100644
--- a/module/icp/algs/sha2/sha2.c
+++ b/module/icp/algs/sha2/sha2.c
@@ -263,6 +263,11 @@ void sha2_impl_init(void) {
        alg_impl_init(&sha512_conf_impl);
 }

+void sha2_impl_fini(void) {
+       alg_impl_fini(&sha256_conf_impl);
+       alg_impl_fini(&sha512_conf_impl);
+}
+
 #endif /* _KERNEL */

 static uint8_t PADDING[128] = { 0x80, /* all zeros */ };
diff --git a/module/icp/include/sha2/sha2_impl.h b/module/icp/include/sha2/sha2_impl.h
index 7abd2b11c..8af7b9550 100644
--- a/module/icp/include/sha2/sha2_impl.h
+++ b/module/icp/include/sha2/sha2_impl.h
@@ -61,6 +61,7 @@ typedef struct sha2_hmac_ctx {
  * Initialize sha2 implementations.
  */
 void sha2_impl_init(void);
+void sha2_impl_fini(void);

 #ifdef __cplusplus
 }
diff --git a/module/icp/io/sha2_mod.c b/module/icp/io/sha2_mod.c
index b53e742ea..d4fd3b64e 100644
--- a/module/icp/io/sha2_mod.c
+++ b/module/icp/io/sha2_mod.c
@@ -253,6 +253,10 @@ sha2_mod_fini(void)
                sha2_prov_handle = 0;
        }

+#if defined(_KERNEL)
+       sha2_impl_fini();
+#endif
+

@rincebrain
Copy link

I also meant to ask what the compile issue you were having with the sha256 avx2 branch is.

@cybojanek
Copy link
Owner Author

@rincebrain Thanks for the fini code!

Do you have any pointers on how to add tests for cycling through the sha256 implementations?

I'm unfamiliar with the testing code.

@rincebrain
Copy link

rincebrain commented Nov 14, 2021

@rincebrain Thanks for the fini code!

Do you have any pointers on how to add tests for cycling through the sha256 implementations?

I'm unfamiliar with the testing code.

Are you asking about how to invoke (or extend) ztest, or adding tests to the test suite for this, or all of the above?

For the former, I don't usually run it much, but I'd probably crib notes from how the GH runner does it, and look at things like where ztest sets fletcher4 to cycle, and add similar test functions there.

(You should know before running it that at least on the Github runner, it has a nontrivial failure rate even on tests that literally could not touch any code related to it, so it's probably best to treat any issues in ztest as a weak signal (as far as your code) at the moment.)

For the latter, I don't see anywhere that it uses cycle in there.

@cybojanek
Copy link
Owner Author

I also meant to ask what the compile issue you were having with the sha256 avx2 branch is.

Sorry I missed this message earlier.

You can try to fix it on this branch: https://github.com/cybojanek/zfs/tree/sha256_avx2

To keep track of diffs to linux source asm you can do:

wget -c "https://raw.githubusercontent.com/torvalds/linux/v5.14/arch/x86/crypto/sha256-avx2-asm.S"
diff -u sha256-avx2-asm.S module/icp/asm-x86_64/sha2/sha256_avx2.S

- Add HAVE_SHA compiler define
- Add zfs_sha_available function
- Detect SHA in cpu feature bits

Signed-off-by: Jan Kasiak <[email protected]>
@cybojanek
Copy link
Owner Author

Some performance numbers using an EC2 m6i.xlarge instance

echo x86_64 > /sys/module/icp/parameters/icp_sha256_impl

modprobe brd rd_nr=1 rd_size=$((12288 * 1024))

zpool create -f -o ashift=12 \
    -O acltype=posixacl \
    -O relatime=on \
    -O xattr=sa \
    -O dnodesize=legacy \
    -O normalization=formD \
    -O devices=off \
    -O compression=off \
    -O checksum=sha256 \
    zscratch /dev/ram0

dd if=/dev/urandom of=/zscratch/data.bin bs=1M count=12000 status=progress conv=fdatasync
zpool export zscratch

for X in generic x86_64 sha-avx sha-ssse3 sha-ni; do
        echo $X > /sys/module/icp/parameters/icp_sha256_impl
        sleep 1
        cat /sys/module/icp/parameters/icp_sha256_impl

        zpool import zscratch
        echo ""
        dd if=/zscratch/data.bin of=/dev/null bs=1M status=progress
        zpool export zscratch
done
cycle fastest [generic] x86_64 sha-avx sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 27.6342 s, 433 MB/s

cycle fastest generic [x86_64] sha-avx sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 21.6019 s, 553 MB/s

cycle fastest generic x86_64 [sha-avx] sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 18.1928 s, 657 MB/s

cycle fastest generic x86_64 sha-avx [sha-ssse3] sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 18.7719 s, 637 MB/s

cycle fastest generic x86_64 sha-avx sha-ssse3 [sha-ni]
11952848896 bytes (12 GB, 11 GiB) copied, 6.21747 s, 1.9 GB/s

I also did a similar thing with scrub: zpool export, change algorithm, zpool import, zpool scrub:

x86_64
  scan: scrub repaired 0B in 00:00:21 with 0 errors on Mon Nov 15 01:52:34 2021

sha-avx
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Mon Nov 15 01:53:35 2021

sha-ssse3
  scan: scrub repaired 0B in 00:00:19 with 0 errors on Mon Nov 15 01:54:30 2021

sha-ni
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Mon Nov 15 01:55:03 2021

@cybojanek
Copy link
Owner Author

Closing this in favor of openzfs#12549

@cybojanek cybojanek closed this Nov 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants