Add SHA256 fast implementations #1

cybojanek · 2021-09-06T18:29:19Z

Motivation and Context

Improves sha256 hash performance by using avx, avx2, ssse3, or sha CPU extensions.

Description

Add SHA extension detection
Add icp algorithm selector
Use selector with existing sha2 code
Add fast implementations

How Has This Been Tested?

NOT YET TESTED

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

TODO

Fix sha256_avx2 R_X86_64_32S relocation errors
Run existing tests
Add new tests for different implementations
Update commit messages

module/icp/algs/impl/impl.c

module/icp/asm-x86_64/sha2/sha256_avx2.S

tests/zfs-tests/tests/functional/checksum/Makefile.am

module/icp/include/sha2/sha2_impl.h

module/icp/include/impl/impl.h

module/icp/algs/sha2/sha2.c

module/icp/asm-x86_64/sha2/sha256_avx2.S

module/icp/algs/impl/impl.c

rincebrain · 2021-09-09T14:14:36Z

A sha256_bench in /proc/spl/kstat/zfs/ for these, like with fletcher_4_bench or vdev_raidz_bench, to compare the performance of the different options (in case, say, you want to see if using only ssse3 saves you on power usage, but you want to know if the performance hit is likely to make it a nonstarter), would be neat.

Thanks for doing this!

cybojanek · 2021-10-17T17:24:41Z

A sha256_bench in /proc/spl/kstat/zfs/ for these, like with fletcher_4_bench or vdev_raidz_bench, to compare the performance of the different options (in case, say, you want to see if using only ssse3 saves you on power usage, but you want to know if the performance hit is likely to make it a nonstarter), would be neat.

Thanks for doing this!

I added the code for this, but still need to:

test manually
add an automated zfs test
decide what to do with the avx2 implementation issue...

cybojanek · 2021-10-17T17:49:36Z

I decided to drop the sha256 avx2 implementation. I somewhat understand what is happening, but I don't know how to fix it:

/usr/bin/ld: /home/cybojanek/zfs/lib/libicp/.libs/libicp.a(sha256_avx2.o): relocation R_X86_64_32S against `.rodata.cst512.K256' can not be used when making a shared object; recompile with -fPIC

These are the problematic instructions:

diff --git a/module/icp/asm-x86_64/sha2/sha256_avx2.S b/module/icp/asm-x86_64/sha2/sha256_avx2.S
index 3d74096dd..946173f8e 100644
--- a/module/icp/asm-x86_64/sha2/sha256_avx2.S
+++ b/module/icp/asm-x86_64/sha2/sha256_avx2.S
@@ -603,19 +603,19 @@ last_block_enter:
 
 .align 16
 loop1:
-       vpaddd  K256+0*32(SRND), X0, XFER
+       #vpaddd K256+0*32(SRND), X0, XFER
        vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
        FOUR_ROUNDS_AND_SCHED   _XFER + 0*32
 
-       vpaddd  K256+1*32(SRND), X0, XFER
+       #vpaddd K256+1*32(SRND), X0, XFER
        vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
        FOUR_ROUNDS_AND_SCHED   _XFER + 1*32
 
-       vpaddd  K256+2*32(SRND), X0, XFER
+       #vpaddd K256+2*32(SRND), X0, XFER
        vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
        FOUR_ROUNDS_AND_SCHED   _XFER + 2*32
 
-       vpaddd  K256+3*32(SRND), X0, XFER
+       #vpaddd K256+3*32(SRND), X0, XFER
        vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
        FOUR_ROUNDS_AND_SCHED   _XFER + 3*32
 
@@ -624,12 +624,12 @@ loop1:
        jb      loop1
 
 loop2:
-       ## Do last 16 rounds with no scheduling
-       vpaddd  K256+0*32(SRND), X0, XFER
+       # Do last 16 rounds with no scheduling
+       #vpaddd K256+0*32(SRND), X0, XFER
        vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
        DO_4ROUNDS      _XFER + 0*32
 
-       vpaddd  K256+1*32(SRND), X1, XFER
+       #vpaddd K256+1*32(SRND), X1, XFER
        vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
        DO_4ROUNDS      _XFER + 1*32
        add     $2*32, SRND

loop1 is very different for the other implementations - I don't know why they use SRND here.

jumbi77 · 2021-11-13T08:26:07Z

Just pinging nicely @tcaputi and/or @AttilaFueloep if they are willing/interested to take a look on this PR. The integration to OpenZFS would be nice.

cybojanek · 2021-11-13T19:46:52Z

Just pinging nicely @tcaputi and/or @AttilaFueloep if they are willing/interested to take a look on this PR. The integration to OpenZFS would be nice.

@tcaputi @AttilaFueloep @jumbi77

I could really use some help here:

zfs/module/icp/algs/impl/impl.c

Line 132 in 89e103b

uint8_t *buffer = vmem_alloc(buffer_size, KM_SLEEP);

The vmem_alloc call fails, and the zfs module fails to load.

I stopped after I couldn't figure out how to printk something in that location

rincebrain · 2021-11-13T19:52:45Z

I haven't tried building the branch, but when I do printk debugging in OpenZFS, usually it ends up being something like

#ifdef _KERNEL
printk(KERN_INFO "Meowmeowmeowmeow");
#endif

cybojanek · 2021-11-14T02:23:11Z

Thanks!

I was able to fix the segfault and some other minor issues.

In the VM on my laptop, I was able to get the following:
NOTE: aes-ni is missing from the results, because virtualbox doesn't support passing it through...I have to still test in EC2

[vagrant@archlinux zfs]$ cat /proc/spl/kstat/zfs/sha256_bench
4 0 0x01 -1 0 6999258145906 7001015547267
implementation   bytes/second   
fastest          383045519      
generic          286355735      
x86_64           373738479      
sha-avx          383045519      
sha-ssse3        337590387      
[vagrant@archlinux zfs]$ cat /proc/spl/kstat/zfs/sha512_bench
5 0 0x01 -1 0 6999274221510 7008276525174
implementation   bytes/second   
fastest          707075025      
generic          445629113      
x86_64           576709922      
sha-avx          635035490      
sha-avx2         707075025      
sha-ssse3        558050472

Observations:

cybojanek · 2021-11-14T02:24:21Z

Observations:

sha256-avx is the fastest 365 MiB/s vs 356 for x86_64 (and 390 for openssl)
sha512-avx2 is the fastest 674 MiB/s vs 549 for x86_64 (and 634 for openssl)

maybe sha256-avx2 would be a little faster, but I can't get it to work due to assembly/compiler error)

rincebrain · 2021-11-14T02:25:57Z

VB added support for passing AESNI through in 5.0...how old a version are you testing with? - Rich

…

On Sat, Nov 13, 2021 at 9:23 PM cybojanek ***@***.***> wrote: Thanks! I was able to fix the segfault and some other minor issues. In the VM on my laptop, I was able to get the following: NOTE: aes-ni is missing from the results, because virtualbox doesn't support passing it through...I have to still test in EC2 ***@***.*** zfs]$ cat /proc/spl/kstat/zfs/sha256_bench 4 0 0x01 -1 0 6999258145906 7001015547267 implementation bytes/second fastest 383045519 generic 286355735 x86_64 373738479 sha-avx 383045519 sha-ssse3 337590387 ***@***.*** zfs]$ cat /proc/spl/kstat/zfs/sha512_bench 5 0 0x01 -1 0 6999274221510 7008276525174 implementation bytes/second fastest 707075025 generic 445629113 x86_64 576709922 sha-avx 635035490 sha-avx2 707075025 sha-ssse3 558050472 Observations: — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

cybojanek · 2021-11-14T02:36:47Z

6.1.28-3 in archlinux

rincebrain · 2021-11-14T02:44:28Z

Something seems marginally sad on my Virtualbox VM...

$ cat /proc/spl/kstat/zfs/_HMAC_GENERALsha256_benchsha512_bench
5 0 0x01 -1 0 61521213120966 61537713543510
implementation   bytes/second
fastest          871256558
generic          471557921
x86_64           662196349
sha-avx          443100472
sha-avx2         871256558
sha-ssse3        596165769
$ cat /proc/spl/kstat/zfs/KM_SHA_1_HMAC_GENERALsha256_bench
4 0 0x01 -1 0 61521199413772 61584580376598
implementation   bytes/second
fastest          448432886
generic          340025805
x86_64           421138030
sha-avx          272833337
sha-ssse3        448432886
$

(I'm mostly pointing to the /proc path)

Ah, you meant SHA, not AES, which makes sense. I can't find any documentation one way or another about VB passing them through, though the fact that I don't see them in my guest rather indicates ...

(e: Debian 11/bullseye x86_64, if you wanted to try)

edit again:
When I tried a quick change to fix this, and unloaded the (unmodified from this patch) module, I got:

[63598.786790] VERIFY(list_empty(&kstat_module_list)) failed
[63598.786803] PANIC at spl-kstat.c:713:spl_kstat_fini()
[63598.786811] Showing stack for process 1276126
[63598.786813] CPU: 3 PID: 1276126 Comm: rmmod Kdump: loaded Tainted: P           OE     5.10.0-9-amd64 #1 Debian 5.10.70-1
[63598.786813] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[63598.786814] Call Trace:
[63598.786819]  dump_stack+0x6b/0x83
[63598.786823]  spl_panic+0xd4/0xfc [spl]
[63598.786826]  ? pcpu_free_area+0x228/0x380
[63598.786827]  ? kfree+0x438/0x480
[63598.786829]  ? spl_kmem_cache_destroy+0x1a9/0x480 [spl]
[63598.786831]  ? spl_kmem_cache_destroy+0x1a9/0x480 [spl]
[63598.786833]  ? __check_object_size+0x46/0x150
[63598.786836]  spl_kstat_fini+0x47/0x80 [spl]
[63598.786838]  spl_fini+0xa/0xbb [spl]
[63598.786840]  __do_sys_delete_module+0x190/0x300
[63598.786842]  ? exit_to_user_mode_prepare+0x32/0x120
[63598.786844]  do_syscall_64+0x33/0x80
[63598.786846]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[63598.786847] RIP: 0033:0x7f4ffa5797d7
[63598.786848] Code: 73 01 c3 48 8b 0d b9 f6 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 f6 0b 00 f7 d8 64 89 01 48
[63598.786849] RSP: 002b:00007ffe72383d58 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[63598.786850] RAX: ffffffffffffffda RBX: 00005583bddcc760 RCX: 00007f4ffa5797d7
[63598.786850] RDX: 000000000000000a RSI: 0000000000000800 RDI: 00005583bddcc7c8
[63598.786851] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[63598.786851] R10: 00007f4ffa5ecac0 R11: 0000000000000206 R12: 00007ffe72383f80
[63598.786851] R13: 00007ffe7238515c R14: 00005583bddcc2a0 R15: 00005583bddcc760

module/icp/algs/impl/impl.c

rincebrain · 2021-11-14T12:24:13Z

You'll enjoy this, I think.

4 0 0x01 -1 0 785958048400 894782453400
implementation   bytes/second
fastest          2148105051
generic          320832236
x86_64           427431925
sha-avx          289878085
sha-ssse3        477276285
sha-ni           2148105051

As the youths say, "vroom".

(From my Ryzen 7 PRO 5850U and an Ubuntu 20.04 VM running under Hyper-V, which does believe in sha_ni .)

I also added these to my local copy to allow removing the module without a VERIFY failure:

diff --git a/module/icp/algs/sha2/sha2.c b/module/icp/algs/sha2/sha2.c
index 31bbb2d2d..5a3a69e32 100644
--- a/module/icp/algs/sha2/sha2.c
+++ b/module/icp/algs/sha2/sha2.c
@@ -263,6 +263,11 @@ void sha2_impl_init(void) {
        alg_impl_init(&sha512_conf_impl);
 }

+void sha2_impl_fini(void) {
+       alg_impl_fini(&sha256_conf_impl);
+       alg_impl_fini(&sha512_conf_impl);
+}
+
 #endif /* _KERNEL */

 static uint8_t PADDING[128] = { 0x80, /* all zeros */ };
diff --git a/module/icp/include/sha2/sha2_impl.h b/module/icp/include/sha2/sha2_impl.h
index 7abd2b11c..8af7b9550 100644
--- a/module/icp/include/sha2/sha2_impl.h
+++ b/module/icp/include/sha2/sha2_impl.h
@@ -61,6 +61,7 @@ typedef struct sha2_hmac_ctx {
  * Initialize sha2 implementations.
  */
 void sha2_impl_init(void);
+void sha2_impl_fini(void);

 #ifdef __cplusplus
 }
diff --git a/module/icp/io/sha2_mod.c b/module/icp/io/sha2_mod.c
index b53e742ea..d4fd3b64e 100644
--- a/module/icp/io/sha2_mod.c
+++ b/module/icp/io/sha2_mod.c
@@ -253,6 +253,10 @@ sha2_mod_fini(void)
                sha2_prov_handle = 0;
        }

+#if defined(_KERNEL)
+       sha2_impl_fini();
+#endif
+

rincebrain · 2021-11-14T17:29:17Z

I also meant to ask what the compile issue you were having with the sha256 avx2 branch is.

cybojanek · 2021-11-14T17:29:39Z

@rincebrain Thanks for the fini code!

Do you have any pointers on how to add tests for cycling through the sha256 implementations?

I'm unfamiliar with the testing code.

rincebrain · 2021-11-14T17:34:58Z

@rincebrain Thanks for the fini code!

Do you have any pointers on how to add tests for cycling through the sha256 implementations?

I'm unfamiliar with the testing code.

Are you asking about how to invoke (or extend) ztest, or adding tests to the test suite for this, or all of the above?

For the former, I don't usually run it much, but I'd probably crib notes from how the GH runner does it, and look at things like where ztest sets fletcher4 to cycle, and add similar test functions there.

(You should know before running it that at least on the Github runner, it has a nontrivial failure rate even on tests that literally could not touch any code related to it, so it's probably best to treat any issues in ztest as a weak signal (as far as your code) at the moment.)

For the latter, I don't see anywhere that it uses cycle in there.

cybojanek · 2021-11-14T22:17:07Z

I also meant to ask what the compile issue you were having with the sha256 avx2 branch is.

Sorry I missed this message earlier.

You can try to fix it on this branch: https://github.com/cybojanek/zfs/tree/sha256_avx2

To keep track of diffs to linux source asm you can do:

wget -c "https://raw.githubusercontent.com/torvalds/linux/v5.14/arch/x86/crypto/sha256-avx2-asm.S"
diff -u sha256-avx2-asm.S module/icp/asm-x86_64/sha2/sha256_avx2.S

- Add HAVE_SHA compiler define - Add zfs_sha_available function - Detect SHA in cpu feature bits Signed-off-by: Jan Kasiak <[email protected]>

Signed-off-by: Jan Kasiak <[email protected]>

cybojanek · 2021-11-15T01:57:10Z

Some performance numbers using an EC2 m6i.xlarge instance

echo x86_64 > /sys/module/icp/parameters/icp_sha256_impl

modprobe brd rd_nr=1 rd_size=$((12288 * 1024))

zpool create -f -o ashift=12 \
    -O acltype=posixacl \
    -O relatime=on \
    -O xattr=sa \
    -O dnodesize=legacy \
    -O normalization=formD \
    -O devices=off \
    -O compression=off \
    -O checksum=sha256 \
    zscratch /dev/ram0

dd if=/dev/urandom of=/zscratch/data.bin bs=1M count=12000 status=progress conv=fdatasync
zpool export zscratch

for X in generic x86_64 sha-avx sha-ssse3 sha-ni; do
        echo $X > /sys/module/icp/parameters/icp_sha256_impl
        sleep 1
        cat /sys/module/icp/parameters/icp_sha256_impl

        zpool import zscratch
        echo ""
        dd if=/zscratch/data.bin of=/dev/null bs=1M status=progress
        zpool export zscratch
done

cycle fastest [generic] x86_64 sha-avx sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 27.6342 s, 433 MB/s

cycle fastest generic [x86_64] sha-avx sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 21.6019 s, 553 MB/s

cycle fastest generic x86_64 [sha-avx] sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 18.1928 s, 657 MB/s

cycle fastest generic x86_64 sha-avx [sha-ssse3] sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 18.7719 s, 637 MB/s

cycle fastest generic x86_64 sha-avx sha-ssse3 [sha-ni]
11952848896 bytes (12 GB, 11 GiB) copied, 6.21747 s, 1.9 GB/s

I also did a similar thing with scrub: zpool export, change algorithm, zpool import, zpool scrub:

x86_64
  scan: scrub repaired 0B in 00:00:21 with 0 errors on Mon Nov 15 01:52:34 2021

sha-avx
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Mon Nov 15 01:53:35 2021

sha-ssse3
  scan: scrub repaired 0B in 00:00:19 with 0 errors on Mon Nov 15 01:54:30 2021

sha-ni
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Mon Nov 15 01:55:03 2021

cybojanek · 2021-11-21T17:59:41Z

Closing this in favor of openzfs#12549

cybojanek force-pushed the sha_extensions branch from f4ad482 to 4b950c2 Compare September 6, 2021 18:31

cybojanek commented Sep 6, 2021

View reviewed changes

cybojanek force-pushed the sha_extensions branch from 4b950c2 to 55ff508 Compare September 6, 2021 18:59

cybojanek commented Sep 6, 2021

View reviewed changes

module/icp/asm-x86_64/sha2/sha256_avx2.S Outdated Show resolved Hide resolved

cybojanek commented Sep 6, 2021

View reviewed changes

module/icp/algs/impl/impl.c Show resolved Hide resolved

cybojanek force-pushed the sha_extensions branch from 55ff508 to 3c53081 Compare September 8, 2021 12:56

cybojanek mentioned this pull request Sep 9, 2021

Detect SHA extensions openzfs/zfs#12549

Closed

13 tasks

cybojanek force-pushed the sha_extensions branch from 3c53081 to c2a6d9a Compare September 13, 2021 01:00

cybojanek force-pushed the sha_extensions branch from c2a6d9a to 40ea8ee Compare October 17, 2021 17:23

cybojanek closed this Oct 17, 2021

cybojanek reopened this Oct 17, 2021

cybojanek force-pushed the sha_extensions branch from 40ea8ee to 89e103b Compare October 17, 2021 17:48

cybojanek force-pushed the sha_extensions branch from 89e103b to 14831a8 Compare November 14, 2021 02:19

rincebrain reviewed Nov 14, 2021

View reviewed changes

module/icp/algs/impl/impl.c Outdated Show resolved Hide resolved

cybojanek force-pushed the sha_extensions branch from 14831a8 to 3f0059d Compare November 14, 2021 17:27

cybojanek force-pushed the sha_extensions branch from 3f0059d to d2df809 Compare November 15, 2021 01:20

cybojanek added 4 commits November 14, 2021 20:21

Detect SHA extensions

8d2f674

- Add HAVE_SHA compiler define - Add zfs_sha_available function - Detect SHA in cpu feature bits Signed-off-by: Jan Kasiak <[email protected]>

Add icp algorithm implementation selector

04581cb

Signed-off-by: Jan Kasiak <[email protected]>

Initialize sha2 module with existing implementations

ed2c935

Signed-off-by: Jan Kasiak <[email protected]>

Add SHA256 and SHA512 fast implementations

948739b

Signed-off-by: Jan Kasiak <[email protected]>

cybojanek force-pushed the sha_extensions branch from d2df809 to 948739b Compare November 15, 2021 01:58

cybojanek closed this Nov 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SHA256 fast implementations #1

Add SHA256 fast implementations #1

cybojanek commented Sep 6, 2021 •

edited

Loading

rincebrain commented Sep 9, 2021

cybojanek commented Oct 17, 2021

cybojanek commented Oct 17, 2021

jumbi77 commented Nov 13, 2021

cybojanek commented Nov 13, 2021

rincebrain commented Nov 13, 2021

cybojanek commented Nov 14, 2021

cybojanek commented Nov 14, 2021 •

edited

Loading

rincebrain commented Nov 14, 2021 via email

cybojanek commented Nov 14, 2021

rincebrain commented Nov 14, 2021 •

edited

Loading

rincebrain commented Nov 14, 2021

rincebrain commented Nov 14, 2021

cybojanek commented Nov 14, 2021

rincebrain commented Nov 14, 2021 •

edited

Loading

cybojanek commented Nov 14, 2021

cybojanek commented Nov 15, 2021

cybojanek commented Nov 21, 2021

Add SHA256 fast implementations #1

Add SHA256 fast implementations #1

Conversation

cybojanek commented Sep 6, 2021 • edited Loading

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

TODO

rincebrain commented Sep 9, 2021

cybojanek commented Oct 17, 2021

cybojanek commented Oct 17, 2021

jumbi77 commented Nov 13, 2021

cybojanek commented Nov 13, 2021

rincebrain commented Nov 13, 2021

cybojanek commented Nov 14, 2021

cybojanek commented Nov 14, 2021 • edited Loading

rincebrain commented Nov 14, 2021 via email

cybojanek commented Nov 14, 2021

rincebrain commented Nov 14, 2021 • edited Loading

rincebrain commented Nov 14, 2021

rincebrain commented Nov 14, 2021

cybojanek commented Nov 14, 2021

rincebrain commented Nov 14, 2021 • edited Loading

cybojanek commented Nov 14, 2021

cybojanek commented Nov 15, 2021

cybojanek commented Nov 21, 2021

cybojanek commented Sep 6, 2021 •

edited

Loading

cybojanek commented Nov 14, 2021 •

edited

Loading

rincebrain commented Nov 14, 2021 •

edited

Loading

rincebrain commented Nov 14, 2021 •

edited

Loading