Detect SHA extensions #12549

Closed
cybojanek wants to merge 7 commits from the detect_sha_extensions branch

Conversation

cybojanek

@cybojanek cybojanek commented Sep 9, 2021

Detect SHA CPU extensions

Motivation and Context

Detect and use SHA / vector CPU extensions in order to optimize checksum calculations.

Description

  • Add SHA extension detection
  • Add icp algorithm selector
  • Use selector with existing sha2 code
  • Add fast implementations
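As a rough illustration of the first bullet, here is a minimal userspace sketch of SHA extension detection on x86. It assumes GCC or Clang (for `<cpuid.h>` and `__get_cpuid_count`); the helper name `example_sha_ni_available` is hypothetical and is not the `zfs_sha_available` function added by this PR, whose kernel-side checks may differ.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=7,ECX=0):EBX bit 29 reports the SHA extensions (SHA-NI). */
#define	EXAMPLE_CPUID7_EBX_SHA	(1u << 29)

/* Hypothetical helper for illustration only; returns false on non-x86. */
static bool
example_sha_ni_available(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 when leaf 7 is not supported. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return (false);
	return ((ebx & EXAMPLE_CPUID7_EBX_SHA) != 0);
#else
	return (false);
#endif
}
```

A selector would call such a check once at load time and only offer the `sha-ni` implementation when it returns true.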

How Has This Been Tested?

  • Compiled on ArchLinux

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@cybojanek
Author

Hi - I see some of CI is failing.

I don't think these failures are related to my changes, since this code is not even used at runtime.

Please tell me if I should look into the failures.

@jumbi77
Contributor

jumbi77 commented Sep 18, 2021

@cybojanek Awesome, really looking forward to the referenced improvements.

I just want to link some old but never finished PRs related to SHA (not sure if they still apply to current OpenZFS though):

Multi-buffer sha256 support in SPL to ZFS (openzfs/spl#646)
sha256 x86_64 optimization v2 (#2351)

@cybojanek
Author

Thanks for the links to the previous issues.

Just posting here that I'm still working on this issue.

(No ETA - busy with family/work)

@tonynguien
Contributor

Hi - I see some of CI is failing.

I don't think these failures are related to my changes, since this code is not even used at runtime.

Please tell me if I should look into the failures.

Thanks for working on this,

The test failures are known failures, i.e. unrelated to your changes. I'll look into the build failure some more.

@tonynguien tonynguien added the Status: Work in Progress Not yet ready for general review label Sep 30, 2021
@tonynguien
Contributor

tonynguien commented Sep 30, 2021

Thanks for the links to the previous issues.

Just posting here that I'm still working on this issue.

(No ETA - busy with family/work)

I labeled the PR as "Work in Progress" and will update the status once you give the go-ahead.

@cybojanek cybojanek force-pushed the detect_sha_extensions branch from 5f165a6 to 948739b Compare November 15, 2021 01:58
@cybojanek
Author

Some performance numbers using an EC2 m6i.xlarge instance

echo x86_64 > /sys/module/icp/parameters/icp_sha256_impl

modprobe brd rd_nr=1 rd_size=$((12288 * 1024))

zpool create -f -o ashift=12 \
    -O acltype=posixacl \
    -O relatime=on \
    -O xattr=sa \
    -O dnodesize=legacy \
    -O normalization=formD \
    -O devices=off \
    -O compression=off \
    -O checksum=sha256 \
    zscratch /dev/ram0

dd if=/dev/urandom of=/zscratch/data.bin bs=1M count=12000 status=progress conv=fdatasync
zpool export zscratch

for X in generic x86_64 sha-avx sha-ssse3 sha-ni; do
        echo $X > /sys/module/icp/parameters/icp_sha256_impl
        sleep 1
        cat /sys/module/icp/parameters/icp_sha256_impl

        zpool import zscratch
        echo ""
        dd if=/zscratch/data.bin of=/dev/null bs=1M status=progress
        zpool export zscratch
done
cycle fastest [generic] x86_64 sha-avx sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 27.6342 s, 433 MB/s

cycle fastest generic [x86_64] sha-avx sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 21.6019 s, 553 MB/s

cycle fastest generic x86_64 [sha-avx] sha-ssse3 sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 18.1928 s, 657 MB/s

cycle fastest generic x86_64 sha-avx [sha-ssse3] sha-ni
11952848896 bytes (12 GB, 11 GiB) copied, 18.7719 s, 637 MB/s

cycle fastest generic x86_64 sha-avx sha-ssse3 [sha-ni]
11952848896 bytes (12 GB, 11 GiB) copied, 6.21747 s, 1.9 GB/s

I also did a similar thing with scrub: zpool export, change algorithm, zpool import, zpool scrub:

x86_64
  scan: scrub repaired 0B in 00:00:21 with 0 errors on Mon Nov 15 01:52:34 2021

sha-avx
  scan: scrub repaired 0B in 00:00:18 with 0 errors on Mon Nov 15 01:53:35 2021

sha-ssse3
  scan: scrub repaired 0B in 00:00:19 with 0 errors on Mon Nov 15 01:54:30 2021

sha-ni
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Mon Nov 15 01:55:03 2021
root@ip-172-31-40-243:/home/ubuntu/zfs# cat /proc/spl/kstat/zfs/sha256_bench
4 0 0x01 -1 0 1431694018884 3006832037321
implementation   bytes/second   
fastest          1336724300     
generic          239174080      
x86_64           330923150      
sha-avx          326477949      
sha-ssse3        323231027      
sha-ni           1336724300     
root@ip-172-31-40-243:/home/ubuntu/zfs# cat /proc/spl/kstat/zfs/sha512_bench
5 0 0x01 -1 0 1431706970891 3009816103445
implementation   bytes/second   
fastest          570998846      
generic          365103572      
x86_64           493050536      
sha-avx          516670214      
sha-avx2         570998846      
sha-ssse3        474291080      
root@ip-172-31-40-243:/home/ubuntu/zfs# 
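In the kstat output above, the `fastest` row equals the highest measured throughput (sha-ni for SHA-256, sha-avx2 for SHA-512). A minimal sketch of that selection step follows, assuming a simple table of benchmark results; the struct and function names are hypothetical and the PR's internal data structures may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical benchmark row; not the PR's actual structure. */
typedef struct example_impl_bench {
	const char *name;
	int supported;			/* nonzero if this CPU can run it */
	uint64_t bytes_per_sec;		/* measured throughput */
} example_impl_bench_t;

/* Return the index of the fastest supported entry, or -1 if none. */
static int
example_pick_fastest(const example_impl_bench_t *tbl, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].supported)
			continue;
		if (best < 0 ||
		    tbl[i].bytes_per_sec > tbl[best].bytes_per_sec)
			best = (int)i;
	}
	return (best);
}
```

Fed with the SHA-256 numbers above, this picks the sha-ni entry; on a CPU without SHA-NI (entry marked unsupported) it would fall back to the next fastest.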

@cybojanek
Author

@tonynguien Hi! I think this is ready for review.

@rincebrain Helped me fix a few things in the initial review here cybojanek#1

- Add HAVE_SHA compiler define
- Add zfs_sha_available function
- Detect SHA in cpu feature bits

Signed-off-by: Jan Kasiak <[email protected]>
@cybojanek cybojanek force-pushed the detect_sha_extensions branch from 948739b to 733860f Compare November 21, 2021 18:31
@AndyLavr

AndyLavr commented Nov 27, 2021

Hey,

Nov 25 12:41:52 wip kernel: [    8.713282] CFI failure (target: sha256_avx_transform+0x0/0x8 [icp]):
Nov 25 12:41:52 wip kernel: [   13.354139] CFI failure (target: sha256_ssse3_transform+0x0/0x8 [icp]):

Please view:

ICP: Add missing stack frame info to SHA asm files

CFI directives

@cybojanek
Author

Hey,

Nov 25 12:41:52 wip kernel: [    8.713282] CFI failure (target: sha256_avx_transform+0x0/0x8 [icp]):
Nov 25 12:41:52 wip kernel: [   13.354139] CFI failure (target: sha256_ssse3_transform+0x0/0x8 [icp]):

Please view:

ICP: Add missing stack frame info to SHA asm files

CFI directives

How did you see those warning messages - do they just show up when you load the module?

@AndyLavr

AndyLavr commented Nov 28, 2021

How did you see those warning messages - do they just show up when you load the module?

During the boot process, in the dmesg output. I build the Linux kernel with Clang + LTO + CFI, so this is debug output from Control-Flow Integrity (CFI).

[8.084305] ------------[ cut here ]------------
[8.084335] CFI failure (target: sha256_avx_transform+0x0/0x8 [icp]):
[8.084385] WARNING: CPU: 4 PID: 357 at kernel/cfi.c:29 __ubsan_handle_cfi_check_fail+0x31/0x40
[8.084417] Modules linked in: icp(+) zzstd zcommon znvpair spl zlib_deflate amdgpu iommu_v2 gpu_sched drm_ttm_helper ttm i2c_algo_bit drm_kms_helper cec sysimgblt syscopyarea sysfillrect aesni_intel fb_sys_fops crypto_simd psmouse input_leds cryptd serio_raw drm wmi video mac_hid
[8.085710] CPU: 4 PID: 357 Comm: modprobe Tainted: G        W         5.16.0-generic #20211125 6122db310810d441500a2ad55360e9f50df200be
[8.086967] Hardware name: Dell Inc. Precision M6600/04YY4M, BIOS A18 09/14/2018
[8.088225] RIP: 0010:__ubsan_handle_cfi_check_fail+0x31/0x40
[8.089485] Code: 89 f3 48 c7 c7 00 00 05 93 48 c7 c6 ac 05 ee 8c e8 34 2d 56 00 85 c0 75 02 5b c3 48 c7 c7 f3 23 e6 8c 48 89 de e8 5f c2 e1 ff <0f> 0b 5b c3 00 00 cc cc 00 00 cc cc 00 00 cc 0f 1f 44 00 00 c3 00
[8.090823] RSP: 0018:ffffb72d8143ba20 EFLAGS: 00010246
[8.092206] RAX: 9f33b67afe331200 RBX: ffffffffc114f750 RCX: 0000000000000002
[8.093598] RDX: ffffb72d8143b8d0 RSI: 0000000000000004 RDI: 00000000ffffffff
[8.094969] RBP: ffffb72d8143bca8 R08: ffffffff90f90000 R09: 0000000000000000
[8.096318] R10: 00000000ffffdfff R11: 00000000ffffffff R12: ffffffffc114f510
[8.097680] R13: ffffffffc11a8c78 R14: ffff8a3eb4ff0000 R15: ffffffffc114f750
[8.099041] FS:  00007f6d523a4b80(0000) GS:ffff8a415db00000(0000) knlGS:0000000000000000
[8.100411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[8.101786] CR2: 00007ffca8385508 CR3: 0000000169b44004 CR4: 00000000000606e0
[8.103174] Call Trace:
[8.104556]  <TASK>
[8.105934]  sha256_alg_impl_benchmark+0xa4/0xb0 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.107364]  alg_impl_init+0x2b0/0x4b0 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.108777]  ? _raw_spin_unlock+0x12/0x30
[8.110166]  ? kcf_do_notify+0xe1/0x110 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.111579]  ? crypto_register_provider+0x69c/0x6f0 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.112995]  ? crypto_register_provider+0x69c/0x6f0 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.114394]  ? _raw_write_lock+0x13/0x30
[8.115767]  ? _raw_write_unlock+0x12/0x30
[8.117127]  ? proc_register+0x19a/0x1b0
[8.118487]  ? 0xffffffffc1133000
[8.119830]  ? rcu_nmi_exit+0x1f/0x80
[8.121166]  ? rcu_irq_exit_irqson+0x2d/0x60
[8.122496]  ? sha512_ssse3_will_work+0x8/0x8 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.123853]  ? crypto_digest_init+0x10/0x10 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.125174]  sha2_mod_init+0x12/0x60 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.126457]  init_module+0x2f/0x1000 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[8.127715]  do_one_initcall+0xa7/0x260
[8.128952]  do_init_module+0x5a/0x230
[8.130150]  load_module+0x196d/0x1ad0
[8.131306]  ? __x64_sys_rmdir+0x8/0x8
[8.132446]  __x64_sys_finit_module+0xad/0xe0
[8.133588]  do_syscall_64+0x93/0x130
[8.134729]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[8.135871] RIP: 0033:0x7f6d524c794d
[8.137012] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b3 64 0f 00 f7 d8 64 89 01 48
[8.138226] RSP: 002b:00007ffca8389598 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[8.139439] RAX: ffffffffffffffda RBX: 00005622d7f9bab0 RCX: 00007f6d524c794d
[8.140652] RDX: 0000000000000000 RSI: 00005622d5fd9c12 RDI: 0000000000000008
[8.141853] RBP: 0000000000060000 R08: 0000000000000000 R09: 0000000000000002
[8.143063] R10: 0000000000000008 R11: 0000000000000246 R12: 00005622d5fd9c12
[8.144277] R13: 00005622d7fa34a0 R14: 00005622d7fa2c20 R15: 00005622d7f9ca70
[8.145496]  </TASK>
[8.146708] ---[ end trace 302011d136109a85 ]---
[8.148064] ------------[ cut here ]------------
[13.354137] ------------[ cut here ]------------
[13.354139] CFI failure (target: sha256_ssse3_transform+0x0/0x8 [icp]):
[13.354160] WARNING: CPU: 7 PID: 1545 at kernel/cfi.c:29 __ubsan_handle_cfi_check_fail+0x31/0x40
[13.354165] Modules linked in: intel_rapl_msr hid_generic dell_rbtn at24 intel_rapl_common dell_laptop x86_pkg_temp_thermal intel_powerclamp dell_smm_hwmon coretemp dell_wmi sparse_keymap crct10dif_pclmul snd_hda_codec_idt crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel iwldvm ledtrig_audio snd_hda_codec_hdmi rapl dell_smbios mac80211 intel_cstate dcdbas snd_hda_intel libarc4 joydev wmi_bmof snd_intel_dspcfg dell_wmi_descriptor usbhid snd_intel_sdw_acpi iwlwifi i2c_i801 hid snd_hda_codec i2c_smbus sdhci_pci snd_hda_core mei_me cqhci cfg80211 sdhci mei snd_hwdep dell_smo8800 sch_fq tcp_htcp msr parport_pc ppdev parport ip_tables x_tables zfs zlua zunicode zavl icp zzstd zcommon znvpair spl zlib_deflate amdgpu iommu_v2 gpu_sched drm_ttm_helper ttm i2c_algo_bit drm_kms_helper cec sysimgblt syscopyarea sysfillrect aesni_intel fb_sys_fops crypto_simd psmouse input_leds cryptd serio_raw drm wmi video mac_hid
[13.354217] CPU: 7 PID: 1545 Comm: z_null_int Tainted: G        W         5.16.0-generic #20211125 6122db310810d441500a2ad55360e9f50df200be
[13.354220] Hardware name: Dell Inc. Precision M6600/04YY4M, BIOS A18 09/14/2018
[13.354221] RIP: 0010:__ubsan_handle_cfi_check_fail+0x31/0x40
[13.354224] Code: 89 f3 48 c7 c7 00 00 05 93 48 c7 c6 ac 05 ee 8c e8 34 2d 56 00 85 c0 75 02 5b c3 48 c7 c7 f3 23 e6 8c 48 89 de e8 5f c2 e1 ff <0f> 0b 5b c3 00 00 cc cc 00 00 cc cc 00 00 cc 0f 1f 44 00 00 c3 00
[13.354225] RSP: 0018:ffffb72d8f11f930 EFLAGS: 00010246
[13.354227] RAX: 4b2119f55d67eb00 RBX: ffffffffc114f748 RCX: 0000000000000001
[13.354229] RDX: ffffffff8c20a9f6 RSI: ffffffff8cede298 RDI: 00000000ffffffff
[13.354230] RBP: ffffffffc114f748 R08: ffffffff90f90000 R09: 0000000000000000
[13.354231] R10: 00000000ffffdfff R11: 00000000ffffffff R12: 0000000000000700
[13.354232] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000001c000
[13.354234] FS:  0000000000000000(0000) GS:ffff8a415dbc0000(0000) knlGS:0000000000000000
[13.354236] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13.354237] CR2: 000055de9c41a2ac CR3: 0000000262e0c006 CR4: 00000000000606e0
[13.354238] Call Trace:
[13.354240]  <TASK>
[13.354242]  SHA2Update+0x2c6/0x320 [icp a234ac88fc385150d04519757fcc356c823dd90b]
[13.354256]  sha_incremental+0x16/0x20 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354378]  abd_iterate_func+0x18d/0x280 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354476]  ? raidz_mul_abd_cb+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354574]  ? abd_checksum_off+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354671]  abd_checksum_SHA256+0x85/0xf0 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354771]  zio_checksum_error_impl+0x4a4/0x6c0 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354869]  ? resched_curr+0x24/0xf0
[13.354874]  ? ttwu_do_wakeup+0x32/0x1d0
[13.354877]  ? ttwu_queue+0xb6/0x130
[13.354880]  ? vdev_queue_io_to_issue+0x27e/0xbf0 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.354978]  ? dmu_object_set_blocksize+0x10/0x10 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355075]  zio_checksum_error+0x88/0xd0 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355173]  zio_checksum_verify+0x9a/0x190 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355272]  ? zio_vdev_io_done+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355382]  zio_execute+0xc2/0x330 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355544]  ? raidz_syn_pq_abd+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355701]  taskq_thread+0x402/0x6b0 [spl 4890292e51f285e1e5ab30321e8e467563d4ed5f]
[13.355726]  ? zone_get_hostid+0x10/0x10 [spl 4890292e51f285e1e5ab30321e8e467563d4ed5f]
[13.355744]  ? raidz_syn_pq_abd+0x8/0x8 [zfs 5d223ae1b610bdfd982b6b96f693da01eb3e4c18]
[13.355922]  ? crgetgroups+0x10/0x10 [spl 4890292e51f285e1e5ab30321e8e467563d4ed5f]
[13.355941]  kthread+0x1a4/0x1e0
[13.355946]  ? io_wqe_worker+0x8/0x8
[13.355953]  ret_from_fork+0x22/0x30
[13.355959]  </TASK>
[13.355960] ---[ end trace 302011d136109a8f ]---
[13.356592] ------------[ cut here ]------------

@solbjorn
Contributor

solbjorn commented Dec 2, 2021

@AndyLavr @cybojanek, that's because the SHA-{256,512} implementations are being cast:

sha256_impl = (sha256_block_f)(ops->ctx);
sha512_impl = (sha512_block_f)(ops->ctx);

Function/callback casts are not allowed with Clang CFI, as they are treated as attacks. I'd say they should be avoided in general; there's always a way to get strict matches.

(just BTW, I have a commit here in ZFS repo where I fixed all function casts found with ZTS: 23c13c7)
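One way to get such a strict match, sketched under assumptions: store a typed function pointer in the ops table rather than casting a generic `ctx` pointer, and adapt any routine with a different prototype through a thin wrapper. The `sha256_block_f` signature, the struct, and every `example_*` name below are hypothetical; this is not the fix from 23c13c7 and has not been checked against the CFI failure reported above.

```c
#include <stddef.h>
#include <stdint.h>

/* Single prototype that every implementation must match exactly. */
typedef void (*sha256_block_f)(uint32_t state[8], const void *data,
    size_t num_blocks);

/*
 * Stand-in for an assembly routine whose prototype differs from
 * sha256_block_f (uint32_t block count instead of size_t).
 * The body is placeholder work, not real SHA-256.
 */
static void
example_asm_transform(uint32_t *digest, const void *data,
    uint32_t num_blocks)
{
	(void) data;
	digest[0] += num_blocks;
}

/* Thin wrapper with the exact sha256_block_f signature; no cast needed. */
static void
example_transform_wrapper(uint32_t state[8], const void *data,
    size_t num_blocks)
{
	example_asm_transform(state, data, (uint32_t)num_blocks);
}

/* Hypothetical ops table holding a typed pointer instead of a void *ctx. */
typedef struct example_sha256_ops {
	const char *name;
	sha256_block_f transform;
} example_sha256_ops_t;

static const example_sha256_ops_t example_ops = {
	.name = "example",
	.transform = example_transform_wrapper,
};
```

Callers then invoke `example_ops.transform(state, data, n)` directly, so the pointer type at the call site and the target function's type agree.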

@AndyLavr

AndyLavr commented Dec 2, 2021

@solbjorn

(just BTW, I have a commit here in ZFS repo where I fixed all function casts found with ZTS: 23c13c7)

You're right. Thanks! :)

@adamdmoss
Contributor

I gave this branch a spin on my i7-3770 ... I guess it coulda gone better. 🥲

Benchmark numbers look fine, this CPU doesn't have NI but sse/avx are a pretty good win:

/proc/spl/kstat/zfs/sha256_bench:4 0 0x01 -1 0 118,315,404,492 536,989,801,750
/proc/spl/kstat/zfs/sha256_bench:implementation   bytes/second   
/proc/spl/kstat/zfs/sha256_bench:fastest          320,043,365      
/proc/spl/kstat/zfs/sha256_bench:generic          203,498,602      
/proc/spl/kstat/zfs/sha256_bench:x86_64           286,613,247      
/proc/spl/kstat/zfs/sha256_bench:sha-avx          243,893,112      
/proc/spl/kstat/zfs/sha256_bench:sha-ssse3        320,043,365      
/proc/spl/kstat/zfs/sha512_bench:5 0 0x01 -1 0 118,325,357,366 536,989,817,440
/proc/spl/kstat/zfs/sha512_bench:implementation   bytes/second   
/proc/spl/kstat/zfs/sha512_bench:fastest          481,871,280      
/proc/spl/kstat/zfs/sha512_bench:generic          339,002,689      
/proc/spl/kstat/zfs/sha512_bench:x86_64           438,785,950      
/proc/spl/kstat/zfs/sha512_bench:sha-avx          481,871,280      
/proc/spl/kstat/zfs/sha512_bench:sha-ssse3        468,769,935

However, stability is a real problem; while heavily accessing my datasets which use SHA512, I get:

  • intermittent I/O errors on the dataset
  • intermittent crashes in other userspace processes which don't touch these datasets, i.e.:
[  384.071162] traps: python3[62580] trap stack segment ip:517df2 sp:7ffe09d50220 error:0 in python3.8[423000+295000]
[  410.817533] traps: depmod[68016] general protection fault ip:7f674e271870 sp:7fffccefcc10 error:0 in libc-2.31.so[7f674e1f9000+178000]
[  425.750946] traps: sed[71836] general protection fault ip:557320257384 sp:7fff34cf7f40 error:0 in sed[55732024d000+13000]

@adamdmoss
Contributor

FWIW, I'm using a PREEMPT kernel: Linux version 5.8.0-59-lowlatency (buildd@lcy01-amd64-022) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #66~20.04.1-Ubuntu SMP PREEMPT Thu Jun 17 13:03:02 UTC 2021

vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping	: 9
microcode	: 0x21
cpu MHz		: 1706.813
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips	: 6784.41
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual

My amateur guess as to the cause of the problem is that it's the same issue that affected the zfs x64 crypto code; I vaguely recall it's something like, Linux doesn't let non-GPL modules save and restore FPU/spicy regs across context switches, and ZFS' workaround for the crypto code was to explicitly forbid pre-emption for the duration of the crypto code...

@AttilaFueloep
Contributor

My amateur guess as to the cause of the problem is that it's the same issue that affected the zfs x64 crypto code;

Your guess seems right. Since gcc uses SIMD instructions for optimizing plain C code, failing to preserve the FPU state affects all kinds of software, not just programs doing floating-point calculations. So your depmod and python failures line up with your guess.

ZFS' workaround for the crypto code was to explicitly forbid pre-emption for the duration of the crypto code

Close but not quite. GPL modules can tell the kernel to save and restore FPU state on context switches, meaning the overhead is only incurred there. Being CDDL, we can't. On any FPU use we have to disable preemption and save the FPU state, then do the reverse when we're done. This incurs quite a big overhead, making the use of SIMD instructions considerably less effective, especially if there are only a few.

* The non-indented lines are instructions related to the message schedule.
*
* void sha256_ni_transform(uint32_t *digest, const void *data,
uint32_t numBlocks);
Contributor

missing continuation

Author

Not sure what you mean by this.

What do you want it to look like?

This is a comment from the original Intel source.

Contributor

I'm surprised cstyle isn't up in arms about this (okay, less so since it's in a .S):

 * The non-indented
 *
 * void sha256_ni_transform(uint32_t *digest, const void *data,
 *     uint32_t numBlocks);

L90 is missing the block comment continuation and has a double-tab continuation instead (as opposed to four spaces after the comment *)

uint64_t run_count = 0;
uint64_t start, run_time_ns;

kpreempt_disable();
Contributor

@rincebrain rincebrain Jan 8, 2022

I think this (and calls to them in general) need a kfpu_begin()/kfpu_end() wrapper if you're calling out into SIMD/FPU instructions, or you can wind up with problems like @adamdmoss is having. (if you have one and it somehow didn't show up in my grep of the patch, my apologies.)

See, for example, how the GCM code calls it before calling accelerated instructions, or the fletcher or raidz parity code likewise.

Author

K - I added the wrapping in dfa1f42

Should I remove the calls to kpreempt_disable - if I use kfpu_begin instead will that be ok?

I got the kpreempt_disable from module/zcommon/zfs_fletcher.c where they do benchmarking.

Contributor

@rincebrain rincebrain Jan 12, 2022

If you check, the definition of kfpu_begin is basically "kpreempt_disable(); [save FPU state]", so you should be good doing that.

(Yes, I know there are multiple definitions; this is just one example.)
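The wrapping discussed in this thread can be sketched as follows. This is a userspace illustration only: `kfpu_begin()` and `kfpu_end()` are replaced by stand-in stubs that merely track nesting (the real ones disable preemption and save/restore FPU state in the kernel), and every `example_*` name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-ins so the sketch compiles outside the kernel.
 * They only count nesting; they do NOT save any FPU state.
 */
static int example_fpu_depth = 0;
static void kfpu_begin(void) { example_fpu_depth++; }
static void kfpu_end(void) { example_fpu_depth--; }

/* Records whether the transform ran inside a begin/end section. */
static int example_ran_protected = 0;

/* Placeholder for a SIMD transform; the body is not real SHA-256. */
static void
example_simd_transform(uint32_t state[8], const void *data,
    size_t num_blocks)
{
	(void) data;
	example_ran_protected = (example_fpu_depth > 0);
	state[0] += (uint32_t)num_blocks;
}

/*
 * Wrap the whole call so a single save/restore covers every block,
 * rather than paying the cost once per block.
 */
static void
example_sha256_update_blocks(uint32_t state[8], const void *data,
    size_t num_blocks)
{
	kfpu_begin();
	example_simd_transform(state, data, num_blocks);
	kfpu_end();
}
```

Because the begin call already covers disabling preemption (per the comment above), no separate `kpreempt_disable()` appears around the transform in this sketch.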

Author

Ah - now I see it.

I removed the redundant preempt calls from the benchmark portion.

Thanks!

Contributor

@nabijaczleweli nabijaczleweli left a comment

A few notes on (a) const correctness and (b) aggressively consting actually constant data (cf. #12899). Besides the function argument ones, these repeat for the other (512) impls.
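To illustrate the two kinds of notes, here is a small sketch, not taken from the review itself: a lookup table declared `static const` so it can live in read-only data, and a function that takes its read-only input through a pointer to const. The `example_*` names are hypothetical; the table holds the standard SHA-256 initial hash values.

```c
#include <stddef.h>
#include <stdint.h>

/* (b) Actually constant data: static const, never written at runtime. */
static const uint32_t example_sha256_iv[8] = {
	0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
	0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
};

/* (a) Const correctness: the input words are only read. */
static uint32_t
example_xor_words(const uint32_t *words, size_t n)
{
	uint32_t acc = 0;

	for (size_t i = 0; i < n; i++)
		acc ^= words[i];
	return (acc);
}

/* The state is written, so it stays non-const; its source is const. */
static void
example_sha256_init(uint32_t state[8])
{
	for (size_t i = 0; i < 8; i++)
		state[i] = example_sha256_iv[i];
}
```

With these qualifiers, an accidental write to the table or to the input buffer becomes a compile-time error instead of a runtime surprise.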

module/icp/algs/impl/impl.c (outdated, resolved)
module/icp/algs/impl/impl.c (outdated, resolved)
module/icp/algs/sha2/sha2.c (resolved)
module/icp/algs/sha2/sha2.c (outdated, resolved)
module/icp/algs/sha2/sha2.c (outdated, resolved)
@cybojanek cybojanek force-pushed the detect_sha_extensions branch 2 times, most recently from dfa1f42 to 5db5df2 Compare January 12, 2022 04:23
@adamdmoss
Contributor

Stability appears good now, thanks. I'll keep an eye on it.

@RJVB

RJVB commented Jan 17, 2022

Close but not quite. GPL modules can tell the kernel to save and restore FPU state on context switches, meaning the overhead only takes place there. Being CDDL we can't.

On a 5+ kernel that doesn't have the required functions patched back in (like the Liquorix kernel builds), right?

@adamdmoss
Contributor

Is this ready to go or is it still a WIP?

@mcmilk
Contributor

mcmilk commented Apr 24, 2022

When BLAKE3 is in, I would like to add some additional sha256 and sha512 SIMD code.
I am currently working on public domain code for choosing the implementation for Intel x86-64, PPC64 and aarch64 architectures....

@rincebrain
Contributor

Github seems to have blackholed my email reply, but:
If you'd like to go poke around some more sources of implementations (since more is always better, right), Intel has their implementations of various crypto primitives for all sorts of subsets of x86 and I think aarch64 in a few places under BSD-3 and Apache-2 licenses over at https://github.com/intel/intel-ipsec-mb and https://github.com/intel/ipp-crypto

@mcmilk
Contributor

mcmilk commented Apr 25, 2022

Github seems to have blackholed my email reply, but: If you'd like to go poke around some more sources of implementations (since more is always better, right), Intel has their implementations of various crypto primitives for all sorts of subsets of x86 and i think aarch64 in a few places under BSD-3 and Apache-2 licenses over at https://github.com/intel/intel-ipsec-mb and https://github.com/intel/ipp-crypto

Yes, I know these sources, but I searched for public domain code first... and will implement SSE2 CC0 code... which can then be reused on PPC, AARCH64 and so on via SIMDE...
But of course, the Intel AVX2+AVX512 MIT code will be used as well.

@cybojanek cybojanek force-pushed the detect_sha_extensions branch from 5db5df2 to 8969a46 Compare April 26, 2022 00:36
@cybojanek
Author

Glad to see some activity here!

I pushed out a one line change I forgot to push out a while ago.

@jumbi77
Contributor

jumbi77 commented Aug 5, 2022

@cybojanek Could you rebase this and give a status update? Or has this PR been put on hold because of the work mcmilk mentioned here? Many thanks anyway.

@mcmilk
Contributor

mcmilk commented Aug 5, 2022

@jumbi77
I started a RFC pull request - you could give it a try: #13741
But please don't put any important data on it... it's just a beginning... hence the RFC.

@cybojanek
Author

@cybojanek Could you rebase this and give a status update? Or has this PR been put on hold because of the work mcmilk mentioned here? Many thanks anyway.

@mcmilk How does your branch compare to this one?

At a quick glance, it looks like you forked off of master, but also have some new code for generalizing the impl stuff?

Is it only for freebsd? Or also for Linux? Does it do benchmarking?

I don't want to duplicate work nor effort.

@mcmilk
Contributor

mcmilk commented Aug 8, 2022

My branch does the same for BSD and Linux... it's not ready, but I will also try to include the hardware-specific implementations from FreeBSD as well...
The Intel code you used is always a bit slower, so I used OpenSSL. But my branch isn't finished yet. The generic x86-64 and armv4 code needs work... and so does the testing of all the changes....

Edit: the first commit of my branch removes ALL the old SHA2 stuff... and re-implements the generic function with public domain code. The old Sun/Solaris implementation is history then. The generic code is faster and smaller.
But I don't know what the OpenZFS team thinks about this, so I started this RFC first... if they say it is generally okay, then I will fix the remaining issues.

@rincebrain
Contributor

rincebrain commented Oct 11, 2022 via email

@mcmilk
Contributor

mcmilk commented Oct 11, 2022

If you'd like to go poke around some more sources of implementations, Intel has their implementations for all sorts of subsets of x86 and I think aarch64 in a few places under BSD-3 and Apache-2 over at https://github.com/intel/intel-ipsec-mb and https://github.com/intel/ipp-crypto

On Sun, Apr 24, 2022 at 1:03 PM Tino Reichardt wrote: When BLAKE3 (#12918) is in, I would like to add some additional sha256 and sha512 SIMD code. I am currently working (https://github.com/mcmilk/sha2-testing) on public domain code for choosing the implementation for Intel x86-64, PPC64 and aarch64 architectures....

Are these not okay?
sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64)
sha512-x86_64.S: x64, AVX, AVX2 (x86_64)
sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm)
sha512-armv7.S: ARMv7, NEON (arm)
sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64)
sha512-armv8.S: ARMv7, ARMv8-CE (aarch64)
sha256-ppc.S: Generic PPC64 LE/BE (ppc64)
sha512-ppc.S: Generic PPC64 LE/BE (ppc64)
sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)
sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64)

They are all ready and seem to work nicely - #13741.

@mcmilk
Contributor

mcmilk commented Mar 5, 2023

@cybojanek - we can maybe close this pull request?

You can see here: https://github.com/mcmilk/sha2-testing - why I have preferred the openssl variants over the Intel ones.

When you try out the current master branch, you can re-check the benchmarks via cat /proc/spl/kstat/zfs/chksum_bench.

@cybojanek
Author

I'm glad something got merged in.

Looking forward to seeing it propagate to my distro :D

@cybojanek cybojanek closed this Mar 5, 2023
Labels
Status: Work in Progress Not yet ready for general review