
Reads from ZFS volumes cause system instability when SIMD acceleration is enabled #9346

Closed

aerusso opened this issue Sep 22, 2019 · 88 comments

Labels: Type: Defect  Incorrect behavior (e.g. crash, hang)

@aerusso (Contributor) commented Sep 22, 2019

System information

I'm duplicating Debian bug report 940932. Because of the severity of the report (it claims data corruption), I'm posting it here directly before trying to confirm with the original poster. If this is inappropriate, I apologize; please close this issue.

Distribution Name: Debian
Distribution Version: stable
Linux Kernel: 4.19.67
Architecture: amd64 (Ryzen 5 2600X and Ryzen 5 2600 on X470 GAMING PLUS (MS-7B79), BIOS version 7B79vAC)
ZFS Version: zfs-linux/0.8.1-4~bpo10+1

Describe the problem you're observing

Rounding-error failures in the mprime torture test that go away when
/sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl are set to scalar.

Describe how to reproduce the problem

Quoting the bug report:

Recently I have noticed some instability on one of my machines.
The mprime (https://www.mersenne.org/download/) torture tests would occasionally show errors like

"FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file."

Random commands would occasionally segfault.

While trying to narrow down the problem I have replaced the PSU, RAM and the CPU. Multiple hour long runs of memtest86 did not show any problem.

Finally I was able to narrow down the reads from ZFS volumes as the trigger for the instability.
Scrubbing the volume would cause mprime to error out especially quickly.

As a workaround I switched SIMD acceleration off by writing "scalar" to

/sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl

and that made the system stable again.
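
For reference, a minimal sketch of that runtime workaround as root (the parameter paths are the ones named above; reading the parameters back to confirm the active implementation is an assumption about the usual sysfs behaviour):

echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl

# Optionally confirm what is now selected; the active implementation is
# typically shown in brackets.
cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl
cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl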

Include any warning/errors/backtraces from the system logs

mprime:

FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.
@rincebrain (Contributor) commented Sep 22, 2019

We spent a bit of time going back and forth on IRC about this, and it seems that only the scalar setting makes the problem go away.

@alex-gh commented Sep 23, 2019

An update from the original thread:

A quick update:

I have booted up the Debian live USB on another machine and was able to
reproduce this bug with it.

The machine had the Ryzen 5 2600 CPU (the one I swapped with the machine
I have originally found the problem on).

The Mainboard is: ASUS PRIME B350-PLUS
BIOS Version: 5216

Output of uname -a:
Linux debian 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64
GNU/Linux

Output of zfs --version:
zfs-0.8.1-4~bpo10+1
zfs-kmod-0.8.1-4~bpo10+1

Also here are the steps I'm taking to reproduce the problem:

  • Start mprime for Linux 64-bit
  • Select Torture Test
  • Choose 12 torture test threads in the case of the Ryzen 5 (default setting)
  • Select Test (2) Small FFT
  • Leave all other settings at their defaults
  • Run the test
  • Read data from ZFS by either reading a large file or starting a scrub
    (raidz scrubs are especially effective)

Within a few seconds you should see mprime reporting errors.
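
As a rough sketch, the reproduction loop looks like this (the pool name and file path are illustrative assumptions; mprime -t runs the torture test non-interactively):

./mprime -t &                              # start the torture test in the background
zpool scrub tank                           # generate ZFS read load via a scrub...
dd if=/tank/bigfile of=/dev/null bs=16M    # ...or via a large sequential read
# On affected kernel/ZFS combinations mprime reports rounding errors within seconds.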

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Sep 23, 2019
@behlendorf (Contributor) commented:

@aerusso thank you for bringing this to our attention. The reported symptoms are consistent with what we'd expect if the FPU registers were somehow not being restored. We'll see if we can reproduce the issue locally using the 4.19 kernel and the provided test case. Would it be possible to try and reproduce the issue using a 5.2 or newer kernel?

@rincebrain (Contributor) commented Sep 24, 2019

Horrifyingly, I can reproduce this in a Debian buster VM on my Intel Xeon-D.

I'm going to guess, since reports of this being on fire haven't otherwise trickled in, there might be a mismerge in Debian, or a missing followup patch?

@alex-gh commented Sep 24, 2019

I did a test with a Manjaro live USB and I could not reproduce this behaviour.

Kernel: 5.2.11-1-MANJARO
ZFS package: archzfs/zfs-dkms-git 2019.09.18.r5411.gafc8f0a6f-1

aerusso added a commit to aerusso/zfs that referenced this issue Sep 24, 2019
This is a collection of some of the patches Debian applies to stable.

I am hoping that openzfs#9346 can be triggered by a test here, as that would
both explain why only Debian is able to reproduce the issue and mean that
there are already test cases to catch the error.

Patches included:

2000-increase-default-zcmd-allocation-to-256K.patch
linux-5.0-simd-compat.patch
git_fix_mount_race.patch
Fix-CONFIG_X86_DEBUG_FPU-build-failure.patch
3100-remove-libzfs-module-timeout.patch
@tonyhutter tonyhutter mentioned this issue Sep 24, 2019
@ggzengel (Contributor) commented:

I can reproduce it with kernel 4.19 and stress-ng too.
I get more than 5 errors per minute.

With kernel 5.2 there are no errors.

root# zpool scrub zpool1
root# stress-ng --vecmath 9 --fp-error 9 -vvv --verify --timeout 3600
stress-ng: debug: [20635] 32 processors online, 32 processors configured
stress-ng: info:  [20635] dispatching hogs: 9 vecmath, 9 fp-error
stress-ng: debug: [20635] cache allocate: default cache size: 20480K
<snip>
stress-ng: fail:  [22426] stress-ng-fp-error: exp(DBL_MAX) return was 1.000000 (expected inf), errno=0 (expected 34), excepts=0 (expected 8)
stress-ng: fail:  [22426] stress-ng-fp-error: exp(-1000000.0) return was 1.000000 (expected 0.000000), errno=0 (expected 34), excepts=0 (expected 16)
stress-ng: fail:  [22389] stress-ng-fp-error: log(0.0) return was 51472868343212123638854435100661726861789564087474337372834924821256607581904275443789550923204262543290261262543297927616110435675714711004645013184740565747574812535257726048857959524537318313055909029913182014561534585350486375714439359868335816704.000000 (expected -0.000000), errno=34 (expected 34), excepts=4 (expected 4)
stress-ng: fail:  [22426] stress-ng-fp-error: exp(DBL_MAX) return was 0.000000 (expected inf), errno=0 (expected 34), excepts=8 (expected 8)
stress-ng: fail:  [22407] stress-ng-fp-error: exp(-1000000.0) return was -304425543965041899037761188749362776730427289735837064756329392319501601366578319214648354685850550352787929416219211679117562590779680584744448269412872882932591437212235151179776.000000 (expected 0.000000), errno=0 (expected 34), excepts=16 (expected 16)
stress-ng: fail:  [22397] stress-ng-fp-error: exp(DBL_MAX) return was 1.000315 (expected inf), errno=0 (expected 34), excepts=0 (expected 8)
# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  32
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             2399.755
BogoMIPS:            4800.04
Hypervisor vendor:   Xen
Virtualization type: none
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-31
Flags:               fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault intel_ppin ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms xsaveopt

# uname -a
Linux server2 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64 GNU/Linux

@ThomasLamprecht (Contributor) commented Sep 25, 2019

Can confirm this too on 5.0. It seems that the assumption from the SIMD patch, that with 5.0 and 5.1 kernels preemption and local IRQ disabling is enough, is wrong:

For the 5.0 and 5.1 kernels disabling preemption and local
interrupts is sufficient to allow the FPU to be used. All non-kernel
threads will restore the preserved user FPU state.
-- commit message of commit e5db313

If one looks at the kernel_fpu_{begin,end} methods in the 5.0 kernel, one can see that they save the registers as well. I can fix this issue by doing the same, but my approach was rather cumbersome, as the "copy_kernel_to_xregs_err", "copy_kernel_to_fxregs_err" and "copy_kernel_to_fregs_err" methods are not available, only the variants without "_err"; and since those use the GPL-only symbol "ex_handler_fprestore", I cannot use them here.

So for my POC fix I ensured that kfpu_begin() always saves the fpregs and kfpu_end() always restores them; to do so I just lifted the functionality of those methods from the 5.3 kernel:
(not quite a minimal change, a hacky POC fix to show the issue)

diff --git a/include/linux/simd_x86.h b/include/linux/simd_x86.h
index 5f243e0cc..08504ba92 100644
--- a/include/linux/simd_x86.h
+++ b/include/linux/simd_x86.h
@@ -179,7 +180,6 @@ kfpu_begin(void)
        preempt_disable();
        local_irq_disable();
 
-#if defined(HAVE_KERNEL_TIF_NEED_FPU_LOAD)
        /*
         * The current FPU registers need to be preserved by kfpu_begin()
         * and restored by kfpu_end().  This is required because we can
@@ -188,32 +188,51 @@ kfpu_begin(void)
         * context switch.
         */
        copy_fpregs_to_fpstate(&current->thread.fpu);
-#elif defined(HAVE_KERNEL_FPU_INITIALIZED)
        /*
         * There is no need to preserve and restore the FPU registers.
         * They will always be restored from the task's stored FPU state
         * when switching contexts.
         */
        WARN_ON_ONCE(current->thread.fpu.initialized == 0);
-#endif
 }
+#ifndef kernel_insn_err
+#define kernel_insn_err(insn, output, input...)                                \
+({                                                                     \
+       int err;                                                        \
+       asm volatile("1:" #insn "\n\t"                                  \
+                    "2:\n"                                             \
+                    ".section .fixup,\"ax\"\n"                         \
+                    "3:  movl $-1,%[err]\n"                            \
+                    "    jmp  2b\n"                                    \
+                    ".previous\n"                                      \
+                    _ASM_EXTABLE(1b, 3b)                               \
+                    : [err] "=r" (err), output                         \
+                    : "0"(0), input);                                  \
+       err;                                                            \
+})
+#endif
+
 
 static inline void
 kfpu_end(void)
 {
-#if defined(HAVE_KERNEL_TIF_NEED_FPU_LOAD)
        union fpregs_state *state = &current->thread.fpu.state;
-       int error;
+       int err = 0;
 
        if (use_xsave()) {
-               error = copy_kernel_to_xregs_err(&state->xsave, -1);
+               u32 lmask = -1;
+               u32 hmask = -1;
+               XSTATE_OP(XRSTOR, &state->xsave, lmask, hmask, err);
        } else if (use_fxsr()) {
-               error = copy_kernel_to_fxregs_err(&state->fxsave);
+               struct fxregs_state *fx = &state->fxsave;
+               if (IS_ENABLED(CONFIG_X86_32))
+                       err = kernel_insn_err(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+               else
+                       err = kernel_insn_err(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx));
        } else {
-               error = copy_kernel_to_fregs_err(&state->fsave);
+               copy_kernel_to_fregs(&state->fsave);
        }
-       WARN_ON_ONCE(error);
-#endif
+       WARN_ON_ONCE(err);
 
        local_irq_enable();
        preempt_enable();

Related to the removal of the SIMD patch in the (future) 0.8.2 release: #9161

@shartge commented Sep 25, 2019

With kernel 5.2 there are no errors.

I can reproduce this with mprime -t on Debian Buster running 5.2.9-2~bpo10+1 and zfs-dkms 0.8.1-4~bpo10+1 after ~1 minute of runtime:

[Worker #1 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #6 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #7 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #4 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #8 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #5 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #3 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #2 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #4 Sep 25 13:43] FATAL ERROR: Rounding was 4.029914356e+80, expected less than 0.4
[Worker #4 Sep 25 13:43] Hardware failure detected, consult stress.txt file.
[Worker #4 Sep 25 13:43] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #4 Sep 25 13:43] Worker stopped.
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
Stepping:            2
CPU MHz:             1201.117
CPU max MHz:         3600.0000
CPU min MHz:         1200.0000
BogoMIPS:            6999.89
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            10240K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

@behlendorf (Contributor) commented:

@GamerSource thanks for digging into this; that matches my understanding of the issue. What I don't quite understand yet is why this wasn't observed during the initial patch testing. It may be that it was due to my specific kernel configuration. Regardless, I agree the fix here is going to need to save and restore the registers, similar to the 5.2+ support.

@shartge are you absolutely sure you were running a 5.2-based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.

@shartge commented Sep 25, 2019

@shartge are you absolutely sure you were running a 5.2-based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.

I am 100% sure, as this Kernel 5.2.9-2~bpo10+1 was the only Kernel installed on that system at that moment.

Also the version I copy-pasted was directly from uname -a.

Edit: Interesting bit: I was not able to reproduce this with stress-ng, as @ggzengel was, but mprime triggered it right away.

Edit²: Here is the line stress-ng logged via syslog:

Sep 25 13:35:46 storage-01 stress-ng: system: 'storage-01' Linux 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64

This was 10 minutes before my first comment. I let stress-ng run for ~6 minutes with a scrub running at the same time. When that did not show any failures, I retested with mprime -t at 13:42, which hit the problem immediately at 13:43.

Edit³: I also checked that the hardware is fine, of course. Without ZFS, mprime -t ran for 2 hours without any errors.

@behlendorf (Contributor) commented:

@shartge would you mind checking the dkms build directory to verify that HAVE_KERNEL_TIF_NEED_FPU_LOAD was defined in the zfs_config.h file?

/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1
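
One way to check this on a dkms install (the exact tree layout is an assumption and varies by distribution, so a find is used rather than a fixed path):

# Search the dkms tree for the generated config header and print matching lines.
find /var/lib/dkms/zfs -name zfs_config.h \
  -exec grep -H HAVE_KERNEL_TIF_NEED_FPU_LOAD {} +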

@shartge commented Sep 25, 2019

I will, but it will have to wait until tomorrow, because right now I have reverted the system back to 4.19 and 0.7.2 and I have to wait until the backup window has finished. See #9346 (comment)

@shartge commented Sep 25, 2019

Scratch that: I don't need that specific system to test the build; I can just use any Debian Buster system, for example one of my test VMs.

Using Linux debian-buster 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux and zfs-dkms 0.8.1-4~bpo10+1 I get:

/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1

I am attaching the whole file in case it may be helpful.
zfs_config.h.txt

@ggzengel (Contributor) commented:

@shartge I had to reduce the CPUs to 18 for stress-ng because scrub was pausing while using all 32 CPUs.
I use n/2+2 CPUs because I have a NUMA system with 2 nodes.

@shartge commented Sep 26, 2019

I now did a real test with the VM I used for the compile test in #9346 (comment), and I am able to reproduce the bug very quickly.

Using a 4-disk RAIDZ and dd if=/dev/zero of=testdata.dat bs=16M while running mprime -t at the same time quickly results in:

[Worker #4 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #3 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #2 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Sep 26 07:34] FATAL ERROR: Rounding was 1944592149, expected less than 0.4
[Worker #1 Sep 26 07:34] Hardware failure detected, consult stress.txt file.
[Worker #1 Sep 26 07:34] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #1 Sep 26 07:34] Worker stopped.

CPU for this system is

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Stepping:            0
CPU MHz:             3092.734
BogoMIPS:            6185.46
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm rdseed adx smap xsaveopt arat md_clear flush_l1d arch_capabilities

Kernel and ZFS version can be found in #9346 (comment)
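
For anyone who wants to replicate this in a throwaway VM, a sketch using file-backed vdevs (all names and sizes are illustrative; the reports above used real or virtual disks):

# Create four sparse backing files and a raidz pool on top of them.
truncate -s 4G /var/tmp/zd0 /var/tmp/zd1 /var/tmp/zd2 /var/tmp/zd3
zpool create testpool raidz /var/tmp/zd0 /var/tmp/zd1 /var/tmp/zd2 /var/tmp/zd3

# Generate checksum/raidz work while mprime -t is running:
dd if=/dev/zero of=/testpool/testdata.dat bs=16M count=512
zpool scrub testpool

# Clean up afterwards:
zpool destroy testpool
rm /var/tmp/zd0 /var/tmp/zd1 /var/tmp/zd2 /var/tmp/zd3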

@ggzengel (Contributor) commented:

This is a VMware guest.
Does VMware have special FPU & IRQ handling inside the kernel, or does it have a bug?

@shartge commented Sep 26, 2019

This should not matter, as I can reproduce the same problem on 2 physical systems.

But because both of them are production storage systems, it is easier for me to do this in a VM, as long as it shows the same behaviour.

@ggzengel (Contributor) commented:

The worst thing is that inside KVM the VMs get FPU errors too, even though they don't use ZFS.
I started a Debian live CD inside Proxmox, installed stress-ng, and got a lot of errors when I started a ZFS scrub on the host.

@shartge commented Sep 26, 2019

This does not happen with VMware ESX for me. I've been running mprime -t in my test VM since 07:00 today and have not had a single error.

Only when I have ZFS active and put I/O load on it do the FPU errors start to occur.

The same also happened for me with the two physical systems I used to test this.

@ggzengel (Contributor) commented:

@shartge Are you using ZFS on the VMware host?

@shartge commented Sep 26, 2019

No!

I just quickly created a test VM to test the compilation of the module without needing to use and interrupt my production storage systems.

And I also tried to reproduce this issue in a VM instead of on a physical host, which, as I have shown, I was successful in doing.

But, again: the error is reproducible on normal hardware with 5.2 and 0.8.1. (Using a VM is just more convenient.)

@ggzengel (Contributor) commented Sep 26, 2019

Summary:

  1. This happens only with ZFS 0.8.x.
  2. FPU errors always occur with kernels 4.19 - 5.1.
  3. It shouldn't occur with kernel 5.2, but there are exceptions:
    3.1. @shartge gets FPU errors even with kernel 5.2
    3.2. @alex-gh and I didn't get errors with kernel 5.2
  4. I get FPU errors inside a KVM VM with ZFS 0.8.x and kernel 5.0 running on the host side (Proxmox). There is no ZFS code inside the VM.
  5. The workaround is:
    5.1 run
    echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
    echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
    5.2 for persistence, add
    zfs.zfs_vdev_raidz_impl=scalar zcommon.zfs_fletcher_4_impl=scalar
    to the kernel parameters (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub on Debian, then run update-grub; a sketch follows below)
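
A sketch of that persistence step on Debian (the example GRUB_CMDLINE_LINUX_DEFAULT contents are an assumption; merge the options into whatever is already set rather than overwriting it):

# In /etc/default/grub, extend the existing line, for example:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet zfs.zfs_vdev_raidz_impl=scalar zcommon.zfs_fletcher_4_impl=scalar"
# Then regenerate the bootloader configuration:
update-grub
# The runtime workaround from step 5.1 is still needed until the next reboot.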

@shartge commented Sep 26, 2019

@shartge gets FPU errors even with kernel 5.2 too

Note that this is with the code patched by Debian for both the kernel and ZFS. I have yet to try the vanilla ZFS code with 5.2.

It could very well be that the inclusion of e5db313 by Debian causes this.

@ggzengel (Contributor) commented:

With Buster and 5.2 I don't get the FPU errors, but that system is a Xen dom0: #9346 (comment)

@shartge commented Sep 26, 2019

Who knows what Xen does with the FPU state in the Dom0. It could be a false negative as well.

@ggzengel (Contributor) commented:

Now I have checked it with 5.2 and without Xen. No FPU errors.

# cat /etc/apt/sources.list | grep -vE "^$|^#"
deb http://deb.debian.org/debian/ buster main non-free contrib
deb http://security.debian.org/debian-security buster/updates main contrib non-free
deb http://deb.debian.org/debian/ buster-updates main contrib non-free
deb http://deb.debian.org/debian/ buster-backports main contrib non-free

# dkms status
zfs, 0.8.1, 4.19.0-6-amd64, x86_64: installed
zfs, 0.8.1, 5.2.0-0.bpo.2-amd64, x86_64: installed

# uname -a
Linux xenserver2.donner14.private 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux

# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             2599.803
CPU max MHz:         3200.0000
CPU min MHz:         1200.0000
BogoMIPS:            4799.63
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

# modinfo zfs
filename:       /lib/modules/5.2.0-0.bpo.2-amd64/updates/dkms/zfs.ko
version:        0.8.1-4~bpo10+1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
alias:          devname:zfs
alias:          char-major-10-249
srcversion:     FA9BDA7077DD9A40222C4B8
depends:        spl,znvpair,icp,zlua,zunicode,zcommon,zavl
retpoline:      Y
name:           zfs

# apt list | grep zfs | grep installed
libzfs2linux/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed,automatic]
zfs-dkms/buster-backports,now 0.8.1-4~bpo10+1 all [installed,automatic]
zfs-zed/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed,automatic]
zfsutils-linux/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed]

@shartge commented Oct 7, 2019

But what might be worthwhile is disabling AVX-512 support in Prime95. Then you should see it using "FMA3 FFT" (which means AVX2) and not "AVX-512 FFT". I don't know if it would make a difference, but it seems like a good idea to me when you're using AVX2 code in ZFS.

You can do that by putting "CpuSupportsAVX512F=0" in local.txt (https://www.tomshardware.com/reviews/stress-test-cpu-pc-guide,5461-2.html)

Switching to AVX2/FMA3 FFT for mprime and using "fastest" (i.e. AVX-512) in ZFS also creates errors.

Switching ZFS to AVX2 while keeping mprime also at AVX2 creates errors, too.

And finally, setting ZFS to "ssse3" and keeping mprime at AVX2 still creates errors.

But @Fabian-Gruenbichler was able to reproduce this, so I can finally stop doubting myself.
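
For context, a sketch of how the ZFS side was switched between implementations for these tests (as root; the specific values accepted are an assumption and depend on what the module detected on the CPU):

# Reading the parameter lists the available implementations, with the
# active one typically shown in brackets:
cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl

echo avx2    > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
echo ssse3   > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
echo fastest > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl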

@shartge commented Oct 7, 2019

Interesting observation:

If I keep ZFS at fastest/AVX512 and configure mprime to not use any modern SIMD instructions other than SSE2, I am no longer able to reproduce the problem.

For local.txt:

CpuSupportsAVX512F=0
CpuSupportsAVX2=0
CpuSupportsFMA4=0
CpuSupportsFMA3=0
CpuSupportsAVX=0

And mprime passes all three self tests:


[Worker #1 Oct 7 12:43] Test 1, 3100 Lucas-Lehmer iterations of M21871519 using type-2 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Oct 7 12:44] Test 2, 3100 Lucas-Lehmer in-place iterations of M20971521 using FFT length 1120K, Pass1=448, Pass2=2560, clm=4.
[Worker #1 Oct 7 12:44] Test 3, 3100 Lucas-Lehmer iterations of M20971519 using type-2 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Oct 7 12:45] Test 4, 4000 Lucas-Lehmer in-place iterations of M19922945 using FFT length 1120K, Pass1=448, Pass2=2560, clm=4.
[Worker #1 Oct 7 12:46] Self-test 1120K passed!
[Worker #1 Oct 7 12:46] Test 1, 1600000 Lucas-Lehmer in-place iterations of M83839 using FFT length 4K.
[Worker #1 Oct 7 12:47] Test 2, 1600000 Lucas-Lehmer in-place iterations of M82031 using FFT length 4K.
[Worker #1 Oct 7 12:48] Test 3, 1600000 Lucas-Lehmer in-place iterations of M79745 using FFT length 4K.
[Worker #1 Oct 7 12:48] Test 4, 1600000 Lucas-Lehmer in-place iterations of M77455 using FFT length 4K.
[Worker #1 Oct 7 12:49] Self-test 4K passed!
[Worker #1 Oct 7 12:49] Test 1, 1120000 Lucas-Lehmer in-place iterations of M107519 using FFT length 5K.
[Worker #1 Oct 7 12:50] Test 2, 1120000 Lucas-Lehmer in-place iterations of M106497 using FFT length 5K.
[Worker #1 Oct 7 12:51] Test 3, 1120000 Lucas-Lehmer in-place iterations of M104447 using FFT length 5K.
[Worker #1 Oct 7 12:51] Test 4, 1120000 Lucas-Lehmer in-place iterations of M102401 using FFT length 5K.
[Worker #1 Oct 7 12:52] Self-test 5K passed!

As soon as I enable anything above SSE2, starting with AVX, the errors return.

@vstax commented Oct 7, 2019

If I keep ZFS at fastest/AVX512 and configure mprime to not use any modern SIMD instructions other than SSE2, I am no longer able to reproduce the problem.

In SSE modes the XMM registers are used, which are the lower half of the AVX (YMM) registers (or the lower quarter of the AVX-512 ZMM registers). Since this issue seems to be about saving/restoring registers when switching threads, using only the lower part of a register technically shouldn't change anything. If Prime95 is actually using SSE2 instructions, that is...

But maybe, just maybe (I'm really speculating here), the kernel actually does save/restore the SSE (XMM) registers, so the problem does not appear when Prime95 is only using XMM registers. It would be the upper part of the YMM registers that causes the problem, i.e. only the SSE registers are saved/restored instead of the whole 256-bit AVX ones. I don't know if this is possible :) Just thought I'd share an idea.

EDIT: this could happen if the FXSAVE instruction, which is called explicitly by #9406, works as expected but the kernel's XSAVE path doesn't work or isn't called correctly for some reason.

behlendorf added a commit to behlendorf/zfs that referenced this issue Oct 10, 2019
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to unconditionally save and restore the FPU state.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#9346
behlendorf added a commit to behlendorf/zfs that referenced this issue Oct 10, 2019
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struct.

Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.

This has the additional advantage of allowing us to use the FPU
again in user threads.  So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#9346
behlendorf added a commit to behlendorf/zfs that referenced this issue Oct 10, 2019
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.

Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.

This has the additional advantage of allowing us to use the FPU
again in user threads.  So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#9346
behlendorf added a commit to behlendorf/zfs that referenced this issue Oct 10, 2019
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.

Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.

This has the additional advantage of allowing us to use the FPU
again in user threads.  So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#9346
behlendorf added a commit to behlendorf/zfs that referenced this issue Oct 10, 2019
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.

Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.

This has the additional advantage of allowing us to use the FPU
again in user threads.  So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#9346
behlendorf added a commit that referenced this issue Oct 24, 2019
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struct.

Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.

This has the additional advantage of allowing us to use the FPU
again in user threads.  So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.

Reviewed-by: Fabian Grünbichler <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #9346
Closes #9403
@vorsich-tiger commented:

I'd like to throw in a few words just before any final "fix" is committed and routed forward:
There has been (or, depending on the source, there may still be) a major bug in kernel 5.2 when running KVM VMs.
I think it might be worth re-evaluating whether the currently suggested patch series over-reacts unnecessarily to non-existent ZFS problems on 5.2+ kernels.
I.e. any 5.2 host running a VM that indicates a ZFS bug just might be delivering a false positive.
I guess it is best to re-run any positive-looking test on a VM hosted on a non-5.2 kernel, just to be sure.

https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/
i.e.
https://bugzilla.kernel.org/show_bug.cgi?id=204209
https://lkml.org/lkml/2019/7/17/758
etc.

@Fabian-Gruenbichler (Contributor) commented Oct 25, 2019 via email

@Fabian-Gruenbichler (Contributor) commented Oct 25, 2019 via email

@vorsich-tiger commented:

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

@Fabian-Gruenbichler, I'm not sure you got the central point(s) I wanted to make.
1.
I wanted to get everybody on the same page regarding the fact that not only ZFS might be "disturbing" the SIMD processing subsystem in the kernel; there is a potential that other kernel portions might also be broken. The reference I gave shows that this was actually true.
2.
I am not questioning potentially required ZFS SIMD fixes for kernel versions below 5.2.
3.
It is my impression that the developers took quite some time to establish certain assumptions that should be safe to make for kernels starting with 5.2. Within the initial comments of this issue I see developers' statements which assume ZFS SIMD for 5.2+ is not broken.
I merely wanted to raise awareness that tests indicating the opposite should be re-evaluated with the info from that reddit post in mind, i.e. just maybe ZFS SIMD for 5.2+ is really not broken.

@shartge commented Oct 25, 2019

i.e. just maybe zfs SIMD for 5.2+ is really not broken.

Negative on that. The same problem can be reproduced on non-KVM running baremetal hosts using 5.2+.

@Fabian-Gruenbichler (Contributor) commented:

i.e. just maybe ZFS SIMD for 5.2+ is really not broken.

I did not misunderstand your post. I am one of the devs who triaged this bug initially, analyzed the old code, verified a workaround on our downstream side, and reviewed the now merged fix 😉

see the detailed testing report (on baremetal!) over at
#9406 (comment)

The approach that was used for 5.2 was in theory sound for < 5.2, but not workable for GPL/license reasons. It was broken for 5.2+ though, as was the approach for < 5.2 on < 5.2 kernels. The only thing that really worked was the kernel-only solution (and a combination of the 5.2+ approach with helper backports on < 5.2 kernels).

In other words, it was broken all around, irrespective of other FPU-related breakage in some 5.2 versions.

behlendorf added a commit to behlendorf/zfs that referenced this issue Oct 25, 2019
@behlendorf (Contributor) commented:

PR #9515 contains a 0.8 backport of the fix applied to master.

@shartge commented Oct 28, 2019

PR #9515 contains an 0.8 backport of the fix applied to master.

I will be able to test this on my systems tomorrow GMT morning.

@shartge commented Oct 29, 2019

I've had PR #9515 applied on top of the zfs-0.8-release branch on my test VM and on one physical system, both first running for 4 hours on 5.2.0-bpo from Debian and then another 5 hours on 4.19, also from Debian, and could no longer reproduce #9346.

From my point of view this looks very, very promising.
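
For reference, a generic sketch of how a PR can be applied on top of the zfs-0.8-release branch for this kind of test (standard git/GitHub usage, not necessarily the exact steps used here):

git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-0.8-release
# Fetch the pull request head into a local branch and merge it:
git fetch origin pull/9515/head:pr-9515
git merge pr-9515
# ...then build and install as usual (autotools or dkms).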

jgallag88 pushed a commit to jgallag88/zfs that referenced this issue Dec 6, 2019
@mvrhov commented Dec 24, 2019

Is this by any chance the same bug the Go authors found? https://bugzilla.kernel.org/show_bug.cgi?id=205663#c2

@behlendorf (Contributor) commented:

@mvrhov thanks for pointing out the upstream issue. That wasn't the core issue here, but it may have further confused the situation when trying to debug this.

@Fabian-Gruenbichler (Contributor) commented Jan 2, 2020 via email

stevijo pushed a commit to stevijo/zfs that referenced this issue Jan 12, 2020
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to unconditionally save and restore the FPU state.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#9346
(cherry picked from commit 813fd01)
Signed-off-by: Thomas Lamprecht <[email protected]>
@behlendorf (Contributor) commented:

Closing. The SIMD patches have been included in the 0.8.3 release.
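
A quick way to confirm a system has the fixed release and is back on accelerated implementations (output details are assumptions):

zfs version          # should report 0.8.3 or newer (with a matching zfs-kmod)
cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl
cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl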

Mic92 added a commit to Mic92/nixpkgs that referenced this issue Feb 13, 2020
At the moment we experience bad instabilities with linux 5.3:

openzfs/zfs#9346

as the zfs-native method of disabling the FPU is buggy.

(cherry picked from commit 96097ab)
jackpot51 pushed a commit to pop-os/zfs-linux that referenced this issue Jan 20, 2022
 - 2000-increase-default-zcmd-allocation-to-256K.patch
 - git_fix_mount_race.patch

Remove patch:

 - linux-5.0-simd-compat.patch, which causes #940932 under certain
   conditions. openzfs/zfs#9346

Gbp-Dch: Full