
ZOL + encryption - crashes under heavy I/O #9603

Closed
MikeCockrem opened this issue Nov 21, 2019 · 3 comments
Labels
Type: Question Issue for discussion

Comments

@MikeCockrem commented Nov 21, 2019

System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7.7.1908 (Core)
Linux Kernel          3.10.0-1062.4.3.el7.x86_64
Architecture          x86_64
ZFS Version           0.8.2-1
SPL Version           0.8.2-1

Hello. First time posting, so please feel free to point out anything I'm missing. I recently upgraded from 0.7 to 0.8.2. Things had been running flawlessly until I created an encrypted dataset and ran md5deep (md5deep-4.4-1.el7.x86_64) to weed out duplicate files.
This works fine with every one of my datasets except the encrypted one, which consistently crashes the whole server after a few minutes of operation.

For reference, I am running an HP ProLiant MicroServer Gen8 with 16 GB of ECC RAM and a Xeon E3-1260L CPU. Hardware stress tests and memory tests all come back clear, and edac-util reports no errors.

I initially guessed this might be related to issue #9346, but I now believe it is unrelated: I tried
echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
and re-ran the test without success.
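
For reference, the active checksum implementation can be inspected, and the default restored, like so (a quick sketch based on the 0.8.x module parameters; reading the file should list the available implementations with the selected one in brackets):

# Show the available fletcher_4 implementations; the active one is bracketed
~]# cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl

# Restore automatic selection after testing
~]# echo fastest > /sys/module/zcommon/parameters/zfs_fletcher_4_impl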

A screenshot was captured from nmon (frozen) just before the server crashed.

EDIT: I forgot to mention that listing directories from a client (shared out over Samba) completes in the normal time for the non-encrypted datasets, while the encrypted dataset takes 30+ seconds to load. Terrible performance.
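
To separate ZFS-side latency from Samba overhead, one quick check is to time a recursive listing directly on the server, bypassing the share entirely (a rough sketch, using the dataset paths from the reproduction steps below):

# Time a recursive listing on the encrypted dataset, no Samba involved
$ time ls -lR /media/zfspool01/enctest > /dev/null

# Compare against a non-encrypted dataset of similar size
$ time ls -lR /media/zfspool01/non-encrypted-dataset-1 > /dev/null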

Describe how to reproduce the problem

1. Run md5deep on a non-encrypted dataset; it exits normally:
$ md5deep -re /media/zfspool01/non-encrypted-dataset-1 >> /tmp/out.txt

2. Create an encrypted dataset:
~]# zfs create -o encryption=aes-256-gcm -o keyformat=passphrase zfspool01/enctest

3. Run md5deep against files in the encrypted dataset:
$ md5deep -re /media/zfspool01/enctest >> /tmp/out.txt

The system crashes and is forcibly rebooted automatically.

~]# uptime
 16:25:08 up 01 min,  1 user,  load average: 0.00, 0.01, 0.05
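
(Note: after each forced reboot the passphrase-protected dataset has to be unlocked again before step 3 can be repeated; a minimal sketch using the standard ZFS commands, with the dataset name from step 2:)

# Reload the key (prompts for the passphrase) and remount the dataset
~]# zfs load-key zfspool01/enctest
~]# zfs mount zfspool01/enctest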

Include any warning/errors/backtraces from the system logs

# The following is found in the crash-dump dmesg file after the automatic reboot:

~]# tail /var/crash/127.0.0.1-2019-11-21-15\:43\:28/vmcore-dmesg.txt 
[ 1523.527829] Kernel panic - not syncing: 02: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
1. Integrated Management Log (IML)
2. OA Syslog
3. OA Forward Progress Log
4. iLO Event Log
[ 1523.527833] CPU: 0 PID: 9156 Comm: md5deep Kdump: loaded Tainted: P           OE  ------------   3.10.0-1062.4.3.el7.x86_64 #1
[ 1523.527842] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/09/2013
[ 1523.527842] Call Trace:
[ 1523.527851]  [<ffffffffa9579ba4>] dump_stack+0x19/0x1b
[ 1523.527855]  [<ffffffffa9573947>] panic+0xe8/0x21f
[ 1523.527870]  [<ffffffffa8e9b6af>] nmi_panic+0x3f/0x40
[ 1523.527876]  [<ffffffffc04133df>] hpwdt_pretimeout+0x6f/0xb0 [hpwdt]
[ 1523.527880]  [<ffffffffa958493c>] nmi_handle.isra.0+0x8c/0x150
[ 1523.527882]  [<ffffffffa9583c9b>] ? nmi+0xdb/0x158
[ 1523.527886]  [<ffffffffa9584bc4>] do_nmi+0x1c4/0x460
[ 1523.527890]  [<ffffffffa8e30e71>] ? iommu_shutdown_noop+0x1/0x10
[ 1523.527893]  [<ffffffffa9583cc9>] nmi+0x109/0x158

~]# zfs get encryption
NAME               PROPERTY    VALUE        SOURCE
zfspool01          encryption  off          default
zfspool01/files1   encryption  off          default
zfspool01/Files2   encryption  off          default
zfspool01/enctest  encryption  aes-256-gcm  -

~]# zpool status
  pool: zfspool01
 state: ONLINE
  scan: scrub canceled on Thu Nov 21 14:45:00 2019
config:

	NAME          STATE     READ WRITE CKSUM
	zfspool01     ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    ZA15  ONLINE       0     0     0
	    ZA17  ONLINE       0     0     0
	  mirror-1    ONLINE       0     0     0
	    ZA19  ONLINE       0     0     0
	    ZA16  ONLINE       0     0     0

errors: No known data errors

~]# cat /proc/spl/kstat/zfs/fletcher_4_bench 
5 0 0x01 -1 0 5683358186 2599746398022
implementation   native         byteswap       
scalar           4969941969     1183872709     
superscalar      5660964047     3514431530     
superscalar4     4796425319     3369569751     
sse2             7865073405     3730504619     
ssse3            7927111431     7587943841     
fastest          ssse3          ssse3  

@PrivatePuffin (Contributor)

Might this be related to #9583?

@behlendorf (Contributor)

@MikeCockrem the crash here was caused by an NMI from the hpwdt driver, which is designed to restart the system when it detects a problem. You might be able to find additional diagnostic information in the IML log or in the full debug log.

https://www.mjmwired.net/kernel/Documentation/watchdog/hpwdt.txt

There's a good description of the high-level NMI behavior, as it pertains to the system watchdog, at the following link. However, unless you see the watchdog warning in dmesg, that probably wasn't what triggered it. There may be other criteria which force an NMI to be generated, but ZFS won't do so directly. One possibility is that the heavy CPU load from encryption looked like a CPU stall to the hpwdt driver, causing an NMI reset. Unloading the hpwdt module should disable this functionality for testing.

https://access.redhat.com/solutions/1309033
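
For example, to take hpwdt out of the picture while testing (a minimal sketch; the blacklist file name below is arbitrary):

# Temporarily remove the HP watchdog driver
~]# modprobe -r hpwdt

# Confirm it is no longer loaded
~]# lsmod | grep hpwdt

# Optionally keep it from loading at the next boot
~]# echo "blacklist hpwdt" > /etc/modprobe.d/hpwdt-blacklist.conf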

@Ornias1993 no, the issue referenced wouldn't result in an NMI like this.

@behlendorf added the Type: Question (Issue for discussion) label on Nov 21, 2019
@MikeCockrem (Author) commented Nov 23, 2019


Thank you, I will unload this module and run the tests again tonight.
I have found what was probably hanging the system: Wine had created a "Z:" drive entry that is treated as a symbolic link to /, causing md5deep to rummage through /proc, /dev, etc.:

/media/zfspool01/enctest/.Trash-1000/expunged/934939181/mike/.wine/dosdevices/z\:/
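
Stray links like that can be spotted before hashing with find (a quick sketch; -lname matches against the symlink's target):

# List symlinks under the dataset that point at the filesystem root
$ find /media/zfspool01/enctest -type l -lname /

# Or review every symlink and its target
$ find /media/zfspool01/enctest -type l -exec ls -l {} +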

I'm still at a loss as to why Samba performance on the encrypted dataset is so awful, but at least now I see why it was crashing. Thanks all.
