list_del corruption with latest Centos 7 kernel and 0.7.13-1 #9068

Closed
DrDaveD opened this issue Jul 19, 2019 · 7 comments
Labels: Status: Stale (No recent activity for issue), Type: Defect (Incorrect behavior, e.g. crash, hang)


DrDaveD commented Jul 19, 2019

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | CentOS |
| Distribution Version | 7.6.1810 |
| Linux Kernel | 3.10.0-957.21.3.el7 |
| Architecture | x86_64 |
| ZFS Version | 0.7.13-1.el7_6 |
| SPL Version | 0.7.13-1.el7_6 |

Describe the problem you're observing

Kernel panic with the message:

list_del corruption, ffff9a428c161b00->next is LIST_POISON1 (dead000000000100)

This is the same message I reported with a previous kernel and 0.7.11-1 in #7933. It was supposed to have been fixed by #8005 in 0.7.12-1. This time the system didn't completely crash, but it spewed the same message over and over, making systemd-journald compute-bound, slowing interactive response, and stopping all activity on the zfs volume.

Describe how to reproduce the problem

Unfortunately I don't have a recipe to reproduce it. We were doing heavy write activity to the volume from parallel processes, and it ran for several hours before hitting the panic. I haven't seen it fail again after several hours with reduced write activity, so I'm ramping the load back up.

We had been running this system since last October on the older kernel and 0.7.9 without a problem. We recently got an identical new system and went through the same process of repopulating all the data (22T of it, mostly small files, which takes about a day and a half) on a slightly older kernel, 3.10.0-957.21.2.el7, and the same zfs/spl version, 0.7.13-1.el7, without a problem.

Include any warning/errors/backtraces from the system logs

Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: WARNING: CPU: 12 PID: 25539 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: list_del corruption, ffff9a428c161b00->next is LIST_POISON1 (dead000000000100)
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: Modules linked in: zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) ip6table_nat nf_nat_ipv6 xt_REDIRECT nf_nat_redirect xt_multiport xt_comment 8021q garp mrp nfs lockd grace sunrpc fscache ip6table_raw iptable_raw iptable_security iptable_mangle iptable_nat nf_nat_ipv4 nf_nat ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables ip_set nfnetlink bridge stp llc sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt iTCO_vendor_support kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev i2c_i801 lpc_ich mei_me mei ioatdma ses enclosure
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: mxm_wmi sg ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter acpi_pad xfs libcrc32c raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci ixgbe libahci nvme crct10dif_pclmul crct10dif_common mpt3sas crc32c_intel libata nvme_core mdio drm_panel_orientation_quirks ptp raid_class pps_core scsi_transport_sas dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: CPU: 12 PID: 25539 Comm: dp_sync_taskq Kdump: loaded Tainted: P           OE  ------------   3.10.0-957.21.3.el7.x86_64 #1
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: Hardware name: Supermicro SSG-6048R-E1CR36L/X10DRH-iT, BIOS 3.1a 04/09/2019
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: Call Trace:
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f763107>] dump_stack+0x19/0x1b
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f097768>] __warn+0xd8/0x100
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f0977ef>] warn_slowpath_fmt+0x5f/0x80
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f3952a3>] __list_del_entry+0x63/0xd0
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f39531d>] list_del+0xd/0x30
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffffc1275545>] multilist_sublist_remove+0x15/0x20 [zfs]
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffffc123e08f>] userquota_updates_task+0xff/0x5b0 [zfs]
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f0c2e9b>] ? autoremove_wake_function+0x2b/0x40
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffffc123b3c0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffffc123b3c0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffffc0a84d7c>] taskq_thread+0x2ac/0x4f0 [spl]
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f0d6b60>] ? wake_up_state+0x20/0x20
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffffc0a84ad0>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f0c1da1>] kthread+0xd1/0xe0
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f0c1cd0>] ? insert_kthread_work+0x40/0x40
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f775c37>] ret_from_fork_nospec_begin+0x21/0x21
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: [<ffffffff9f0c1cd0>] ? insert_kthread_work+0x40/0x40
Jul 18 19:52:54 hcc-cvmfs.unl.edu kernel: ---[ end trace 40e6011d64bb1859 ]---
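
For context on what the warning means: with CONFIG_DEBUG_LIST enabled, the kernel's list_del() poisons a removed entry by setting ->next to LIST_POISON1 (0xdead000000000100) and ->prev to LIST_POISON2, so seeing LIST_POISON1 here suggests the same multilist node was handed to list_del() a second time (or its memory was reused after it had already been unlinked). Below is a minimal user-space sketch of that mechanism only; it is not the actual kernel or ZFS code, and the list_del_checked() helper is made up for illustration.

```c
/* Illustrative user-space model of the kernel's list_del poisoning check.
 * The LIST_POISON1/2 values match the x86_64 kernel ones; everything else
 * is a simplified stand-in, not the real implementation. */
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_POISON1 ((struct list_head *)0xdead000000000100UL)
#define LIST_POISON2 ((struct list_head *)0xdead000000000200UL)

static void list_del_checked(struct list_head *entry)
{
    if (entry->next == LIST_POISON1) {
        /* This is the condition behind the log line in this issue. */
        printf("list_del corruption, %p->next is LIST_POISON1\n", (void *)entry);
        return;
    }
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = LIST_POISON1;   /* poison so a second delete is caught */
    entry->prev = LIST_POISON2;
}

int main(void)
{
    struct list_head head, node;

    head.next = head.prev = &head;

    /* insert node after head */
    node.next = head.next;
    node.prev = &head;
    head.next->prev = &node;
    head.next = &node;

    list_del_checked(&node);   /* fine */
    list_del_checked(&node);   /* double delete -> warning, like the trace above */
    return 0;
}
```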

DrDaveD commented Jul 22, 2019

After a reboot we were able to repopulate the data entirely and it didn't crash, so this is not very reproducible.

In addition, it turns out that at the same time we wiped and reinstalled everything on this machine last week, the system administrator also decided to try enabling hyperthreading. This was not the case previously (confirmed by the monitoring history), nor is it on the new machine, so we suspect that hyperthreading was somehow involved in the crash. We do not have logs going back far enough to tell whether hyperthreading was enabled last year when we had crashes; it's possible, but we have neither a recollection nor proof of it. We are now disabling hyperthreading. The machines each have 16 physical cores.
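
In case it helps anyone else hitting this, here is a rough sketch (mine, not from the issue) of one way to confirm that SMT/hyperthreading is really off, assuming the standard Linux sysfs topology layout; the program and its output format are made up for illustration.

```c
/* Count logical CPUs per physical core by reading the sysfs topology
 * files; a core with more than one sibling means SMT is active.
 * Sketch only -- paths assume the usual /sys/devices/system/cpu layout. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char path[128], buf[256];
    int cpu, smt_seen = 0;

    for (cpu = 0; cpu < 1024; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                      /* no more CPUs */
        if (fgets(buf, sizeof(buf), f)) {
            /* A list like "12" means one thread per core; "12,28" or
             * "12-13" means SMT siblings share the core. */
            if (strchr(buf, ',') || strchr(buf, '-'))
                smt_seen = 1;
        }
        fclose(f);
    }
    printf("hyperthreading appears %s\n", smt_seen ? "ENABLED" : "disabled");
    return 0;
}
```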

behlendorf added the Type: Defect label on Jul 25, 2019

DrDaveD commented Jul 26, 2019

Since we are no longer having the problem, I am closing this issue.

DrDaveD closed this as completed on Jul 26, 2019

DrDaveD commented Aug 17, 2019

This has now happened again on the same machine.

DrDaveD reopened this on Aug 17, 2019

hhhappe commented Sep 22, 2019

I've experienced the same. Will the #8005 fix go into 0.7.14?


DrDaveD commented Sep 23, 2019

According to a comment in #8005, that fix was already in 0.7.12. It must not have completely fixed the problem, however.


DrDaveD commented Oct 7, 2019

Since this occurs only rarely (but not rarely enough), I note that it happened again today. It wasn't a crash, but it froze all zfs accesses until a reboot. I confirm that hyperthreading is still disabled and the zfs and spl versions haven't changed; the kernel had been upgraded to 3.10.0-957.27.2.el7.x86_64.

Oct  7 06:19:49 hcc-cvmfs.unl.edu kernel: WARNING: CPU: 14 PID: 10131 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0
Oct  7 06:19:49 hcc-cvmfs.unl.edu kernel: list_del corruption, ffffa02e9f695fd8->next is LIST_POISON1 (dead000000000100)


stale bot commented Oct 6, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Oct 6, 2020
stale bot closed this as completed on Jan 4, 2021