
list_del corruption. next->prev should be XXX, but was dead000000000200 and crash in userquota_updates_task #7997

Closed
verygreen opened this issue Oct 7, 2018 · 17 comments

Comments

@verygreen
Contributor

This is similar to #7933, on the latest RHEL 7.5 kernel and 0.7.11 checked out from the git tree, but I am hitting it with Lustre testing, so I guess it is materially different.

Typically this hits in recovery testing shortly after the FS is brought back up following recovery: first the list corruption, and then a crash in userquota_updates_task stepping on a bad pointer:

[ 6366.191628] Lustre: DEBUG MARKER: centos-52.localnet: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[ 6366.457319] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
[ 6375.281988] ------------[ cut here ]------------
[ 6375.294283] WARNING: CPU: 4 PID: 1098 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
[ 6375.301869] list_del corruption. next->prev should be ffff8800af0f7cf8, but was dead000000000200
[ 6375.304407] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic drm_kms_helper pata_acpi drm ata_piix i2c_piix4 libata serio_raw pcspkr virtio_console i2c_core virtio_balloon virtio_blk floppy ip_tables [last unloaded: libcfs]
[ 6375.317889] CPU: 4 PID: 1098 Comm: dp_sync_taskq Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.5-debug2 #1
[ 6375.320240] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6375.321550] Call Trace:
[ 6375.326211]  [<ffffffff817744da>] dump_stack+0x19/0x1b
[ 6375.331553]  [<ffffffff81085798>] __warn+0xd8/0x100
[ 6375.332977]  [<ffffffff8108581f>] warn_slowpath_fmt+0x5f/0x80
[ 6375.334897]  [<ffffffff81779e4c>] ? __mutex_unlock_slowpath+0xdc/0x190
[ 6375.336225]  [<ffffffff813d42b2>] __list_del_entry+0x82/0xd0
[ 6375.338071]  [<ffffffff813d430d>] list_del+0xd/0x30
[ 6375.340917]  [<ffffffffa0b03945>] multilist_sublist_remove+0x15/0x20 [zfs]
[ 6375.343412]  [<ffffffffa0acc472>] userquota_updates_task+0x1d2/0x570 [zfs]
[ 6375.346302]  [<ffffffff810b15cb>] ? autoremove_wake_function+0x2b/0x40
[ 6375.348659]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.351962]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.355529]  [<ffffffffa04ffda7>] taskq_thread+0x2a7/0x4f0 [spl]
[ 6375.357697]  [<ffffffff810c5740>] ? wake_up_state+0x20/0x20
[ 6375.362061]  [<ffffffffa04ffb00>] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 6375.366040]  [<ffffffff810b0504>] kthread+0xe4/0xf0
[ 6375.368557]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.401413]  [<ffffffff817886b7>] ret_from_fork_nospec_begin+0x21/0x21
[ 6375.429346]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.430864] ---[ end trace 1a66eb165735d7b7 ]---
[ 6375.443974] BUG: unable to handle kernel paging request at ffff8800af0f7f88
[ 6375.444109] IP: [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]
[ 6375.447010] PGD 23e4067 PUD 33fa01067 PMD 33f888067 PTE 80000000af0f7060
[ 6375.447244] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 6375.447244] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic drm_kms_helper pata_acpi drm ata_piix i2c_piix4 libata serio_raw pcspkr virtio_console i2c_core virtio_balloon virtio_blk floppy ip_tables [last unloaded: libcfs]
[ 6375.447244] CPU: 4 PID: 1098 Comm: dp_sync_taskq Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.5-debug2 #1
[ 6375.447244] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6375.447244] task: ffff88029df2e840 ti: ffff880298538000 task.ti: ffff880298538000
[ 6375.447244] RIP: 0010:[<ffffffffa0acc495>]  [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]
[ 6375.447244] RSP: 0018:ffff88029853bc58  EFLAGS: 00010286
[ 6375.447244] RAX: ffff8800af0f7c00 RBX: ffff8802f8c23800 RCX: 0000000000000000
[ 6375.447244] RDX: ffff8800947bd738 RSI: ffff8802d4be7e68 RDI: ffff8800947bd6c0
[ 6375.481303] RBP: ffff88029853bd28 R08: 0000000000000000 R09: 0000000000000067
[ 6375.481303] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800af0f7d08
[ 6375.481303] R13: ffff8800af0f7d50 R14: ffff8800947bd6c0 R15: ffff8800af0f7c00
[ 6375.481303] FS:  0000000000000000(0000) GS:ffff88033db00000(0000) knlGS:0000000000000000
[ 6375.481303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6375.481303] CR2: ffff8800af0f7f88 CR3: 00000000ba024000 CR4: 00000000000006e0
[ 6375.481303] Call Trace:
[ 6375.481303]  [<ffffffff810b15cb>] ? autoremove_wake_function+0x2b/0x40
[ 6375.481303]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.481303]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.481303]  [<ffffffffa04ffda7>] taskq_thread+0x2a7/0x4f0 [spl]
[ 6375.481303]  [<ffffffff810c5740>] ? wake_up_state+0x20/0x20
[ 6375.481303]  [<ffffffffa04ffb00>] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 6375.481303]  [<ffffffff810b0504>] kthread+0xe4/0xf0
[ 6375.481303]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.481303]  [<ffffffff817886b7>] ret_from_fork_nospec_begin+0x21/0x21
[ 6375.481303]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.481303] Code: 4c 89 f7 e8 be 74 03 00 48 8b b3 f8 03 00 00 4c 89 ff e8 7f ec 00 00 4c 89 f7 e8 c7 74 03 00 48 85 c0 49 89 c7 0f 84 5b 01 00 00 <45> 8b af 88 03 00 00 41 f6 c5 04 0f 84 d8 fe ff ff 49 8b 87 60 
[ 6375.481303] RIP  [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]

I have crashdumps and can reproduce this reasonably easily if there are any ideas for fixes.

@behlendorf
Contributor

@verygreen to help narrow this down, can you confirm you weren't observing this issue with, say, 0.7.9 or an earlier 0.7.x tag?

@verygreen
Contributor Author

Well, due to some oversight, the previous release I was on was some checkout reporting itself as 0.7.0, though after some double-checking the last commit was from the end of May 2018 (git hash 1a5b96b). It did not exhibit any problems. I can go back to an actual 0.7.9 to make sure if you think that would be helpful.

@verygreen
Contributor Author

Just to make it clearer, hopefully: I hit this issue on the 0.7.11 tag, and I did not hit it on the checkout of 1a5b96b. We also don't seem to have any signs of this in our testing on 0.7.9, but that does not mean too much, since apparently our test run for 0.7.11 in the standard test environment did not hit this either.

@tonyhutter
Contributor

@verygreen could you give us some more detail on how to reproduce this? I'd like to see if I can hit it on our Lustre test cluster.

@verygreen
Contributor Author

I have a bunch of VMs running various Lustre tests in a loop. The ones that hit these failures are:

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT=101 SLOW=yes REFORMAT=yes bash recovery-small.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; SLOW=yes REFORMAT=yes bash replay-single.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash replay-ost-single.sh ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash replay-dual.sh ; bash llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT="32 36 67 76 78 102" SLOW=yes REFORMAT=yes bash conf-sanity.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; for i in $(seq 0 7) ; do losetup -d /dev/loop$i ; done ; done

For best results, have your kernel built with DEBUG_PAGEALLOC so that accesses to freed memory lead to an immediate crash (rather than having to just monitor for the error messages).

@verygreen
Contributor Author

Ah, by the way, a bit of the setup was missing; the other variables I set are listed below.
Obviously you need to force ZFS. I cannot readily tell you whether DNE and other settings play a bigger role here or not.

FSTYPE=ldiskfs
MDSSIZE=400000
MDSCOUNT=1
OSTCOUNT=4
# 50% probability - ZFS
test $((RANDOM % 2)) -eq 0 && FSTYPE=zfs MDSSIZE=600000

# 33% probability - DNE
test $((RANDOM % 3)) -eq 0 && MDSCOUNT=3

export FSTYPE
export MDSSIZE
export MDSCOUNT
export OSTCOUNT

You do not really need a "Lustre cluster" to run these tests. Every VM is a stand-alone "MDS + OSS + client" kind of config; the more of them you run, the better the chance that something will break. I have 120 instances in my test set.

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Oct 10, 2018
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run, during which the type is set to DMU_OT_NONE but the dnode
is not yet evictable, leading to crashes. This patch adds the
ability for dnodes to track which txg they were last dirtied in
and adds a check for this before performing the reclaim.

This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Requires-spl: spl-0.7-release
Issue openzfs#7147
Issue openzfs#7388
Issue openzfs#7997
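
As a rough, self-contained sketch of the idea in the commit message above, assuming illustrative names throughout (this is not the actual ZFS source): the dnode remembers the txg in which it was last dirtied, and the slot-reclaim check refuses to treat it as free until that txg has been synced out, even if its type has already been reset.

/*
 * Illustrative sketch only -- hypothetical names, not the ZFS source.
 */
#include <stdbool.h>
#include <stdint.h>

#define OT_NONE 0                       /* stands in for DMU_OT_NONE */

typedef struct dnode_sketch {
        uint64_t dn_type;               /* object type; OT_NONE once freed */
        uint64_t dn_dirty_txg;          /* txg this dnode was last dirtied in */
} dnode_sketch_t;

/* Called whenever the dnode is dirtied in transaction group txg. */
static void
dnode_sketch_set_dirty(dnode_sketch_t *dn, uint64_t txg)
{
        if (txg > dn->dn_dirty_txg)
                dn->dn_dirty_txg = txg;
}

/*
 * The pre-fix logic looked only at dn_type, so a freed dnode whose dirty
 * records were still being processed could be reclaimed out from under
 * the user-accounting task. Requiring the last-dirtied txg to be older
 * than the txg currently syncing closes that window.
 */
static bool
dnode_sketch_slot_reclaimable(const dnode_sketch_t *dn, uint64_t syncing_txg)
{
        if (dn->dn_type != OT_NONE)
                return (false);         /* still a live object */
        if (dn->dn_dirty_txg >= syncing_txg)
                return (false);         /* freed, but not yet synced out */
        return (true);
}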
@behlendorf
Contributor

@verygreen can you apply #8005 to the ZFS version you're testing? It should resolve the list corruption you're seeing. This issue was accidentally introduced by the dnode send/recv fixes; it was fixed in master by edc1e71, which I've backported.

45f0437 Fix 'zfs recv' of non large_dnode send streams
dc3eea8 Fix object reclaim when using large dnodes
d2c8103 Fix problems receiving reallocated dnodes

@verygreen
Contributor Author

OK, perfect timing; I was just preparing to update my test setup for a new testing round. I will likely have some results in the morning.

Thanks!

@shodanshok
Contributor

@behlendorf In which ZoL release was the race introduced? Is the 0.7.9 release affected?
Thanks.

@rincebrain
Contributor

@shodanshok those 3 commits first landed in the ill-fated 0.7.10. So, assuming the bug comes solely from those, 0.7.9 would not be affected, and the eventual 0.7.12 would have the fix.

@woffs

woffs commented Oct 10, 2018

I saw this for the first time with ZFS 0.7.11 on Linux 4.18.6 in Debian. No problems with 0.7.9.

@verygreen
Contributor Author

So far so good: 0 crashes in the past 9.5 hours. Without the patch I had at least one crash per hour on my setup.

@shodanshok
Contributor

shodanshok commented Oct 10, 2018

@rincebrain Interesting. In #7933 (comment) I tried to replicate it with a very heavy workload based on concurrent instances of fio and fs_mark, to no avail. So what does trigger the bug? Should I revert to 0.7.9 even though I did not hit the bug? Thanks.

@behlendorf
Contributor

@shodanshok in order to hit the bug you need to write to a file and then unlink the file while it's still being actively written out to disk. Once the data has been synced out you won't be able to hit it. You might be able to reproduce it with something like dbench.
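
For anyone who wants to exercise that pattern directly rather than through dbench or the Lustre scripts, here is a minimal, hypothetical sketch (arbitrary file names and sizes, not a reproducer taken from this thread) that keeps writing files and unlinking them before the data has been synced out:

/*
 * Hypothetical stress sketch: create a file, dirty a few MB, and unlink
 * it before the data has been synced, so freed dnodes are still dirty
 * when the sync-time user-accounting work runs. Run it (ideally several
 * copies) from a directory on a ZFS dataset until interrupted.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        static char buf[1 << 16];
        memset(buf, 0xab, sizeof (buf));

        for (unsigned long i = 0; ; i++) {
                char name[64];
                (void) snprintf(name, sizeof (name), "racer.%lu", i);

                int fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                if (fd < 0) {
                        perror("open");
                        exit(1);
                }

                /* Dirty a few MB so the dnode is dirty in the open txg. */
                for (int j = 0; j < 64; j++)
                        (void) write(fd, buf, sizeof (buf));

                /* Unlink before the data has made it to disk. */
                (void) unlink(name);
                (void) close(fd);
        }
        return (0);
}

Whether this actually races with userquota_updates_task depends on timing, so many parallel instances (as with the VM farm described above) likely give the best odds.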

tonyhutter pushed a commit to LLNL/zfs that referenced this issue Oct 10, 2018
@verygreen
Contributor Author

~24 hours and still no crashes, so from my perspective this seems to be fixed now. Thanks!

@loli10K
Contributor

loli10K commented Nov 18, 2019

The latest comment suggests this was indeed fixed by b32f127 (zfs-0.7.12), closing.

@loli10K loli10K closed this as completed Nov 18, 2019
@DrDaveD

DrDaveD commented Nov 18, 2019

For the record, I still saw list_del corruption a few times in 0.7.13, as reported in #9068. The message was slightly different, however.
