
list_del corruption. next->prev should be XXX, but was dead000000000200 and crash in userquota_updates_task #7997

Closed
verygreen opened this issue Oct 7, 2018 · 17 comments

Comments

@verygreen
Contributor

This is similar to #7933, on the latest RHEL 7.5 kernel and 0.7.11 checked out from the git tree, but I am hitting it with Lustre testing, so I guess it is materially different.

Typically this hits in recovery testing shortly after the FS is brought back up following recovery: first the list corruption, and then a crash in userquota_updates_task stepping on a bad pointer:

[ 6366.191628] Lustre: DEBUG MARKER: centos-52.localnet: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[ 6366.457319] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
[ 6375.281988] ------------[ cut here ]------------
[ 6375.294283] WARNING: CPU: 4 PID: 1098 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
[ 6375.301869] list_del corruption. next->prev should be ffff8800af0f7cf8, but was dead000000000200
[ 6375.304407] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic drm_kms_helper pata_acpi drm ata_piix i2c_piix4 libata serio_raw pcspkr virtio_console i2c_core virtio_balloon virtio_blk floppy ip_tables [last unloaded: libcfs]
[ 6375.317889] CPU: 4 PID: 1098 Comm: dp_sync_taskq Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.5-debug2 #1
[ 6375.320240] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6375.321550] Call Trace:
[ 6375.326211]  [<ffffffff817744da>] dump_stack+0x19/0x1b
[ 6375.331553]  [<ffffffff81085798>] __warn+0xd8/0x100
[ 6375.332977]  [<ffffffff8108581f>] warn_slowpath_fmt+0x5f/0x80
[ 6375.334897]  [<ffffffff81779e4c>] ? __mutex_unlock_slowpath+0xdc/0x190
[ 6375.336225]  [<ffffffff813d42b2>] __list_del_entry+0x82/0xd0
[ 6375.338071]  [<ffffffff813d430d>] list_del+0xd/0x30
[ 6375.340917]  [<ffffffffa0b03945>] multilist_sublist_remove+0x15/0x20 [zfs]
[ 6375.343412]  [<ffffffffa0acc472>] userquota_updates_task+0x1d2/0x570 [zfs]
[ 6375.346302]  [<ffffffff810b15cb>] ? autoremove_wake_function+0x2b/0x40
[ 6375.348659]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.351962]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.355529]  [<ffffffffa04ffda7>] taskq_thread+0x2a7/0x4f0 [spl]
[ 6375.357697]  [<ffffffff810c5740>] ? wake_up_state+0x20/0x20
[ 6375.362061]  [<ffffffffa04ffb00>] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 6375.366040]  [<ffffffff810b0504>] kthread+0xe4/0xf0
[ 6375.368557]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.401413]  [<ffffffff817886b7>] ret_from_fork_nospec_begin+0x21/0x21
[ 6375.429346]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.430864] ---[ end trace 1a66eb165735d7b7 ]---
[ 6375.443974] BUG: unable to handle kernel paging request at ffff8800af0f7f88
[ 6375.444109] IP: [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]
[ 6375.447010] PGD 23e4067 PUD 33fa01067 PMD 33f888067 PTE 80000000af0f7060
[ 6375.447244] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 6375.447244] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic crct10dif_common rpcsec_gss_krb5 ttm ata_generic drm_kms_helper pata_acpi drm ata_piix i2c_piix4 libata serio_raw pcspkr virtio_console i2c_core virtio_balloon virtio_blk floppy ip_tables [last unloaded: libcfs]
[ 6375.447244] CPU: 4 PID: 1098 Comm: dp_sync_taskq Kdump: loaded Tainted: P        W  OE  ------------   3.10.0-7.5-debug2 #1
[ 6375.447244] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 6375.447244] task: ffff88029df2e840 ti: ffff880298538000 task.ti: ffff880298538000
[ 6375.447244] RIP: 0010:[<ffffffffa0acc495>]  [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]
[ 6375.447244] RSP: 0018:ffff88029853bc58  EFLAGS: 00010286
[ 6375.447244] RAX: ffff8800af0f7c00 RBX: ffff8802f8c23800 RCX: 0000000000000000
[ 6375.447244] RDX: ffff8800947bd738 RSI: ffff8802d4be7e68 RDI: ffff8800947bd6c0
[ 6375.481303] RBP: ffff88029853bd28 R08: 0000000000000000 R09: 0000000000000067
[ 6375.481303] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800af0f7d08
[ 6375.481303] R13: ffff8800af0f7d50 R14: ffff8800947bd6c0 R15: ffff8800af0f7c00
[ 6375.481303] FS:  0000000000000000(0000) GS:ffff88033db00000(0000) knlGS:0000000000000000
[ 6375.481303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6375.481303] CR2: ffff8800af0f7f88 CR3: 00000000ba024000 CR4: 00000000000006e0
[ 6375.481303] Call Trace:
[ 6375.481303]  [<ffffffff810b15cb>] ? autoremove_wake_function+0x2b/0x40
[ 6375.481303]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.481303]  [<ffffffffa0ac96a0>] ? dmu_objset_userobjspace_upgradable+0x60/0x60 [zfs]
[ 6375.481303]  [<ffffffffa04ffda7>] taskq_thread+0x2a7/0x4f0 [spl]
[ 6375.481303]  [<ffffffff810c5740>] ? wake_up_state+0x20/0x20
[ 6375.481303]  [<ffffffffa04ffb00>] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 6375.481303]  [<ffffffff810b0504>] kthread+0xe4/0xf0
[ 6375.481303]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.481303]  [<ffffffff817886b7>] ret_from_fork_nospec_begin+0x21/0x21
[ 6375.481303]  [<ffffffff810b0420>] ? kthread_create_on_node+0x140/0x140
[ 6375.481303] Code: 4c 89 f7 e8 be 74 03 00 48 8b b3 f8 03 00 00 4c 89 ff e8 7f ec 00 00 4c 89 f7 e8 c7 74 03 00 48 85 c0 49 89 c7 0f 84 5b 01 00 00 <45> 8b af 88 03 00 00 41 f6 c5 04 0f 84 d8 fe ff ff 49 8b 87 60 
[ 6375.481303] RIP  [<ffffffffa0acc495>] userquota_updates_task+0x1f5/0x570 [zfs]

I have crashdumps and can reproduce this reasonably easily if there are any ideas for fixes.

@behlendorf
Contributor

@verygreen to help narrow this down, can you confirm you weren't observing this issue with, say, 0.7.9 or an earlier 0.7.x tag?

@verygreen
Contributor Author

Well, due to some oversight, the previous release I was on was some checkout reporting itself as 0.7.0, though after some double-checking the last commit was from the end of May 2018 (git hash 1a5b96b). It did not exhibit any problems. I can go back to an actual 0.7.9 to make sure if you think that would be helpful.

@verygreen
Contributor Author

Just to make it clearer, hopefully: I hit this issue on the 0.7.11 tag, and I did not hit it on the checkout of 1a5b96b. We also don't seem to have any signs of this in our testing on 0.7.9, but that does not mean too much, since apparently our test run for 0.7.11 in the standard test environment did not hit this either.

@tonyhutter
Contributor

@verygreen could you give us some more detail on how to reproduce this? I'd like to see if I can hit it on our Lustre test cluster.

@verygreen
Contributor Author

I have a bunch of VMs running various Lustre tests in a loop. The ones that hit these failures are:

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT=101 SLOW=yes REFORMAT=yes bash recovery-small.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; SLOW=yes REFORMAT=yes bash replay-single.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash replay-ost-single.sh ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash replay-dual.sh ; bash llmountcleanup.sh ; done

while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT="32 36 67 76 78 102" SLOW=yes REFORMAT=yes bash conf-sanity.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; for i in $(seq 0 7) ; do losetup -d /dev/loop$i ; done ; done

For best results, have your kernel built with DEBUG_PAGEALLOC so that accesses to freed memory lead to an immediate crash (rather than having to just monitor for the error messages).

@verygreen
Contributor Author

Ah, by the way, a bit of the setup was missing; the other variables I set are listed below.
Obviously you need to force ZFS. I cannot readily tell you whether DNE and other settings play a bigger role here or not.

FSTYPE=ldiskfs
MDSSIZE=400000
MDSCOUNT=1
OSTCOUNT=4
# 50% probability - ZFS
test $((RANDOM % 2)) -eq 0 && FSTYPE=zfs MDSSIZE=600000

# 33% probability - DNE
test $((RANDOM % 3)) -eq 0 && MDSCOUNT=3

export FSTYPE
export MDSSIZE
export MDSCOUNT
export OSTCOUNT

You do not really need a "Lustre cluster" to run these tests. Every VM is a stand-alone "MDS + OSS + client" kind of config; the more of them you run, the better the chance that something will break. I have 120 instances in my test set.

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Oct 10, 2018
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run, during which the type is set to DMU_OT_NONE but the dnode
is not yet evictable, leading to crashes. This patch adds the
ability for dnodes to track which txg they were last dirtied in
and adds a check for this before performing the reclaim.

This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Tom Caputi <[email protected]>
Requires-spl: spl-0.7-release
Issue openzfs#7147
Issue openzfs#7388
Issue openzfs#7997
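
As a rough, self-contained sketch of the idea in the commit message above, assuming illustrative names throughout (this is not the actual ZFS source): the dnode remembers the txg in which it was last dirtied, and the slot-reclaim check refuses to treat it as free until that txg has been synced out, even if its type has already been reset.

/*
 * Illustrative sketch only -- hypothetical names, not the ZFS source.
 */
#include <stdbool.h>
#include <stdint.h>

#define OT_NONE 0                       /* stands in for DMU_OT_NONE */

typedef struct dnode_sketch {
        uint64_t dn_type;               /* object type; OT_NONE once freed */
        uint64_t dn_dirty_txg;          /* txg this dnode was last dirtied in */
} dnode_sketch_t;

/* Called whenever the dnode is dirtied in transaction group txg. */
static void
dnode_sketch_set_dirty(dnode_sketch_t *dn, uint64_t txg)
{
        if (txg > dn->dn_dirty_txg)
                dn->dn_dirty_txg = txg;
}

/*
 * The pre-fix logic looked only at dn_type, so a freed dnode whose dirty
 * records were still being processed could be reclaimed out from under
 * the user-accounting task. Requiring the last-dirtied txg to be older
 * than the txg currently syncing closes that window.
 */
static bool
dnode_sketch_slot_reclaimable(const dnode_sketch_t *dn, uint64_t syncing_txg)
{
        if (dn->dn_type != OT_NONE)
                return (false);         /* still a live object */
        if (dn->dn_dirty_txg >= syncing_txg)
                return (false);         /* freed, but not yet synced out */
        return (true);
}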
@behlendorf
Contributor

@verygreen can you apply #8005 to the ZFS version you're testing? It should resolve the list corruption you're seeing. This issue was accidentally introduced by the dnode send/recv fixes; it was fixed in master by edc1e71, which I've backported.

45f0437 Fix 'zfs recv' of non large_dnode send streams
dc3eea8 Fix object reclaim when using large dnodes
d2c8103 Fix problems receiving reallocated dnodes

@verygreen
Contributor Author

OK, perfect timing; I was just preparing to update my test setup for a new testing round. I will likely have some results in the morning.

Thanks!

@shodanshok
Contributor

@behlendorf In which ZoL release was the race introduced? Is the 0.7.9 release affected?
Thanks.

@rincebrain
Contributor

@shodanshok those 3 commits first landed in the ill-fated 0.7.10. So, assuming the bug comes solely from those, 0.7.9 would not be affected, and the eventual 0.7.12 would have the fix.

@woffs

woffs commented Oct 10, 2018

I saw this for the first time with ZFS 0.7.11 on Linux 4.18.6 in Debian. No problems with 0.7.9.

@verygreen
Contributor Author

So far so good: 0 crashes in the past 9.5 hours. Without the patch I had at least one crash per hour on my setup.

@shodanshok
Contributor

shodanshok commented Oct 10, 2018

@rincebrain Interesting. In #7933 (comment) I tried to replicate it with a very heavy workload based on concurrent instances of fio and fs_mark, to no avail. So what does trigger the bug? Should I revert to 0.7.9 even though I did not hit the bug? Thanks.

@behlendorf
Contributor

@shodanshok in order to hit the bug you need to write to a file and then unlink the file while it's still being actively written out to disk. Once the data has been synced out you won't be able to hit it. You might be able to reproduce it with something like dbench.
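
For anyone who wants to exercise that pattern directly rather than through dbench or the Lustre scripts, here is a minimal, hypothetical sketch (arbitrary file names and sizes, not a reproducer taken from this thread) that keeps writing files and unlinking them before the data has been synced out:

/*
 * Hypothetical stress sketch: create a file, dirty a few MB, and unlink
 * it before the data has been synced, so freed dnodes are still dirty
 * when the sync-time user-accounting work runs. Run it (ideally several
 * copies) from a directory on a ZFS dataset until interrupted.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        static char buf[1 << 16];
        memset(buf, 0xab, sizeof (buf));

        for (unsigned long i = 0; ; i++) {
                char name[64];
                (void) snprintf(name, sizeof (name), "racer.%lu", i);

                int fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                if (fd < 0) {
                        perror("open");
                        exit(1);
                }

                /* Dirty a few MB so the dnode is dirty in the open txg. */
                for (int j = 0; j < 64; j++)
                        (void) write(fd, buf, sizeof (buf));

                /* Unlink before the data has made it to disk. */
                (void) unlink(name);
                (void) close(fd);
        }
        return (0);
}

Whether this actually races with userquota_updates_task depends on timing, so many parallel instances (as with the VM farm described above) likely give the best odds.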

tonyhutter pushed a commit to LLNL/zfs that referenced this issue Oct 10, 2018
@verygreen
Contributor Author

~24 hours and still no crashes, so from my perspective this seems to be fixed now. Thanks!

@loli10K
Contributor

loli10K commented Nov 18, 2019

The latest comment suggests this was indeed fixed by b32f127 (zfs-0.7.12), closing.

@loli10K loli10K closed this as completed Nov 18, 2019
@DrDaveD

DrDaveD commented Nov 18, 2019

For the record, I still saw list_del corruption a few times in 0.7.13, as reported in #9068. The message was slightly different, however.
