VERIFY(zio->io_children[c][w] == 0) failed #5918
Possibly related to, or sharing a root cause with, #5429? A gang block handling issue?
Not sure. The space usage observation is anecdotal at best, and I don't see anything related to pool space allocation going on here.
I've had another crash. The stack trace is nigh-identical and is below. This one is flowed properly from a kernel crash dump. Is there anything I can do to help debug? I'm building a debug version of ZFS to install but don't know what else to do.

[603767.293140] general protection fault: 0000 [#1] SMP
[603767.293317] Modules linked in: nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill tcp_diag udp_diag inet_diag unix_diag 8021q garp mrp xt_state rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache sch_fq tcp_bbr ixgbevf nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter iTCO_wdt iTCO_vendor_support sb_edac edac_core x86_pkg_temp_thermal coretemp zfs(PO) zunicode(PO) zavl(PO) kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd icp(PO) intel_cstate intel_rapl_perf zcommon(PO) znvpair(PO) spl(O) pcspkr ses enclosure ipmi_devintf i2c_i801 i2c_smbus mei_me lpc_ich sg joydev input_leds mfd_core mei ioatdma shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace
[603767.295318] sunrpc ip_tables ext4 jbd2 mbcache raid1 sd_mod crc32c_intel i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci ixgbe libahci mdio drm ptp libata mpt3sas pps_core raid_class dca scsi_transport_sas fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dccp]
[603767.296114] CPU: 11 PID: 27487 Comm: z_rd_int_0 Tainted: P O 4.9.11-1.el7.elrepo.x86_64 #1
[603767.296440] Hardware name: Supermicro Super Server/X10DRD-LTP, BIOS 2.0 02/26/2016
[603767.296762] task: ffff88202c185b00 task.stack: ffffc90043b64000
[603767.296931] RIP: 0010:[<ffffffff81381d0f>] [<ffffffff81381d0f>] __list_add+0xf/0xc0
[603767.297260] RSP: 0018:ffffc90043b67a20 EFLAGS: 00010282
[603767.297427] RAX: 0000000000000030 RBX: ffff881ce2cef080 RCX: ffff8802bc857c90
[603767.297746] RDX: dead000000000100 RSI: ffff881b9861be10 RDI: ffff8802bc857cb0
[603767.298067] RBP: ffffc90043b67a38 R08: 000000000001cd20 R09: 0000000000000e00
[603767.298386] R10: ffff881e4a642580 R11: ffff880f61d13ae0 R12: dead000000000100
[603767.298707] R13: ffff881b9861be10 R14: ffff881ce2cef400 R15: ffff881b9861c070
[603767.299027] FS: 0000000000000000(0000) GS:ffff88203f2c0000(0000) knlGS:0000000000000000
[603767.299350] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[603767.299516] CR2: 00007fbeb123e000 CR3: 0000000ce2484000 CR4: 00000000003406e0
[603767.299837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[603767.300157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[603767.300477] Stack:
[603767.300636] ffff881ce2cef080 ffff881b9861bcf0 ffff8802bc857c90 ffffc90043b67a70
[603767.300971] ffffffffa18101cf ffff881ce2cef080 ffff88135e59cd00 ffff881b9861bcf0
[603767.301304] ffff88171d1727e8 ffff881ba6267f00 ffffc90043b67ac0 ffffffffa1810649
[603767.301637] Call Trace:
[603767.301859] [<ffffffffa18101cf>] zio_add_child+0x9f/0x120 [zfs]
[603767.302059] [<ffffffffa1810649>] zio_create+0x3f9/0x4e0 [zfs]
[603767.302257] [<ffffffffa1810c21>] zio_read+0xc1/0xe0 [zfs]
[603767.302444] [<ffffffffa175a3e0>] ? arc_read+0xa00/0xa00 [zfs]
[603767.302631] [<ffffffffa175a3e0>] ? arc_read+0xa00/0xa00 [zfs]
[603767.302818] [<ffffffffa1759fb1>] arc_read+0x5d1/0xa00 [zfs]
[603767.303023] [<ffffffffa1761b6e>] dbuf_issue_final_prefetch+0x7e/0x90 [zfs]
[603767.303222] [<ffffffffa1765c90>] dbuf_prefetch_indirect_done+0x100/0x1b0 [zfs]
[603767.303573] [<ffffffffa1811f19>] ? zio_nowait+0xb9/0x150 [zfs]
[603767.303761] [<ffffffffa175a556>] arc_read_done+0x176/0x2d0 [zfs]
[603767.303949] [<ffffffffa175aa32>] l2arc_read_done+0x382/0x3f0 [zfs]
[603767.304150] [<ffffffffa17c57c9>] ? vdev_stat_update+0x1e9/0x4b0 [zfs]
[603767.304349] [<ffffffffa1813941>] zio_done+0x311/0xc00 [zfs]
[603767.304546] [<ffffffffa180e392>] ? zio_wait_for_children+0x62/0x80 [zfs]
[603767.304745] [<ffffffffa180e2cf>] zio_execute+0x8f/0xf0 [zfs]
[603767.304919] [<ffffffffa05e80a8>] taskq_thread+0x248/0x460 [spl]
[603767.305092] [<ffffffff810af070>] ? wake_up_q+0x80/0x80
[603767.305261] [<ffffffffa05e7e60>] ? taskq_thread_spawn+0x50/0x50 [spl]
[603767.305431] [<ffffffff810a27e9>] kthread+0xd9/0xf0
[603767.305597] [<ffffffff810a2710>] ? kthread_park+0x60/0x60
[603767.305767] [<ffffffff810a2710>] ? kthread_park+0x60/0x60
[603767.305936] [<ffffffff81757e55>] ret_from_fork+0x25/0x30
[603767.306103] Code: ff b8 f4 ff ff ff e9 3b ff ff ff b8 f4 ff ff ff e9 31 ff ff ff 0f 1f 80 00 00 00 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 <4c> 8b 42 08 48 89 fb 49 39 f0 75 2a 4d 8b 45 00 4d 39 c4 75 68
[603767.306829] RIP [<ffffffff81381d0f>] __list_add+0xf/0xc0
[603767.307001] RSP <ffffc90043b67a20>
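To judge whether two traces really are "nigh-identical", it helps to compare only the frames the unwinder considers reliable: on x86 the kernel prefixes stack entries it could not confirm with "?". Below is a minimal Python sketch, assuming the log has already been split into one record per line; the regex and the helper names (reliable_frames, common_tail) are hypothetical and not part of any ZFS or kernel tooling.

```python
import re

# Matches call-trace entries such as:
#   [603767.301859] [<ffffffffa18101cf>] zio_add_child+0x9f/0x120 [zfs]
# Group 1 captures an optional "? " marker (unreliable frame),
# group 2 the symbol name without its +offset/length suffix.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(log_text):
    """Return symbols of frames under 'Call Trace:' that are not marked '?'."""
    frames, in_trace = [], False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        m = FRAME_RE.search(line)
        if not in_trace or m is None:
            in_trace = False  # any non-frame line ends the current trace
            continue
        if m.group(1) is None:
            frames.append(m.group(2))
    return frames


def common_tail(a, b):
    """Return the longest shared suffix (outermost frames) of two traces."""
    shared = []
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        shared.append(x)
    return list(reversed(shared))
```

Restricting the match to lines after "Call Trace:" keeps the RIP lines, which use the same [<address>] symbol+off/len syntax, out of the comparison.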
Argument 3 of …
@DeHackEd It might be helpful if you could install the debuginfo for your kernel. Then you could run …
@dweeezil I suspect that this is unrelated to #5429 because of the LIST_POISON1.
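The register dump is consistent with the LIST_POISON1 reading. On x86-64, __list_add(new, prev, next) receives its arguments in RDI, RSI and RDX, so argument 3 (next) is RDX = dead000000000100; the Code: bytes also show 49 89 d4 (mov r12, rdx), which is why R12 holds the same value. The faulting instruction, 4c 8b 42 08, decodes to mov r8, [rdx+8], a read of next->prev. list_del() writes LIST_POISON1 into ->next, so a next pointer with that value indicates the code was operating on list linkage that had already been torn down. A minimal Python sketch of that reasoning, assuming the x86-64 default POISON_POINTER_DELTA of 0xdead000000000000 and 48-bit virtual addresses; the helper names are hypothetical:

```python
# Assumptions (x86-64 defaults, not read from the affected kernel):
POISON_POINTER_DELTA = 0xDEAD000000000000    # CONFIG_ILLEGAL_POINTER_VALUE
LIST_POISON1 = 0x100 + POISON_POINTER_DELTA  # what list_del() stores in ->next
VA_BITS = 48                                 # 4-level paging


def is_canonical(addr, va_bits=VA_BITS):
    """An x86-64 address is canonical if bits 63..(va_bits - 1) are all equal."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def diagnose_list_add(rdx):
    """Explain a fault in __list_add(new, prev, next) given RDX (= next)."""
    if rdx != LIST_POISON1:
        return "next is not LIST_POISON1"
    faulting = rdx + 8  # mov r8, [rdx+8] reads next->prev
    # A non-canonical address raises #GP rather than a page fault.
    kind = "page fault" if is_canonical(faulting) else "general protection fault"
    return (f"next == LIST_POISON1 (list already torn down by list_del()); "
            f"reading {faulting:#x} -> {kind}")
```

This also accounts for the first line of the oops reading "general protection fault" rather than a NULL-pointer or paging error: 0xdead000000000108 is not a canonical address.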
ELRepo doesn't provide debuginfo builds, so I'll have to run a custom build. But I'll do that.
@DeHackEd In hindsight, no need. Just run it on your actual kernel binary; hopefully we'll get disassembly. Edit: That won't work due to compression. Use this:
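The command referenced in that comment was not preserved. For illustration only: the kernel source tree ships scripts/extract-vmlinux, which scans a compressed kernel image for a known compression signature and decompresses from that offset. Below is a minimal Python sketch of the same idea that handles only gzip (xz, lzma, lz4 and others would need their own signatures); extract_gzip_payload is a hypothetical name.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip header bytes: ID1, ID2, CM=deflate


def extract_gzip_payload(image):
    """Return the first gzip stream embedded in `image` that decompresses fully.

    Every occurrence of the magic is tried in turn, because an image contains
    setup code before the compressed kernel and may contain false matches.
    """
    start = image.find(GZIP_MAGIC)
    while start != -1:
        # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            data = d.decompress(image[start:])
            if d.eof:  # a complete stream; trailing bytes land in unused_data
                return data
        except zlib.error:
            pass  # false positive: keep scanning
        start = image.find(GZIP_MAGIC, start + 1)
    raise ValueError("no complete gzip stream found")
```

The result is the raw vmlinux ELF, which objdump can then disassemble.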
Disassembly of affected functions (reposting from IRC):
__list_add: https://pastebin.com/raw/M0tSULEV
Note that I don't have the exact same .ko that produced the stack trace above, so the zio_add_child disassembly comes from a manual recompile after the fact.
Non-fatal crash dump from the debug build:

[464792.580697] VERIFY(zio->io_children[c][w] == 0) failed
[464792.580893] PANIC at zio.c:3759:zio_done()
[464792.581099] Showing stack for process 31381
[464792.581281] CPU: 14 PID: 31381 Comm: z_wr_iss Tainted: P O 4.9.21 #1
[464792.581616] Hardware name: Supermicro Super Server/X10DRD-LTP, BIOS 2.0 02/26/2016
[464792.581956] ffffc90026c4b6e0 ffffffff81315dec ffffffffa0930f83 0000000000000eaf
[464792.582317] ffffc90026c4b6f0 ffffffffa0607fc4 ffffc90026c4b878 ffffffffa060808f
[464792.582669] 0000000000000001 ffffc90000000028 ffffc90026c4b888 ffffc90026c4b828
[464792.583013] Call Trace:
[464792.583205] [<ffffffff81315dec>] dump_stack+0x63/0x87
[464792.583405] [<ffffffffa0607fc4>] spl_dumpstack+0x44/0x50 [spl]
[464792.583588] [<ffffffffa060808f>] spl_panic+0xbf/0xf0 [spl]
[464792.583777] [<ffffffff816f0364>] ? __schedule+0x224/0x6a0
[464792.583965] [<ffffffff816f0a0a>] ? preempt_schedule_common+0x18/0x2e
[464792.584149] [<ffffffff816f26d2>] ? mutex_lock+0x12/0x2f
[464792.584564] [<ffffffffa083edc4>] zio_done+0x1344/0x1840 [zfs]
[464792.584849] [<ffffffffa083f51d>] ? zio_ready+0x25d/0x630 [zfs]
[464792.585091] [<ffffffffa083a9d1>] zio_nowait+0x121/0x320 [zfs]
[464792.585408] [<ffffffffa0759fae>] dbuf_prefetch+0x45e/0x5f0 [zfs]
[464792.585745] [<ffffffffa0764bcf>] dmu_prefetch+0x23f/0x260 [zfs]
[464792.586081] [<ffffffffa07d1fb5>] space_map_load+0xf5/0x650 [zfs]
[464792.586500] [<ffffffffa07abc49>] metaslab_load+0x69/0x1d0 [zfs]
[464792.586756] [<ffffffffa07abfe9>] metaslab_activate+0x89/0x160 [zfs]
[464792.587007] [<ffffffffa07aca31>] metaslab_alloc_dva.isra.9+0x531/0x10a0 [zfs]
[464792.587413] [<ffffffffa07af6c4>] metaslab_alloc+0x124/0x460 [zfs]
[464792.587672] [<ffffffffa083ae9e>] zio_dva_allocate+0x17e/0xa10 [zfs]
[464792.587915] [<ffffffffa07b2e86>] ? zfs_refcount_add+0x16/0x20 [zfs]
[464792.588148] [<ffffffffa07af449>] ? metaslab_class_throttle_reserve+0xb9/0x130 [zfs]
[464792.588536] [<ffffffffa060ac66>] ? tsd_hash_search.isra.0+0x46/0x90 [spl]
[464792.588729] [<ffffffffa060ad2e>] ? tsd_get_by_thread+0x2e/0x50 [spl]
[464792.588910] [<ffffffffa0605238>] ? taskq_member+0x18/0x30 [spl]
[464792.589205] [<ffffffffa0834285>] zio_execute+0xe5/0x2f0 [zfs]
[464792.589389] [<ffffffffa06060a8>] taskq_thread+0x248/0x460 [spl]
[464792.589580] [<ffffffff810ac090>] ? wake_up_q+0x80/0x80
[464792.589774] [<ffffffffa0605e60>] ? taskq_thread_spawn+0x50/0x50 [spl]
[464792.589971] [<ffffffff8109f739>] kthread+0xd9/0xf0
[464792.590156] [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464792.590349] [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464792.590550] [<ffffffff816f5195>] ret_from_fork+0x25/0x30
[464968.688233] INFO: task z_wr_iss:31381 blocked for more than 120 seconds.
[464968.688433] Tainted: P O 4.9.21 #1
[464968.688605] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[464968.688926] z_wr_iss D 0 31381 2 0x00000080
[464968.689105] ffff8820355187c0 0000000000000000 ffff880f62b2c440 ffff882026e88000
[464968.689441] ffff88203f399240 ffffc90026c4b6d8 ffffffff816f035c 0000000000000286
[464968.689772] ffffffffffffff10 ffffffff81315e00 ffff880f62b2c440 0000000000000eaf
[464968.690111] Call Trace:
[464968.690283] [<ffffffff816f035c>] ? __schedule+0x21c/0x6a0
[464968.690453] [<ffffffff81315e00>] ? dump_stack+0x77/0x87
[464968.690623] [<ffffffff816f0816>] schedule+0x36/0x80
[464968.690798] [<ffffffffa06080b5>] spl_panic+0xe5/0xf0 [spl]
[464968.690966] [<ffffffff816f0364>] ? __schedule+0x224/0x6a0
[464968.691139] [<ffffffff816f0a0a>] ? preempt_schedule_common+0x18/0x2e
[464968.691310] [<ffffffff816f26d2>] ? mutex_lock+0x12/0x2f
[464968.691565] [<ffffffffa083edc4>] zio_done+0x1344/0x1840 [zfs]
[464968.691772] [<ffffffffa083f51d>] ? zio_ready+0x25d/0x630 [zfs]
[464968.691977] [<ffffffffa083a9d1>] zio_nowait+0x121/0x320 [zfs]
[464968.692197] [<ffffffffa0759fae>] dbuf_prefetch+0x45e/0x5f0 [zfs]
[464968.692396] [<ffffffffa0764bcf>] dmu_prefetch+0x23f/0x260 [zfs]
[464968.692603] [<ffffffffa07d1fb5>] space_map_load+0xf5/0x650 [zfs]
[464968.692808] [<ffffffffa07abc49>] metaslab_load+0x69/0x1d0 [zfs]
[464968.693010] [<ffffffffa07abfe9>] metaslab_activate+0x89/0x160 [zfs]
[464968.693230] [<ffffffffa07aca31>] metaslab_alloc_dva.isra.9+0x531/0x10a0 [zfs]
[464968.693588] [<ffffffffa07af6c4>] metaslab_alloc+0x124/0x460 [zfs]
[464968.693795] [<ffffffffa083ae9e>] zio_dva_allocate+0x17e/0xa10 [zfs]
[464968.694000] [<ffffffffa07b2e86>] ? zfs_refcount_add+0x16/0x20 [zfs]
[464968.694221] [<ffffffffa07af449>] ? metaslab_class_throttle_reserve+0xb9/0x130 [zfs]
[464968.694548] [<ffffffffa060ac66>] ? tsd_hash_search.isra.0+0x46/0x90 [spl]
[464968.694720] [<ffffffffa060ad2e>] ? tsd_get_by_thread+0x2e/0x50 [spl]
[464968.694891] [<ffffffffa0605238>] ? taskq_member+0x18/0x30 [spl]
[464968.695117] [<ffffffffa0834285>] zio_execute+0xe5/0x2f0 [zfs]
[464968.695291] [<ffffffffa06060a8>] taskq_thread+0x248/0x460 [spl]
[464968.695463] [<ffffffff810ac090>] ? wake_up_q+0x80/0x80
[464968.695631] [<ffffffffa0605e60>] ? taskq_thread_spawn+0x50/0x50 [spl]
[464968.695804] [<ffffffff8109f739>] kthread+0xd9/0xf0
[464968.695969] [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464968.696140] [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464968.696310] [<ffffffff816f5195>] ret_from_fork+0x25/0x30
[464968.696531] INFO: task txg_sync:32375 blocked for more than 120 seconds.
[464968.696699] Tainted: P O 4.9.21 #1
[464968.696863] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[464968.697188] txg_sync D 0 32375 2 0x00000080
[464968.697360] ffff8810108ff440 ffff880704da5540 ffff8810321c5b00 ffff8809f083ad80
[464968.697692] ffff88103fb99240 ffffc90021f57ac0 ffffffff816f035c ffff8810321c5b00
[464968.698023] 00000000ecdc0887 ffff881948b1f9e0 ffff8810321c5b00 0000000000000000
[464968.698360] Call Trace:
[464968.698522] [<ffffffff816f035c>] ? __schedule+0x21c/0x6a0
[464968.698689] [<ffffffff816f0816>] schedule+0x36/0x80
[464968.698853] [<ffffffff816f3a3c>] schedule_timeout+0x21c/0x3a0
[464968.699058] [<ffffffffa0832d12>] ? zio_issue_async+0x12/0x20 [zfs]
[464968.699275] [<ffffffffa083a9d1>] ? zio_nowait+0x121/0x320 [zfs]
[464968.699446] [<ffffffff810f973c>] ? ktime_get+0x3c/0xb0
[464968.699613] [<ffffffff816f00d6>] io_schedule_timeout+0xa6/0x110
[464968.699784] [<ffffffffa060a133>] cv_wait_common+0xb3/0x130 [spl]
[464968.699955] [<ffffffff810c63d0>] ? prepare_to_wait_event+0xf0/0xf0
[464968.700134] [<ffffffffa060a208>] __cv_wait_io+0x18/0x20 [spl]
[464968.700357] [<ffffffffa083a0b3>] zio_wait+0x1e3/0x3e0 [zfs]
[464968.700562] [<ffffffffa079b6ab>] dsl_pool_sync+0x11b/0x610 [zfs]
[464968.700785] [<ffffffffa07bf0d6>] spa_sync+0x456/0x1220 [zfs]
[464968.700952] [<ffffffff810ac0a2>] ? default_wake_function+0x12/0x20
[464968.701179] [<ffffffffa07d4f6c>] txg_sync_thread+0x2ec/0x570 [zfs]
[464968.701387] [<ffffffffa07d4c80>] ? txg_init+0x2c0/0x2c0 [zfs]
[464968.701557] [<ffffffffa0604f42>] thread_generic_wrapper+0x72/0x80 [spl]
[464968.701727] [<ffffffffa0604ed0>] ? __thread_exit+0x20/0x20 [spl]
[464968.701895] [<ffffffff8109f739>] kthread+0xd9/0xf0
[464968.702066] [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464968.702235] [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464968.702401] [<ffffffff816f5195>] ret_from_fork+0x25/0x30

The system is currently running but hung, and I can't leave it like this for long. The kernel is also a semi-debug build: it was built with frame pointers.
I have a crash dump from a kernel with this thread hung in the VERIFY() clause, but I don't know how to extract any useful data from the backtrace.
After talking with @DeHackEd on IRC, it turns out that he is using an Intel E5-2620 v4, and /proc/cpuinfo lists the hle/rtm flags, which indicate Intel's TSX-NI. That is known to be broken (from what I was told a year or two ago in #illumos) and is likely disabled by the latest BIOS update. I suggested that @DeHackEd update his BIOS. The CentOS 7 kernel likely supports TSX-NI. I think it is possible that we are looking at the result of a broken Intel TSX-NI implementation. We'll see if this goes away after his BIOS update presumably disables TSX-NI. Edit: I cannot find TSX-NI support code inside the kernel itself. I still suspect that a CPU erratum might be causing this, so a newer BIOS and the latest microcode could fix it.
A microcode update via early_ucode should strip the CPU of its TSX support. However, I ran into issues after doing so and had to recompile my entire system (Gentoo), so if anything relies on it (march=native, etc.) you might end up with a broken system. Caveat emptor! I also have a Haswell CPU, and with a microcode update, both cat /proc/cpuinfo | grep -i hle and cat /proc/cpuinfo | grep -i rtm return nothing.
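The grep checks above can be made a little stricter by matching whole flag names on the "flags" lines only, since a substring match could in principle hit an unrelated flag. A minimal Python sketch; tsx_flags is a hypothetical helper, and it only reports what the kernel advertises, not whether TSX actually works:

```python
def tsx_flags(cpuinfo_text):
    """Return the TSX-related flags (hle, rtm) listed in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Whole-word match, unlike `grep -i hle`.
            found.update(f for f in value.split() if f in ("hle", "rtm"))
    return found
```

Usage: with open("/proc/cpuinfo") as f: print(sorted(tsx_flags(f.read()))). An empty result means neither flag is advertised on any CPU.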
After the BIOS update I have a new CPU flag. I'll look into a specific microcode update for Linux to apply as well. Edit: microcode acquired; its date is later than the BIOS update. After installing it, the flags didn't change, though. I'll just let this run.
Crashed again anyway. Firmware updates didn't help.
kernel: VERIFY3(zio->io_children[c][w] == 0) failed (1 == 0)
Does this mean anything?
Had a crash. This time I grabbed a lot more debugging information, including a kernel core dump from a debug-enabled kernel with debug-enabled ZFS. What's odd is that the failed assertion (see above) turns out to be true after the fact. Possibly a race condition? Here's a gist with the object dump of the zio_t being examined at crash time, plus some notes on how I got the dump (because crash is inferior to GDB in spite of using GDB internally): https://gist.github.com/DeHackEd/e09f447fbdcdfa3d014f1236490d3715
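A VERIFY3 failure with "(1 == 0)" means one io_children counter still read 1 when zio_done() ran its final check. One interleaving that would make that check fail and yet leave the counter at 0 in a later core dump is sketched below as a deterministic Python model: a child is attached to the parent just after the parent has checked that child type, and the child finishes while the panicking thread sits blocked. io_children is modeled as a [child type][wait type] matrix named after the fields in the assertion; ZioModel, verify_no_children, replay, the type ordering, and "READY and DONE drop together" are simplifying assumptions, not the real zio.c logic.

```python
VDEV, GANG, DDT, LOGICAL = range(4)  # child types (order assumed)
READY, DONE = range(2)               # wait types


class ZioModel:
    """Toy model of a zio's io_children[child_type][wait_type] counters."""

    def __init__(self):
        self.io_children = [[0, 0] for _ in range(4)]

    def add_child(self, c):
        # A new child is pending for both the READY and DONE stages.
        self.io_children[c][READY] += 1
        self.io_children[c][DONE] += 1

    def child_done(self, c):
        # Simplification: the child reports READY and DONE together.
        self.io_children[c][READY] -= 1
        self.io_children[c][DONE] -= 1


def verify_no_children(zio):
    """Mimic zio_done()'s final check; return the first failure or None."""
    for c in range(4):
        for w in range(2):
            n = zio.io_children[c][w]
            if n != 0:
                return f"VERIFY3(zio->io_children[c][w] == 0) failed ({n} == 0)"
    return None


def replay():
    """Replay the interleaving; return (message at crash, message at dump)."""
    pio = ZioModel()
    pio.add_child(LOGICAL)                   # lio is pio's only child
    assert pio.io_children[VDEV][DONE] == 0  # pio's VDEV check passes
    pio.add_child(VDEV)                      # lio's callback issues cio under pio
    pio.child_done(LOGICAL)                  # lio detaches from pio
    # pio's remaining checks (GANG, DDT, LOGICAL) all see 0, so it proceeds.
    at_crash = verify_no_children(pio)
    pio.child_done(VDEV)                     # cio finishes while pio is stuck
    at_dump = verify_no_children(pio)
    return at_crash, at_dump
```

In this model the check fires with "(1 == 0)", while the counters examined afterwards are all 0. This matches the "true after the fact" observation without needing memory corruption.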
Going over my crash dump history, all crashes were either from L2ARC or from the metaslab code. Broadly speaking the L2ARC code crashes when |
Removing the L2ARC seems to have stabilized the system. I want to give it at least 10 days total (~6 days to go) before I jump for joy, though.
Okay, I'm calling it: removing the SSD fixed the problem. Now, what could cause a zio crash/ASSERT failure when L2ARC is involved? More specifically, in the metaslab/spacemap layer?
Just going over my old issues. This has completely gone away since the L2ARC device was removed. I'm using metadata allocation classes instead.
I am seeing what I think is the same bug after switching from an EL6-ish kernel to an EL7-ish one:
System information
Disabling L2ARC is not really an option for us. Any suggestions on what to do?
The kernel version thing is interesting. It suggests a kernel issue that could be bisected. If I can reproduce this consistently, I could do a bisect. Is your L2ARC hit rate really high enough that dropping it would ruin performance that badly?
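One way to answer the hit-rate question is to read the l2_hits and l2_misses counters from the ARC kstats. ZFS on Linux exposes these in /proc/spl/kstat/zfs/arcstats as "name type data" rows that follow a kstat header line and a column-header line. A minimal Python sketch; the parser assumes that three-column layout, and both function names are hypothetical:

```python
def parse_arcstats(text):
    """Parse arcstats-style 'name type data' rows into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # kstat header line or blank line
        name, _kstat_type, value = fields
        try:
            stats[name] = int(value)
        except ValueError:
            continue  # the 'name type data' column header
    return stats


def l2arc_hit_rate(stats):
    """Fraction of L2ARC lookups that hit, or None if there were none."""
    hits, misses = stats.get("l2_hits", 0), stats.get("l2_misses", 0)
    total = hits + misses
    return hits / total if total else None
```

Usage: with open("/proc/spl/kstat/zfs/arcstats") as f: print(l2arc_hit_rate(parse_arcstats(f.read()))). The counters are cumulative since module load, so sampling twice and taking the difference gives the rate over a specific window.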
I don't currently have the hardware in place to test (or reproduce) the issue, but given that this seems to involve the L2ARC/prefetch code, it could be explained by "Illumos 8857 - zio_remove_child() panic due to already destroyed parent zio" (fix PR: openzfs/openzfs#505).
That looks very promising. If that gets accepted, it should be added to our stable branch(es) as well.
…nt zio

PROBLEM
=======
It's possible for a parent zio to complete even though it has children which have not completed. This can result in the following panic:

> $C
ffffff01809128c0 vpanic()
ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0, ffffff3373370908)
ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
ffffff0180912b40 thread_start+8()

> ::status
debugging crash dump vmcore.2 (64-bit) from batfs0390
operating system: 5.11 joyent_20170911T171900Z (i86pc)
image uuid: (not set)
panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80 owner=ffffff3c59b39480 thread=ffffff0180912c40
dump content: kernel pages only

The problem is that dbuf_prefetch along with l2arc can create a zio tree which confuses the parent zio and allows it to complete while children still exist. Here's the scenario:

zio tree:
pio
 |--- lio

The parent zio, pio, has entered the zio_done stage and begins to check its children to see whether there are still some that have not completed. In zio_done(), the children are checked in the following order:

zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)

If pio finds any child which has not completed, it stops executing and goes to sleep. Each call to zio_wait_for_children() grabs the io_lock while checking that particular child type.

In this scenario, pio has completed the first call to zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since the only zio in the zio tree right now is the logical zio, lio, it completes that call and prepares to check the next child type.

In the meantime, lio completes and in its callback creates a child vdev zio, cio. The zio tree looks like this:

zio tree:
pio
 |--- lio
 |--- cio

lio then grabs the parent's io_lock and removes itself:

zio tree:
pio
 |--- cio

pio continues to run, but it has already completed its check for ZIO_CHILD_VDEV and will erroneously complete. When the child zio, cio, completes, it will panic the system trying to reference the parent zio, which has been destroyed.

SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield for all the child types that it's interested in checking. The io_lock is held the entire time we check all the child types. Since the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided to convert a ZIO_CHILD type into the bitfield used by the zio_wait_for_children() logic.

Authored by: George Wilson
Reviewed by: Matthew Ahrens
Reviewed by: Andriy Gapon
Reviewed by: Youzhong Yang
Reviewed by: Brian Behlendorf
Approved by: Dan McDonald
Ported-by: Giuseppe Di Natale
OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: openzfs/openzfs@862ff6d99c
Issue #5918
Closes #7168
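The difference between the old per-type checks and the bitfield version can be shown with a deterministic Python model. Here the `concurrent` callback stands in for whatever another thread does while io_lock is not held. The model is a simplification (no real locking or sleeping): zio_child_bit() mirrors ZIO_CHILD_BIT() on the assumption that it is (1 << type), and ALL_CHILD_BITS, lio_callback, old_zio_done and new_zio_done are hypothetical names.

```python
VDEV, GANG, DDT, LOGICAL = range(4)  # child types, in the order zio_done() checks them
ALL_TYPES = (VDEV, GANG, DDT, LOGICAL)


def zio_child_bit(c):
    """Mirror of the ZIO_CHILD_BIT() conversion, assumed to be (1 << c)."""
    return 1 << c


# Hypothetical name for "every child type" expressed as a bitfield.
ALL_CHILD_BITS = sum(zio_child_bit(c) for c in ALL_TYPES)


def lio_callback(pending):
    """lio completes: its callback adds cio as a VDEV child of pio, then lio detaches."""
    pending[VDEV] += 1
    pending[LOGICAL] -= 1


def old_zio_done(pending, concurrent):
    """Pre-fix behaviour: one check per child type, io_lock dropped in between."""
    for i, c in enumerate(ALL_TYPES):
        if pending[c] > 0:
            return "wait"
        if i == 0:
            # The lock is not held between calls, so another thread may run here.
            concurrent(pending)
    return "done"


def new_zio_done(pending, childbits):
    """Post-fix behaviour: every requested type checked under one io_lock hold."""
    for c in ALL_TYPES:
        if childbits & zio_child_bit(c) and pending[c] > 0:
            return "wait"
    return "done"
```

In the model, old_zio_done() reports "done" while cio is still pending, which is the state that later panics. new_zio_done() reports "wait" whether lio's callback runs before or after the single locked check, because the callback can no longer land between two of the per-type checks.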
The OpenZFS fix for this issue has been merged. If anyone has been able to consistently reproduce this issue, it would be great if you could verify that it's now resolved in master. This fix has been added to the list of proposed changes for the next 0.7 release.
Unfortunately not me. That system is now running metadata allocation classes instead, so there's no SSD space left for L2ARC.
Slight correction: another machine is exhibiting the same symptoms with an L2ARC, so maybe I can. (I absentmindedly removed the L2ARC as a fix before I remembered this issue was still outstanding.) But this will likely take some time: it's around a month between crashes, and I'll have to get approval from above before I leave a machine in a potentially crash-prone state.
Reporting general stability after 24 days. I'm willing to call this "probably fixed" at this point. Will reopen if another crash happens.
(Edit: title was changed)
System information
Describe the problem you're observing
I've been experiencing intermittent kernel crashes lately. The kernel panics, and a hard reboot is required. The stack trace is below.
Describe how to reproduce the problem
The only common factor is a pool that's just over 80% full. Purging a large amount of space usually makes it stable for an extended period.
The pool receives nearly continuous writes of around 150-200 MB/s. The files written are roughly 1 to 4 MB each on average.
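For a sense of scale, that write rate works out to roughly 37 to 200 files per second and about 13 to 17 TB written per day. A quick Python calculation, assuming decimal megabytes and a steady rate; write_load is a hypothetical helper:

```python
SECONDS_PER_DAY = 86_400


def write_load(rate_mb_s, file_mb):
    """Return (files per second, TB written per day) for a steady write rate."""
    files_per_s = rate_mb_s / file_mb
    tb_per_day = rate_mb_s * SECONDS_PER_DAY / 1_000_000  # MB -> TB (decimal)
    return files_per_s, tb_per_day
```

The extremes are write_load(150, 4), which gives 37.5 files/s and 12.96 TB/day, and write_load(200, 1), which gives 200 files/s and 17.28 TB/day.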
/etc/modprobe.d/zfs.conf:
Pool layout:
Include any warning/errors/backtraces from the system logs
The output was reassembled from a netconsole capture, so the line wrapping had to be redone manually. It's not perfect.
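Re-wrapping a flattened capture like this can be partly automated by splitting the text before each printk timestamp. A minimal Python sketch, assuming every record starts with a [seconds.microseconds] timestamp, as in the logs in this thread; text between timestamps, including the capture's own stray line breaks, stays attached to the preceding record. split_records is a hypothetical helper.

```python
import re

# A printk timestamp such as "[603767.293140]"; leading spaces are allowed.
TIMESTAMP_RE = re.compile(r"\[\s*\d+\.\d{6}\]")


def split_records(captured):
    """Re-split a flattened or arbitrarily wrapped kernel log, one record per timestamp.

    All whitespace runs (including the capture's own line breaks) are
    collapsed to single spaces first, so column alignment is not preserved.
    """
    flat = " ".join(captured.split())
    starts = [m.start() for m in TIMESTAMP_RE.finditer(flat)]
    records = []
    head = flat[: starts[0]] if starts else flat
    if head.strip():
        records.append(head.strip())  # text before the first timestamp
    for begin, end in zip(starts, starts[1:] + [len(flat)]):
        records.append(flat[begin:end].strip())
    return records
```

Printing "\n".join(split_records(text)) gives one line per record. Hex addresses such as [<ffffffffa18101cf>] and tags like [zfs] do not match the timestamp pattern, so they are left inside their records.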