Nasty deadlock on zpool import #4006

Closed
mihaivint opened this issue Nov 11, 2015 · 9 comments

@mihaivint

I have a new issue today. One of the nodes locked up after an update. Not sure what triggered it, as the other nodes didn't have the same issue. Anyway, this node was on 0.6.5.1; I've now updated to 0.6.5.3, but the behavior is the same: ZFS is not able to import the pool and everything hangs.
The environment is a VMware VM:
free -m
total used free shared buff/cache available
Mem: 7823 474 7050 8 298 5877
Swap: 4095 0 4095
cat /proc/cpuinfo |grep -ic process
4
uname -a
Linux f34bd24f63.es.private.redlight.hubgets.com 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 19:10:07 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[  145.183846] CPU: 2 PID: 0 Comm: swapper/2 Tainted: PF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[  145.183847] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[  145.183848] task: ffff8802341b5b00 ti: ffff880234200000 task.ti: ffff880234200000
[  145.183850] RIP: 0010:[<ffffffff81052caa>]  [<ffffffff81052caa>] native_write_msr_safe+0xa/0x10
[  145.183856] RSP: 0018:ffff88023fd03d80  EFLAGS: 00000046
[  145.183857] RAX: 0000000000000400 RBX: 0000000000000002 RCX: 0000000000000830
[  145.183858] RDX: 0000000000000002 RSI: 0000000000000400 RDI: 0000000000000830
[  145.183859] RBP: ffff88023fd03d80 R08: ffffffff81a21700 R09: 000000000000059c
[  145.183860] R10: 61206f7420494d4e R11: 3a73555043206c6c R12: ffffffff81a21700
[  145.183861] R13: 0000000000000002 R14: 000000000000a022 R15: 0000000000000002
[  145.183863] FS:  0000000000000000(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
[  145.183864] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  145.183865] CR2: 00007f39bf015f90 CR3: 00000002327b2000 CR4: 00000000001407e0
[  145.183893] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  145.183907] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  145.183907] Stack:
[  145.183908]  ffff88023fd03dd0 ffffffff8104a322 0000000000000086 0000000000000002
[  145.183911]  000800003fd03db0 0000000000002710 ffffffff819626c0 ffffffff819626c0
[  145.183913]  ffff88023fd0de00 ffffffff819627c0 ffff88023fd03de0 ffffffff8104a36c
[  145.183915] Call Trace:
[  145.183916]  <IRQ>

[  145.183921]  [<ffffffff8104a322>] __x2apic_send_IPI_mask+0xb2/0xe0
[  145.183924]  [<ffffffff8104a36c>] x2apic_send_IPI_all+0x1c/0x20
[  145.183928]  [<ffffffff81045e85>] arch_trigger_all_cpu_backtrace+0x65/0xa0
[  145.183931]  [<ffffffff81115d28>] rcu_check_callbacks+0x5a8/0x5f0
[  145.183934]  [<ffffffff81080f47>] update_process_times+0x47/0x80
[  145.183938]  [<ffffffff810d0525>] tick_sched_handle.isra.16+0x25/0x60
[  145.183939]  [<ffffffff810d05a1>] tick_sched_timer+0x41/0x60
[  145.183943]  [<ffffffff8109b1b7>] __run_hrtimer+0x77/0x1d0
[  145.183945]  [<ffffffff810d0560>] ? tick_sched_handle.isra.16+0x60/0x60
[  145.183948]  [<ffffffff8109b9f7>] hrtimer_interrupt+0xf7/0x240
[  145.183950]  [<ffffffff810441c7>] local_apic_timer_interrupt+0x37/0x60
[  145.183953]  [<ffffffff8161698f>] smp_apic_timer_interrupt+0x3f/0x60
[  145.183957]  [<ffffffff8161505d>] apic_timer_interrupt+0x6d/0x80
[  145.183958]  <EOI>

[  145.183960]  [<ffffffff8109b838>] ? hrtimer_start+0x18/0x20
[  145.183963]  [<ffffffff81052de6>] ? native_safe_halt+0x6/0x10
[  145.183966]  [<ffffffff8101c85f>] default_idle+0x1f/0xc0
[  145.183969]  [<ffffffff8101d166>] arch_cpu_idle+0x26/0x30
[  145.183972]  [<ffffffff810c6921>] cpu_startup_entry+0xf1/0x290
[  145.183974]  [<ffffffff8104228a>] start_secondary+0x1ba/0x230
[  145.183975] Code: 00 55 89 f9 48 89 e5 0f 32 31 c9 89 c0 48 c1 e2 20 89 0e 48 09 c2 48 89 d0 5d c3 66 0f 1f 44 00 00 55 89 f0 89 f9 48 89 e5 0f 30 <31> c0 5d c3 66 90 55 89 f9 48 89 e5 0f 33 89 c0 48 c1 e2 20 48
[  145.183994] NMI backtrace for cpu 1
[  145.183999] CPU: 1 PID: 0 Comm: swapper/1 Tainted: PF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[  145.184001] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[  145.184002] task: ffff8802341b4fa0 ti: ffff8802341fc000 task.ti: ffff8802341fc000
[  145.184004] RIP: 0010:[<ffffffff81052de6>]  [<ffffffff81052de6>] native_safe_halt+0x6/0x10
[  145.184008] RSP: 0018:ffff8802341ffe90  EFLAGS: 00000286
[  145.184010] RAX: 00000000ffffffed RBX: ffff8802341fffd8 RCX: 0100000000000000
[  145.184011] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
[  145.184012] RBP: ffff8802341ffe90 R08: 0000000000000000 R09: 0000000000000000
[  145.184013] R10: 0000000000000000 R11: ffff8800b8cabf30 R12: 0000000000000001
[  145.184013] R13: ffffffff81a21700 R14: 0000000000000000 R15: ffff8802341fffd8
[  145.184015] FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
[  145.184016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  145.184017] CR2: 00007f39befa81c4 CR3: 00000002327b2000 CR4: 00000000001407e0
[  145.184036] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  145.184050] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  145.184051] Stack:
[  145.184052]  ffff8802341ffeb0 ffffffff8101c85f ffff8802341fffd8 0000000000000000
[  145.184054]  ffff8802341ffec0 ffffffff8101d166 ffff8802341fff20 ffffffff810c6921
[  145.184056]  ffff8802341fffd8 ffff8802341fffd8 ffff8802341fffd8 f1fa897016c5058b
[  145.184058] Call Trace:
[  145.184062]  [<ffffffff8101c85f>] default_idle+0x1f/0xc0
[  145.184064]  [<ffffffff8101d166>] arch_cpu_idle+0x26/0x30
[  145.184067]  [<ffffffff810c6921>] cpu_startup_entry+0xf1/0x290
[  145.184070]  [<ffffffff8104228a>] start_secondary+0x1ba/0x230
[  145.184071] Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
[  145.184090] NMI backtrace for cpu 0
[  145.184095] CPU: 0 PID: 4241 Comm: mount.zfs Tainted: PF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[  145.184096] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[  145.184098] task: ffff88022db3cfa0 ti: ffff88021a394000 task.ti: ffff88021a394000
[  145.184099] RIP: 0010:[<ffffffff8160b8ca>]  [<ffffffff8160b8ca>] _raw_spin_lock_irq+0x3a/0x60
[  145.184106] RSP: 0018:ffff88021a3975a8  EFLAGS: 00000097
[  145.184107] RAX: 00000000000060f2 RBX: ffff88022db3cfa0 RCX: 00000000000004dd
[  145.184108] RDX: 0000000000000004 RSI: 0000000000000004 RDI: ffff880231e7d0b8
[  145.184109] RBP: ffff88021a3975a8 R08: 0000000000000202 R09: 0000000000000000
[  145.184110] R10: ffffea0008c69000 R11: ffffffffa000a75a R12: ffff880231e7d0b0
[  145.184112] R13: ffffffffffffffff R14: ffff880231e7d0b8 R15: ffff880231e7d000
[  145.184114] FS:  00007f0561644780(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
[  145.184115] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  145.184116] CR2: 00007fa6ede870a0 CR3: 0000000232113000 CR4: 00000000001407f0
[  145.184134] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  145.184148] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  145.184149] Stack:
[  145.184150]  ffff88021a397618 ffffffff8160b313 ffff8802321a9ea8 ffffffffa0181b40
[  145.184152]  ffff8802191ebbc0 ffff88022db3cfa0 ffff880200000001 ffff8802321a9ea8
[  145.184154]  000000003ffbca70 ffff88021a3976a8 ffff880231e7d0b0 0000000000000002
[  145.184156] Call Trace:
[  145.184160]  [<ffffffff8160b313>] rwsem_down_read_failed+0x53/0x165
[  145.184179]  [<ffffffff812e2e54>] call_rwsem_down_read_failed+0x14/0x30
[  145.184190]  [<ffffffffa000a75a>] ? spl_kmem_free+0x2a/0x40 [spl]
[  145.184192]  [<ffffffff81608c90>] ? down_read+0x20/0x30
[  145.184227]  [<ffffffffa013234c>] zap_get_leaf_byblk+0xec/0x2c0 [zfs]
[  145.184230]  [<ffffffff81607c72>] ? mutex_lock+0x12/0x2f
[  145.184253]  [<ffffffffa013259a>] zap_deref_leaf+0x7a/0xa0 [zfs]
[  145.184275]  [<ffffffffa01342fd>] fzap_cursor_retrieve+0x13d/0x2d0 [zfs]
[  145.184298]  [<ffffffffa0137ca4>] zap_cursor_retrieve+0x64/0x320 [zfs]
[  145.184302]  [<ffffffff8126df2a>] ? selinux_inode_alloc_security+0x3a/0xa0
[  145.184324]  [<ffffffffa0142323>] zfs_purgedir+0x93/0x220 [zfs]
[  145.184346]  [<ffffffffa010a7fc>] ? sa_lookup+0x9c/0xc0 [zfs]
[  145.184370]  [<ffffffffa01608f4>] ? zfs_inode_update+0x1a4/0x1b0 [zfs]
[  145.184373]  [<ffffffff81098235>] ? wake_up_bit+0x25/0x30
[  145.184377]  [<ffffffff811e1290>] ? unlock_new_inode+0x50/0x70
[  145.184398]  [<ffffffffa0160d37>] ? zfs_znode_alloc+0x437/0x550 [zfs]
[  145.184420]  [<ffffffffa0142722>] zfs_rmnode+0x272/0x350 [zfs]
[  145.184423]  [<ffffffff81607c72>] ? mutex_lock+0x12/0x2f
[  145.184444]  [<ffffffffa0163d68>] zfs_zinactive+0x168/0x180 [zfs]
[  145.184465]  [<ffffffffa015d4a7>] zfs_inactive+0x67/0x240 [zfs]
[  145.184469]  [<ffffffff81166979>] ? truncate_pagecache+0x59/0x60
[  145.184490]  [<ffffffffa0174a43>] zpl_evict_inode+0x43/0x60 [zfs]
[  145.184494]  [<ffffffff811e2157>] evict+0xa7/0x170
[  145.184497]  [<ffffffff811e2995>] iput+0xf5/0x180
[  145.184517]  [<ffffffffa01417f5>] zfs_unlinked_drain+0xb5/0xf0 [zfs]
[  145.184521]  [<ffffffff810980f6>] ? finish_wait+0x56/0x70
[  145.184526]  [<ffffffffa000e2a8>] ? taskq_create+0x228/0x370 [spl]
[  145.184529]  [<ffffffff81098240>] ? wake_up_bit+0x30/0x30
[  145.184550]  [<ffffffffa015ebf0>] ? zfs_get_done+0x70/0x70 [zfs]
[  145.184570]  [<ffffffffa0165e72>] ? zil_open+0x42/0x60 [zfs]
[  145.184591]  [<ffffffffa0155778>] zfs_sb_setup+0x138/0x170 [zfs]
[  145.184612]  [<ffffffffa0156523>] zfs_domount+0x2d3/0x360 [zfs]
[  145.184633]  [<ffffffffa0174b60>] ? zpl_kill_sb+0x20/0x20 [zfs]
[  145.184652]  [<ffffffffa0174b8c>] zpl_fill_super+0x2c/0x40 [zfs]
[  145.184656]  [<ffffffff811c9e9d>] mount_nodev+0x4d/0xb0
[  145.184675]  [<ffffffffa0175062>] zpl_mount+0x52/0x80 [zfs]
[  145.184679]  [<ffffffff811ca9e9>] mount_fs+0x39/0x1b0
[  145.184682]  [<ffffffff811e604f>] vfs_kern_mount+0x5f/0xf0
[  145.184684]  [<ffffffff811e859e>] do_mount+0x24e/0xa40
[  145.184688]  [<ffffffff8115b53e>] ? __get_free_pages+0xe/0x50
[  145.184690]  [<ffffffff811e8e26>] SyS_mount+0x96/0xf0
[  145.184693]  [<ffffffff81614409>] system_call_fastpath+0x16/0x1b
[  145.184694] Code: 00 b8 00 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 0f 1f 80 00 00 00
[  145.184714] NMI backtrace for cpu 3
[  145.184719] CPU: 3 PID: 0 Comm: swapper/3 Tainted: PF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[  145.184720] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[  145.184723] task: ffff8802341b6660 ti: ffff880234208000 task.ti: ffff880234208000
[  145.184724] RIP: 0010:[<ffffffff81052de6>]  [<ffffffff81052de6>] native_safe_halt+0x6/0x10
[  145.184728] RSP: 0018:ffff88023420be90  EFLAGS: 00000286
[  145.184730] RAX: 00000000ffffffed RBX: ffff88023420bfd8 RCX: 0100000000000000
[  145.184731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
[  145.184732] RBP: ffff88023420be90 R08: 0000000000000000 R09: 0000000000000000
[  145.184732] R10: 0000000000000000 R11: ffff8800bbbc7f30 R12: 0000000000000003
[  145.184733] R13: ffffffff81a21700 R14: 0000000000000000 R15: ffff88023420bfd8
[  145.184736] FS:  0000000000000000(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000
[  145.184737] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  145.184738] CR2: 00007f39bf9a3000 CR3: 00000002327b2000 CR4: 00000000001407e0
[  145.184757] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  145.184771] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  145.184772] Stack:
[  145.184773]  ffff88023420beb0 ffffffff8101c85f ffff88023420bfd8 0000000000000000
[  145.184775]  ffff88023420bec0 ffffffff8101d166 ffff88023420bf20 ffffffff810c6921
[  145.184777]  ffff88023420bfd8 ffff88023420bfd8 ffff88023420bfd8 e8892c2df8f49441
[  145.184779] Call Trace:
[  145.184782]  [<ffffffff8101c85f>] default_idle+0x1f/0xc0
[  145.184785]  [<ffffffff8101d166>] arch_cpu_idle+0x26/0x30
[  145.184787]  [<ffffffff810c6921>] cpu_startup_entry+0xf1/0x290
[  145.184791]  [<ffffffff8104228a>] start_secondary+0x1ba/0x230
[  145.184792] Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
[  172.637158] ------------[ cut here ]------------
[  172.637172] WARNING: at net/sched/sch_generic.c:259 dev_watchdog+0x270/0x280()
[  172.637174] NETDEV WATCHDOG: eth0 (vmxnet3): transmit queue 0 timed out
[  172.637176] Modules linked in: binfmt_misc rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ppdev ghash_clmulni_intel cryptd vmw_balloon serio_raw pcspkr parport_pc shpchp parport i2c_piix4 vmw_vmci xfs libcrc32c sd_mod sr_mod cdrom crc_t10dif ata_generic crct10dif_common pata_acpi vmwgfx drm_kms_helper ttm vmxnet3 drm ata_piix vmw_pvscsi libata i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod zfs(POF) zunicode(POF) zavl(POF) zcommon(POF) znvpair(POF) spl(OF) zlib_deflate
[  172.637208] CPU: 3 PID: 0 Comm: swapper/3 Tainted: PF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[  172.637211] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[  172.637213]  ffff88023fd83d88 e8892c2df8f49441 ffff88023fd83d40 ffffffff816045b6
[  172.637216]  ffff88023fd83d78 ffffffff8106e29b 0000000000000000 ffff880232800000
[  172.637219]  ffff88022f662440 0000000000000004 0000000000000003 ffff88023fd83de0
[  172.637222] Call Trace:
[  172.637224]  <IRQ>  [<ffffffff816045b6>] dump_stack+0x19/0x1b
[  172.637235]  [<ffffffff8106e29b>] warn_slowpath_common+0x6b/0xb0
[  172.637238]  [<ffffffff8106e33c>] warn_slowpath_fmt+0x5c/0x80
[  172.637242]  [<ffffffff81099de1>] ? run_posix_cpu_timers+0x51/0x840
[  172.637246]  [<ffffffff8151d830>] dev_watchdog+0x270/0x280
[  172.637249]  [<ffffffff8151d5c0>] ? dev_graft_qdisc+0x80/0x80
[  172.637254]  [<ffffffff8107df66>] call_timer_fn+0x36/0x110
[  172.637257]  [<ffffffff8151d5c0>] ? dev_graft_qdisc+0x80/0x80
[  172.637260]  [<ffffffff8107fddf>] run_timer_softirq+0x21f/0x320
[  172.637263]  [<ffffffff81077b3f>] __do_softirq+0xef/0x280
[  172.637266]  [<ffffffff81615d1c>] call_softirq+0x1c/0x30
[  172.637272]  [<ffffffff81015d95>] do_softirq+0x65/0xa0
[  172.637275]  [<ffffffff81077ed5>] irq_exit+0x115/0x120
[  172.637278]  [<ffffffff81616995>] smp_apic_timer_interrupt+0x45/0x60
[  172.637282]  [<ffffffff8161505d>] apic_timer_interrupt+0x6d/0x80
[  172.637283]  <EOI>  [<ffffffff8109b838>] ? hrtimer_start+0x18/0x20
[  172.637290]  [<ffffffff81052de6>] ? native_safe_halt+0x6/0x10
[  172.637293]  [<ffffffff8101c85f>] default_idle+0x1f/0xc0
[  172.637296]  [<ffffffff8101d166>] arch_cpu_idle+0x26/0x30
[  172.637301]  [<ffffffff810c6921>] cpu_startup_entry+0xf1/0x290
[  172.637305]  [<ffffffff8104228a>] start_secondary+0x1ba/0x230
[  172.637308] ---[ end trace 93cb2e24d4d03d76 ]---
[  172.637312] vmxnet3 0000:0b:00.0 eth0: tx hang
[  172.644382] vmxnet3 0000:0b:00.0 eth0: resetting
[  172.651711] vmxnet3 0000:0b:00.0 eth0: intr type 3, mode 0, 5 vectors allocated
[  172.652817] vmxnet3 0000:0b:00.0 eth0: NIC Link is Up 10000 Mbps
[  257.579673] vmxnet3 0000:0b:00.0 eth0: tx hang
[  257.580025] vmxnet3 0000:0b:00.0 eth0: resetting
[  257.587729] vmxnet3 0000:0b:00.0 eth0: intr type 3, mode 0, 5 vectors allocated
[  257.588981] vmxnet3 0000:0b:00.0 eth0: NIC Link is Up 10000 Mbps
@mihaivint
Author

I had to boot without ZFS and then perform the zpool import manually; this is how I obtained the output:
zpool import
pool: data
id: 14971036982025492324
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

    data        ONLINE
      sdb       ONLINE

zpool import -f data

@mihaivint
Author

And here is one without SELinux enabled, so that is not at fault:

[  214.227459] NMI backtrace for cpu 3
[  214.227463] CPU: 3 PID: 4432 Comm: mount.zfs Tainted: PF          O--------------   3.10.0-229.20.1.el7.x86_64 #1
[  214.227465] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
[  214.227466] task: ffff8802322f6660 ti: ffff88021ace0000 task.ti: ffff88021ace0000
[  214.227468] RIP: 0010:[<ffffffff8160b8ca>]  [<ffffffff8160b8ca>] _raw_spin_lock_irq+0x3a/0x60
[  214.227474] RSP: 0018:ffff88021ace35a8  EFLAGS: 00000097
[  214.227475] RAX: 0000000000001087 RBX: ffff8802322f6660 RCX: 00000000000004dd
[  214.227476] RDX: 0000000000000004 RSI: 0000000000000004 RDI: ffff88023005d8b8
[  214.227477] RBP: ffff88021ace35a8 R08: 0000000000000202 R09: 0000000000000000
[  214.227478] R10: ffffea0008cb8000 R11: ffffffffa000a75a R12: ffff88023005d8b0
[  214.227479] R13: ffffffffffffffff R14: ffff88023005d8b8 R15: ffff88023005d800
[  214.227481] FS:  00007f9692f5e780(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000
[  214.227483] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  214.227484] CR2: 00007f969225bc10 CR3: 00000002326d3000 CR4: 00000000001407e0
[  214.227502] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  214.227516] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  214.227517] Stack:
[  214.227518]  ffff88021ace3618 ffffffff8160b313 ffff8802327c6c48 ffffffffa0181b40
[  214.227521]  ffff8802196c6000 ffff8802322f6660 ffff880200000001 ffff8802327c6c48
[  214.227523]  0000000005000266 ffff88021ace36a8 ffff88023005d8b0 0000000000000002
[  214.227525] Call Trace:
[  214.227529]  [<ffffffff8160b313>] rwsem_down_read_failed+0x53/0x165
[  214.227558]  [<ffffffff812e2e54>] call_rwsem_down_read_failed+0x14/0x30
[  214.227565]  [<ffffffffa000a75a>] ? spl_kmem_free+0x2a/0x40 [spl]
[  214.227569]  [<ffffffff81608c90>] ? down_read+0x20/0x30
[  214.227602]  [<ffffffffa013234c>] zap_get_leaf_byblk+0xec/0x2c0 [zfs]
[  214.227606]  [<ffffffff81607c72>] ? mutex_lock+0x12/0x2f
[  214.227633]  [<ffffffffa013259a>] zap_deref_leaf+0x7a/0xa0 [zfs]
[  214.227660]  [<ffffffffa01342fd>] fzap_cursor_retrieve+0x13d/0x2d0 [zfs]
[  214.227688]  [<ffffffffa0137ca4>] zap_cursor_retrieve+0x64/0x320 [zfs]
[  214.227714]  [<ffffffffa0142323>] zfs_purgedir+0x93/0x220 [zfs]
[  214.227740]  [<ffffffffa010a7fc>] ? sa_lookup+0x9c/0xc0 [zfs]
[  214.227766]  [<ffffffffa01608f4>] ? zfs_inode_update+0x1a4/0x1b0 [zfs]
[  214.227770]  [<ffffffff81098235>] ? wake_up_bit+0x25/0x30
[  214.227773]  [<ffffffff811e1290>] ? unlock_new_inode+0x50/0x70
[  214.227798]  [<ffffffffa0160d37>] ? zfs_znode_alloc+0x437/0x550 [zfs]
[  214.227825]  [<ffffffffa0142722>] zfs_rmnode+0x272/0x350 [zfs]
[  214.227828]  [<ffffffff81607c72>] ? mutex_lock+0x12/0x2f
[  214.227853]  [<ffffffffa0163d68>] zfs_zinactive+0x168/0x180 [zfs]
[  214.227879]  [<ffffffffa015d4a7>] zfs_inactive+0x67/0x240 [zfs]
[  214.227882]  [<ffffffff81166979>] ? truncate_pagecache+0x59/0x60
[  214.227907]  [<ffffffffa0174a43>] zpl_evict_inode+0x43/0x60 [zfs]
[  214.227912]  [<ffffffff811e2157>] evict+0xa7/0x170
[  214.227914]  [<ffffffff811e2995>] iput+0xf5/0x180
[  214.227940]  [<ffffffffa01417f5>] zfs_unlinked_drain+0xb5/0xf0 [zfs]
[  214.227943]  [<ffffffff810980f6>] ? finish_wait+0x56/0x70
[  214.227949]  [<ffffffffa000e2a8>] ? taskq_create+0x228/0x370 [spl]
[  214.227952]  [<ffffffff81098240>] ? wake_up_bit+0x30/0x30
[  214.227977]  [<ffffffffa015ebf0>] ? zfs_get_done+0x70/0x70 [zfs]
[  214.228002]  [<ffffffffa0165e72>] ? zil_open+0x42/0x60 [zfs]
[  214.228028]  [<ffffffffa0155778>] zfs_sb_setup+0x138/0x170 [zfs]
[  214.228053]  [<ffffffffa0156523>] zfs_domount+0x2d3/0x360 [zfs]
[  214.228078]  [<ffffffffa0174b60>] ? zpl_kill_sb+0x20/0x20 [zfs]
[  214.228104]  [<ffffffffa0174b8c>] zpl_fill_super+0x2c/0x40 [zfs]
[  214.228107]  [<ffffffff811c9e9d>] mount_nodev+0x4d/0xb0
[  214.228132]  [<ffffffffa0175062>] zpl_mount+0x52/0x80 [zfs]
[  214.228136]  [<ffffffff811ca9e9>] mount_fs+0x39/0x1b0
[  214.228139]  [<ffffffff811e604f>] vfs_kern_mount+0x5f/0xf0
[  214.228141]  [<ffffffff811e859e>] do_mount+0x24e/0xa40
[  214.228145]  [<ffffffff8115b53e>] ? __get_free_pages+0xe/0x50
[  214.228148]  [<ffffffff811e8e26>] SyS_mount+0x96/0xf0
[  214.228151]  [<ffffffff81614409>] system_call_fastpath+0x16/0x1b

@tuxoko
Contributor

tuxoko commented Nov 13, 2015

@mihaivint
What ZFS version was this pool created with?

We seem to have had similar issues in the past, but I'm not quite sure whether they have been fixed.

@mihaivint
Author

I'm not exactly sure whether it was created directly on 0.6.5.1, or on 0.6.4.1 and then upgraded.

@neingeist

Same here with my pool: #3814
My pool was created with 0.6.4 or earlier.

@mihaivint
Author

At least there is a way to recover the data. In this case there was no need, as I had a replica of the data, but the read-only mount is good to know about.

@behlendorf
Contributor

@mihaivint can you apply the patches in #4123? They are safe and should resolve the problem. Definitely let us know if that doesn't fix things.

@mihaivint
Author

Unfortunately I don't have that drive to test this, so I can't provide additional info.

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Dec 28, 2015
We need truncate and remove to be in the same tx when doing zfs_rmnode on an
xattr dir. Otherwise, if we truncate and then crash, we'll end up with an
inconsistent zap object on the delete queue. We do this by skipping
dmu_free_long_range and letting zfs_znode_delete do the work.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#4114
Issue openzfs#4052
Issue openzfs#4006
Issue openzfs#3018
Issue openzfs#2861
nedbass pushed a commit that referenced this issue Dec 31, 2015
During zfs_rmnode on an xattr dir, if the system crashes just after
dmu_free_long_range, we would get an empty xattr dir in the delete queue. This
would cause blkid=0 to be passed into zap_get_leaf_byblk when doing
zfs_purgedir during mount, which would then do rw_enter on the wrong structure
and cause a system lockup.

We fix this by returning ENOENT when blkid is zero in zap_get_leaf_byblk.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #4114
Closes #4052
Closes #4006
Closes #3018
Closes #2861
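
To make the guard in the commit above concrete, here is a small, self-contained C sketch of the idea. It compiles and runs on its own, but the function is a simplified stand-in: per the commit message, the real change returns ENOENT from zap_get_leaf_byblk() when blkid is zero, inside the ZFS kernel module rather than a userland program.

```c
/*
 * Illustration of the blkid == 0 guard described in the commit above.
 * The helper name here is hypothetical; the real check lives in
 * zap_get_leaf_byblk() and simply returns ENOENT.  Plain errno values
 * are used so the sketch compiles standalone.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for looking up a fat-zap leaf by block id. */
static int
get_leaf_byblk(uint64_t blkid)
{
	/*
	 * blkid == 0 is what an empty (truncated, then crashed) xattr dir
	 * on the delete queue hands us when zfs_purgedir() runs at mount
	 * time.  Returning ENOENT lets the caller skip the entry instead
	 * of doing rw_enter() on something that is not a leaf and hanging.
	 */
	if (blkid == 0)
		return (ENOENT);

	/* normal leaf lookup would follow here */
	return (0);
}

int
main(void)
{
	printf("blkid=0 -> %s\n", get_leaf_byblk(0) == ENOENT ? "ENOENT" : "ok");
	printf("blkid=7 -> %s\n", get_leaf_byblk(7) == ENOENT ? "ENOENT" : "ok");
	return (0);
}
```
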
nedbass pushed a commit that referenced this issue Dec 31, 2015
We need truncate and remove to be in the same tx when doing zfs_rmnode on an
xattr dir. Otherwise, if we truncate and then crash, we'll end up with an
inconsistent zap object on the delete queue. We do this by skipping
dmu_free_long_range and letting zfs_znode_delete do the work.

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #4114
Issue #4052
Issue #4006
Issue #3018
Issue #2861
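
The other commit (truncate and remove in one tx) can be sketched the same way. The names below are hypothetical stand-ins; in ZFS the decision is made inside zfs_rmnode(), with the directory case truncated together with its removal by zfs_znode_delete(), so treat this purely as an illustration of the ordering argument, not the actual patch.

```c
/*
 * Illustration of the "skip the early dmu_free_long_range() for xattr
 * dirs" decision described in the commit above.  All names are
 * stand-ins for the real zfs_rmnode()/zfs_znode_delete() logic.
 */
#include <stdbool.h>
#include <stdio.h>

enum obj_type { OBJ_REGULAR_FILE, OBJ_XATTR_DIR };

/*
 * Pre-truncating outside the removal tx is only harmless for regular
 * files: a crash in between just leaves a shorter file of user data.
 * For an xattr dir it leaves an empty zap object still sitting on the
 * delete queue, which is exactly what zfs_purgedir() trips over at the
 * next mount.
 */
static bool
pretruncate_outside_tx(enum obj_type t)
{
	return (t == OBJ_REGULAR_FILE);
}

static void
rmnode(enum obj_type t)
{
	if (pretruncate_outside_tx(t))
		printf("free file data now, remove the object in a later tx\n");
	else
		printf("skip the pre-truncate; truncate + remove in one tx\n");
}

int
main(void)
{
	rmnode(OBJ_REGULAR_FILE);
	rmnode(OBJ_XATTR_DIR);
	return (0);
}
```
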
ryao pushed both fixes to ryao/zfs on Jan 4, 2016 (same two commit messages as above).
@neingeist

The fixes also worked for me.

kernelOfTruth pushed both fixes to kernelOfTruth/zfs on Jan 8, 2016 (same two commit messages as above).
goulvenriou pushed both fixes to Alyseo/zfs on Jan 17, 2016 (same two commit messages as above).
goulvenriou pushed both fixes to Alyseo/zfs again on Feb 4, 2016 (same two commit messages as above).