
zfs send -R from FreeNAS9 and receive on a ZoL result in hard lock #2602

Closed
FransUrbo opened this issue Aug 14, 2014 · 79 comments
Milestone

Comments

@FransUrbo
Contributor

I have no other information than that. It seems to be completely random, and I have not been able to reproduce it.

I first thought it was ryao/zvol (#2484; some information of value might be there, including the only 'screenshot' (literally!) I have), but it happens on an absolutely clean master as well.

First time it crashed (running #2484 and some of my own PRs - see #2484 (comment)), I was able to import the pool, but not run any zfs command (such as a simple zfs list) without hard locking the kernel.

Second time, running clean ZoL, it crashed after all the datasets had been sent and about 20 ZVOLs in.

Third time, running ONLY #2484, it crashed 18 ZVOLs in (all datasets sent just fine before the ZVOLs).

After a reboot and import, I'm now sending ONLY the ZVOLs (all the other data has already been sent successfully). It crashed again, and zfs list (again!) hangs the kernel after about 20 seconds...

Using the recommended #2338 does not help.

@FransUrbo
Contributor Author

Tried again, and yet again I woke up to a hard-locked kernel... And now the import (WITH zvol_inhibit_dev=1 and zfs_autoimport_disable=1 set!) fails with a hard lock as well...

What could cause a simple send+receive to fuck up the pool so badly that it becomes impossible to import!?

@behlendorf behlendorf added this to the 0.6.4 milestone Aug 18, 2014
@behlendorf behlendorf added the Bug label Aug 18, 2014
@behlendorf
Contributor

Screenshot from failure.


@behlendorf
Contributor

@FransUrbo The only thing which will cause a hard lockup is a CPU spinning in a function with interrupts disabled. If you're seeing this on every import, can you get the backtrace from the console? In particular, the chunk of the dump which includes the 'EIP is at ...' line.

You could also try disabling the NMI watchdog. This might allow the system to boot, as long as the code we're spinning in does eventually terminate. We just need to identify what that block of code is. This could perhaps be done by running perf record while it's spinning and the watchdog is disabled. Set the nmi_watchdog=0 kernel option to disable the watchdog.
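Concretely, the steps above might look like this (a sketch; the 30-second sampling window is an arbitrary choice, not something from this thread):

```shell
# Disable the NMI hard-lockup watchdog. Persistently: append
#   nmi_watchdog=0
# to the kernel command line in the bootloader config. Or, while the
# system is still responsive, at runtime:
echo 0 > /proc/sys/kernel/nmi_watchdog

# While the CPU is spinning, sample all CPUs with call graphs for ~30s:
perf record -a -g -- sleep 30

# Then inspect which kernel function is consuming the CPU:
perf report
```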

@behlendorf
Contributor

@FransUrbo This is a complete and total guess, but pull #2614 is designed to minimize the number of reads in zvol_get_stats(), which is where your stack trace is showing an issue. You might give it a try.

@FransUrbo
Contributor Author

If you're seeing this on every import can you get the backtrace from the console.

I'm NOT seeing this on every import. I have not been able to figure out when it hangs at import and when it hangs on the subsequent 'zfs list'...

It WILL hard lock eventually nonetheless. But my terminal is too small (and I have not been able to increase the virtual terminal size - can't find a fb module that works on my graphics card).

And the SysRQ keys don't work at this point unfortunately.

You could also try disabling the nmi watchdog.

I'll try that, thanks.

All datasets have now been synced (for the umpteenth time - I have to recreate and resync everything from scratch every time) and only the ZVOLs are missing.

I tried last time to go back to a TXG from just before the ZVOLs, but then I got 'missing device' or something like that...

How do I figure out the current TXG? Looking at "zdb | grep -w txg:", I see:

    txg: 249923

Is that the txg I should use if (when) it crashes again, so that I don't have to do everything from scratch again?
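For reference, checking the last synced txg and attempting an import at an earlier txg might look roughly like this (a sketch: "tank" is a placeholder pool name, and -T is an undocumented, potentially dangerous import option on ZoL - verify it against your version before relying on it):

```shell
# Print the pool's uberblock; its "txg" field is the most recently
# synced transaction group ("tank" is a placeholder pool name):
zdb -e -u tank

# Attempt an import rewound to a specific txg, read-only first so
# nothing newer is discarded until you are sure the data is intact:
zpool import -o readonly=on -T 249923 tank
```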

@FransUrbo
Contributor Author

@FransUrbo This is a complete and total guess, but pull #2614 is designed to minimize the number of reads in zvol_get_stats(), which is where your stack trace is showing an issue. You might give it a try.

Can't find #2614 (latest is 2611 :).

@behlendorf
Contributor

It WILL hard lock eventually nonetheless. But my terminal is too small...

Sometimes you can use Shift-PageUp and Shift-PageDown on the terminal to scroll back a few lines.

Can't find #2614 (latest is 2611 :).

Sorry, typo. I meant #2612.

@FransUrbo
Contributor Author

No 'EIP is at ...' line, but this is what is in the logs now:

[ 5205.613819] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=15002 jiffies, g=19657, c=19656, q=525)
[ 5205.613918] sending NMI to all CPUs:
[ 5205.613923] NMI backtrace for cpu 0
[ 5205.613927] CPU: 0 PID: 9150 Comm: zfs Tainted: P           O  3.15.0+tf.1 #2
[ 5205.613929] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5205.613931] task: ffff8803a91f2000 ti: ffff88036dc14000 task.ti: ffff88036dc14000
[ 5205.613932] RIP: 0010:[<ffffffff81080e07>]  [<ffffffff81080e07>] do_raw_spin_lock+0xb2/0x130
[ 5205.613937] RSP: 0018:ffff88036dc17b38  EFLAGS: 00000002
[ 5205.613938] RAX: 0000000000000000 RBX: ffff8800cddfa288 RCX: 0000000000000002
[ 5205.613939] RDX: 0000000000000002 RSI: ffffffff814d269f RDI: 0000000000000001
[ 5205.613940] RBP: 000000003710e3d7 R08: 000000000000006a R09: e0525e8c9b98c0a6
[ 5205.613941] R10: 0000000000000001 R11: 0000000000000001 R12: 00000000e4eb3db0
[ 5205.613942] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 5205.613944] FS:  00007fc901b8cb40(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000
[ 5205.613945] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5205.613946] CR2: ffffffffff600400 CR3: 0000000397502000 CR4: 00000000000007f0
[ 5205.613947] Stack:
[ 5205.613948]  ffff8800cddfa280 ffff8803a91f2000 0000000000000002 ffffffff813bea20
[ 5205.613950]  ffff88040b0b8000 0000000000000282 ffff880369c42340 ffff8802e7e2f048
[ 5205.613952]  ffff8803a91f2000 0000000000000001 ffff88036dc17c20 ffff8800cddfa280
[ 5205.613953] Call Trace:
[ 5205.613957]  [<ffffffff813bea20>] ? rwsem_down_read_failed+0x2c/0xd9
[ 5205.613961]  [<ffffffff811f43a4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 5205.613962]  [<ffffffff813be4e3>] ? down_read+0xe/0xf
[ 5205.613981]  [<ffffffffa139cdc0>] ? zap_get_leaf_byblk+0x1d3/0x2ad [zfs]
[ 5205.613986]  [<ffffffffa09251bb>] ? kmem_alloc_debug+0x10e/0x2c2 [spl]
[ 5205.613998]  [<ffffffffa139cef8>] ? zap_deref_leaf+0x5e/0x67 [zfs]
[ 5205.614009]  [<ffffffffa139ddb4>] ? fzap_lookup+0xe4/0x101 [zfs]
[ 5205.614020]  [<ffffffffa13a0568>] ? zap_hashbits+0x5/0x15 [zfs]
[ 5205.614031]  [<ffffffffa13a0657>] ? zap_hash+0x9b/0xb1 [zfs]
[ 5205.614042]  [<ffffffffa13a081a>] ? zap_name_alloc+0xba/0xcd [zfs]
[ 5205.614053]  [<ffffffffa13a038b>] ? mzap_addent+0xec/0xec [zfs]
[ 5205.614064]  [<ffffffffa13a22eb>] ? zap_lookup_norm+0x9c/0x160 [zfs]
[ 5205.614075]  [<ffffffffa13a241c>] ? zap_lookup+0x2a/0x2f [zfs]
[ 5205.614084]  [<ffffffffa13d3dd4>] ? zvol_get_stats+0x30/0xb5 [zfs]
[ 5205.614095]  [<ffffffffa1371395>] ? dsl_prop_nvlist_add_uint64+0x198/0x1a1 [zfs]
[ 5205.614104]  [<ffffffffa1356e6a>] ? dmu_objset_stats+0x2b/0x49 [zfs]
[ 5205.614114]  [<ffffffffa13acbcb>] ? zfs_ioc_objset_stats_impl+0x77/0xd1 [zfs]
[ 5205.614124]  [<ffffffffa13acd9b>] ? zfs_ioc_objset_stats+0x2a/0x49 [zfs]
[ 5205.614127]  [<ffffffffa0927108>] ? strdup+0x20/0x34 [spl]
[ 5205.614137]  [<ffffffffa13b1b22>] ? zfsdev_ioctl+0x33b/0x3f7 [zfs]
[ 5205.614139]  [<ffffffff8115232c>] ? fsnotify+0x237/0x273
[ 5205.614142]  [<ffffffff8112f354>] ? do_vfs_ioctl+0x406/0x44f
[ 5205.614144]  [<ffffffff81120fd1>] ? vfs_write+0x11a/0x176
[ 5205.614146]  [<ffffffff8112f3e6>] ? SyS_ioctl+0x49/0x77
[ 5205.614148]  [<ffffffff813c760e>] ? device_not_available+0x1e/0x30
[ 5205.614150]  [<ffffffff813c5f39>] ? system_call_fastpath+0x16/0x1b
[ 5205.614151] Code: 31 ed eb 2f 66 8b 13 0f b6 c6 0f b6 ca 83 e0 fe 39 c1 75 12 8d 8a 00 02 00 00 89 d0 f0 66 0f b1 0b 66 39 d0 74 66 bf 01 00 00 00 <48> ff c5 e8 ca 24 17 00 4c 39 e5 75 cc 48 c7 c6 b7 26 4d 81 48 
[ 5205.614167] NMI backtrace for cpu 4
[ 5205.614169] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P           O  3.15.0+tf.1 #2
[ 5205.614171] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5205.614172] task: ffff88041d958000 ti: ffff88041d990000 task.ti: ffff88041d990000
[ 5205.614174] RIP: 0010:[<ffffffff811f32d9>]  [<ffffffff811f32d9>] __delay+0x0/0x9
[ 5205.614178] RSP: 0018:ffff88041fd03e20  EFLAGS: 00000807
[ 5205.614179] RAX: 0000000008d0c600 RBX: 0000000000002710 RCX: 0000000000000007
[ 5205.614181] RDX: 00000000003a2286 RSI: 0000000000000002 RDI: 00000000003a2287
[ 5205.614182] RBP: ffff88041fd0f450 R08: 0000000000000002 R09: 00000000fffffffe
[ 5205.614183] R10: 00000000fffffffe R11: 0000000000000000 R12: 0000000000000001
[ 5205.614184] R13: ffffffff815ba000 R14: 0000000000000004 R15: ffffffff815bc000
[ 5205.614185] FS:  00007fd8e1d08700(0000) GS:ffff88041fd00000(0000) knlGS:0000000000000000
[ 5205.614186] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5205.614187] CR2: ffffffffff600400 CR3: 00000000dc105000 CR4: 00000000000007e0
[ 5205.614188] Stack:
[ 5205.614189]  ffffffff81037ca5 ffffffff815ba000 ffffffff81092533 001dcd6500000000
[ 5205.614191]  000000000000020d 0000000000000082 ffffffff81618d58 ffff88041d958000
[ 5205.614193]  ffff88041d958000 0000000000000000 0000000000000004 ffff88041fd0eeb0
[ 5205.614194] Call Trace:
[ 5205.614195]  <IRQ> 
[ 5205.614196]  [<ffffffff81037ca5>] ? arch_trigger_all_cpu_backtrace+0x60/0x6b
[ 5205.614202]  [<ffffffff81092533>] ? rcu_check_callbacks+0x438/0x558
[ 5205.614204]  [<ffffffff8109a7dc>] ? tick_nohz_handler+0x87/0x87
[ 5205.614207]  [<ffffffff81055952>] ? update_process_times+0x31/0x56
[ 5205.614209]  [<ffffffff8109a668>] ? tick_sched_handle+0x3f/0x4c
[ 5205.614210]  [<ffffffff8109a80c>] ? tick_sched_timer+0x30/0x4c
[ 5205.614212]  [<ffffffff8106681f>] ? __run_hrtimer+0xdb/0x1a0
[ 5205.614214]  [<ffffffff81067040>] ? hrtimer_interrupt+0xe4/0x1d1
[ 5205.614216]  [<ffffffff81036564>] ? smp_apic_timer_interrupt+0x36/0x46
[ 5205.614219]  [<ffffffff813c6b1d>] ? apic_timer_interrupt+0x6d/0x80
[ 5205.614220]  <EOI> 
[ 5205.614220]  [<ffffffff8103ea63>] ? native_safe_halt+0x2/0x3
[ 5205.614224]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5205.614226]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5205.614228]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5205.614230]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5205.614232] Code: 90 90 48 89 f8 48 85 c0 74 1d eb 06 66 0f 1f 44 00 00 eb 0e 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 ff c8 75 fb 48 ff c8 c3 <48> 8b 05 58 71 3e 00 ff e0 65 48 8b 14 25 20 43 01 00 48 6b d2 
[ 5205.614249] NMI backtrace for cpu 5
[ 5205.614253] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P           O  3.15.0+tf.1 #2
[ 5205.614255] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5205.614256] task: ffff88041d98a000 ti: ffff88041d9b4000 task.ti: ffff88041d9b4000
[ 5205.614258] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
[ 5205.614262] RSP: 0018:ffff88041d9b7eb0  EFLAGS: 00000282
[ 5205.614264] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
[ 5205.614265] RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000082
[ 5205.614266] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
[ 5205.614267] R10: ffff88041fd4eb40 R11: 0000000000000400 R12: 0000000000000000
[ 5205.614268] R13: ffff88041d9b7fd8 R14: ffff88041d9b4000 R15: ffff88041d9b7fd8
[ 5205.614269] FS:  00007fc02d179700(0000) GS:ffff88041fd40000(0000) knlGS:0000000000000000
[ 5205.614271] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5205.614272] CR2: ffffffffff600400 CR3: 000000040795f000 CR4: 00000000000007e0
[ 5205.614273] Stack:
[ 5205.614274]  ffffffff81017688 0000000000000020 0000000000000000 ffff88041d9b7fd8
[ 5205.614275]  ffffffff810177ad 0000000000000000 0000000500000000 ffff88041d9b7fd8
[ 5205.614277]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
[ 5205.614279] Call Trace:
[ 5205.614282]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5205.614284]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5205.614287]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5205.614290]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5205.614291] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
[ 5205.614309] NMI backtrace for cpu 2
[ 5205.614313] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O  3.15.0+tf.1 #2
[ 5205.614314] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5205.614316] task: ffff88041d956000 ti: ffff88041d984000 task.ti: ffff88041d984000
[ 5205.614317] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
[ 5205.614321] RSP: 0018:ffff88041d987eb0  EFLAGS: 00000282
[ 5205.614323] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fc86462
[ 5205.614324] RDX: 000004ba036e7a00 RSI: 0000000000000082 RDI: 0000000000000082
[ 5205.614325] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[ 5205.614326] R10: ffff88041fc8eb40 R11: 0000000000000400 R12: 0000000000000000
[ 5205.614327] R13: ffff88041d987fd8 R14: ffff88041d984000 R15: ffff88041d987fd8
[ 5205.614329] FS:  00007ff7bc75a720(0000) GS:ffff88041fc80000(0000) knlGS:0000000000000000
[ 5205.614330] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5205.614331] CR2: ffffffffff600400 CR3: 000000040b27e000 CR4: 00000000000007e0
[ 5205.614332] Stack:
[ 5205.614333]  ffffffff81017688 000000001fc86462 0000000000000000 ffff88041d987fd8
[ 5205.614335]  ffffffff810177ad 0000000000000000 0000000200000000 ffff88041d987fd8
[ 5205.614336]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
[ 5205.614338] Call Trace:
[ 5205.614341]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5205.614343]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5205.614346]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5205.614349]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5205.614350] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
[ 5205.614367] NMI backtrace for cpu 1
[ 5205.614372] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O  3.15.0+tf.1 #2
[ 5205.614373] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5205.614375] task: ffff88041d953000 ti: ffff88041d980000 task.ti: ffff88041d980000
[ 5205.614376] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
[ 5205.614380] RSP: 0018:ffff88041d983eb0  EFLAGS: 00000282
[ 5205.614382] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fc46260
[ 5205.614383] RDX: 000004ba036e7a00 RSI: 0000000000000082 RDI: 0000000000000082
[ 5205.614384] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 5205.614385] R10: ffff88041fc4eb40 R11: 0000000000000400 R12: 0000000000000000
[ 5205.614386] R13: ffff88041d983fd8 R14: ffff88041d980000 R15: ffff88041d983fd8
[ 5205.614388] FS:  00007f6f58ed4740(0000) GS:ffff88041fc40000(0000) knlGS:0000000000000000
[ 5205.614389] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5205.614390] CR2: ffffffffff600400 CR3: 000000040b1a1000 CR4: 00000000000007e0
[ 5205.614391] Stack:
[ 5205.614392]  ffffffff81017688 000000001fc46260 0000000000000000 ffff88041d983fd8
[ 5205.614393]  ffffffff810177ad 0000000000000000 0000000100000000 ffff88041d983fd8
[ 5205.614395]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
[ 5205.614397] Call Trace:
[ 5205.614400]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5205.614402]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5205.614405]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5205.614407]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5205.614409] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
[ 5205.614426] NMI backtrace for cpu 3
[ 5205.614430] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P           O  3.15.0+tf.1 #2
[ 5205.614432] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5205.614433] task: ffff88041d957000 ti: ffff88041d98c000 task.ti: ffff88041d98c000
[ 5205.614435] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
[ 5205.614439] RSP: 0018:ffff88041d98feb0  EFLAGS: 00000282
[ 5205.614440] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fcc6866
[ 5205.614441] RDX: 000004ba036e7a00 RSI: 0000000000000082 RDI: 0000000000000082
[ 5205.614442] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 5205.614443] R10: ffff88041fcceb40 R11: 0000000000000400 R12: 0000000000000000
[ 5205.614444] R13: ffff88041d98ffd8 R14: ffff88041d98c000 R15: ffff88041d98ffd8
[ 5205.614446] FS:  00007fc02727a700(0000) GS:ffff88041fcc0000(0000) knlGS:0000000000000000
[ 5205.614447] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5205.614448] CR2: ffffffffff600400 CR3: 000000040795f000 CR4: 00000000000007e0
[ 5205.614449] Stack:
[ 5205.614450]  ffffffff81017688 000000001fcc6866 0000000000000000 ffff88041d98ffd8
[ 5205.614452]  ffffffff810177ad 0000000000000000 0000000300000000 ffff88041d98ffd8
[ 5205.614454]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
[ 5205.614455] Call Trace:
[ 5205.614459]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5205.614461]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5205.614463]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5205.614466]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5205.614467] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
[ 5288.891090] INFO: task txg_sync:5779 blocked for more than 120 seconds.
[ 5288.891142]       Tainted: P           O  3.15.0+tf.1 #2
[ 5288.891187] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5288.891264] txg_sync        D ffffffff810725bb     0  5779      2 0x00000000
[ 5288.891267]  ffff8803f8e3b000 0000000000000046 0000000000000002 ffff88041d957000
[ 5288.891269]  0000000000014c00 0000000000014c00 ffff8803f8e3b000 ffff8803ebdcbfd8
[ 5288.891271]  0000000000000001 0000000000000282 dead000000200200 ffff88040c472a00
[ 5288.891273] Call Trace:
[ 5288.891282]  [<ffffffffa09249b4>] ? spl_kmem_cache_free+0x8a/0x359 [spl]
[ 5288.891285]  [<ffffffff8107b51f>] ? __init_waitqueue_head+0x25/0x25
[ 5288.891304]  [<ffffffffa137aa38>] ? metaslab_sync+0xb2/0x53f [zfs]
[ 5288.891308]  [<ffffffffa092eb2a>] ? cv_wait_common+0xcd/0x15b [spl]
[ 5288.891310]  [<ffffffff8107b51f>] ? __init_waitqueue_head+0x25/0x25
[ 5288.891322]  [<ffffffffa137d0b3>] ? rrw_enter_write+0x26/0x69 [zfs]
[ 5288.891334]  [<ffffffffa1384807>] ? spa_sync+0x5db/0x921 [zfs]
[ 5288.891337]  [<ffffffff81016573>] ? read_tsc+0x11/0x22
[ 5288.891339]  [<ffffffff8109391a>] ? ktime_get_ts+0x49/0xab
[ 5288.891350]  [<ffffffffa139187d>] ? txg_sync_thread+0x2dd/0x4f2 [zfs]
[ 5288.891353]  [<ffffffff810168ec>] ? sched_clock+0x11/0x14
[ 5288.891364]  [<ffffffffa13915a0>] ? txg_init+0x1b2/0x1b2 [zfs]
[ 5288.891367]  [<ffffffffa0928517>] ? thread_generic_wrapper+0x6a/0x75 [spl]
[ 5288.891370]  [<ffffffffa09284ad>] ? __thread_create+0x2cf/0x2cf [spl]
[ 5288.891373]  [<ffffffff810646ec>] ? kthread+0xc5/0xcd
[ 5288.891375]  [<ffffffff81064627>] ? __init_kthread_worker+0x2d/0x2d
[ 5288.891377]  [<ffffffff813c5e8c>] ? ret_from_fork+0x7c/0xb0
[ 5288.891379]  [<ffffffff81064627>] ? __init_kthread_worker+0x2d/0x2d
[ 5385.896256] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=60007 jiffies, g=19657, c=19656, q=2399)
[ 5385.896355] sending NMI to all CPUs:
[ 5385.896357] NMI backtrace for cpu 0
[ 5385.896360] CPU: 0 PID: 9150 Comm: zfs Tainted: P           O  3.15.0+tf.1 #2
[ 5385.896361] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5385.896362] task: ffff8803a91f2000 ti: ffff88036dc14000 task.ti: ffff88036dc14000
[ 5385.896363] RIP: 0010:[<ffffffff811f336c>]  [<ffffffff811f336c>] delay_tsc+0x2f/0x88
[ 5385.896365] RSP: 0018:ffff88036dc17b10  EFLAGS: 00000002
[ 5385.896366] RAX: 000013172f8bf129 RBX: 000000002f8bf129 RCX: 0000000000000002
[ 5385.896367] RDX: 000013172f8bf129 RSI: ffffffff814d269f RDI: 0000000000000001
[ 5385.896368] RBP: 0000000000000001 R08: 000000000000006a R09: e0525e8c9b98c0a6
[ 5385.896369] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[ 5385.896370] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 5385.896372] FS:  00007fc901b8cb40(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000
[ 5385.896373] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5385.896374] CR2: ffffffffff600400 CR3: 0000000397502000 CR4: 00000000000007f0
[ 5385.896374] Stack:
[ 5385.896375]  ffff8800cddfa288 00000000dbfcca2b 00000000e4eb3db0 0000000000000000
[ 5385.896377]  ffffffff81080e0f ffff8800cddfa280 ffff8803a91f2000 0000000000000002
[ 5385.896378]  ffffffff813bea20 ffff88040b0b8000 0000000000000282 ffff880369c42340
[ 5385.896380] Call Trace:
[ 5385.896382]  [<ffffffff81080e0f>] ? do_raw_spin_lock+0xba/0x130
[ 5385.896384]  [<ffffffff813bea20>] ? rwsem_down_read_failed+0x2c/0xd9
[ 5385.896386]  [<ffffffff811f43a4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 5385.896388]  [<ffffffff813be4e3>] ? down_read+0xe/0xf
[ 5385.896399]  [<ffffffffa139cdc0>] ? zap_get_leaf_byblk+0x1d3/0x2ad [zfs]
[ 5385.896402]  [<ffffffffa09251bb>] ? kmem_alloc_debug+0x10e/0x2c2 [spl]
[ 5385.896413]  [<ffffffffa139cef8>] ? zap_deref_leaf+0x5e/0x67 [zfs]
[ 5385.896424]  [<ffffffffa139ddb4>] ? fzap_lookup+0xe4/0x101 [zfs]
[ 5385.896435]  [<ffffffffa13a0568>] ? zap_hashbits+0x5/0x15 [zfs]
[ 5385.896446]  [<ffffffffa13a0657>] ? zap_hash+0x9b/0xb1 [zfs]
[ 5385.896457]  [<ffffffffa13a081a>] ? zap_name_alloc+0xba/0xcd [zfs]
[ 5385.896468]  [<ffffffffa13a038b>] ? mzap_addent+0xec/0xec [zfs]
[ 5385.896478]  [<ffffffffa13a22eb>] ? zap_lookup_norm+0x9c/0x160 [zfs]
[ 5385.896489]  [<ffffffffa13a241c>] ? zap_lookup+0x2a/0x2f [zfs]
[ 5385.896497]  [<ffffffffa13d3dd4>] ? zvol_get_stats+0x30/0xb5 [zfs]
[ 5385.896509]  [<ffffffffa1371395>] ? dsl_prop_nvlist_add_uint64+0x198/0x1a1 [zfs]
[ 5385.896517]  [<ffffffffa1356e6a>] ? dmu_objset_stats+0x2b/0x49 [zfs]
[ 5385.896527]  [<ffffffffa13acbcb>] ? zfs_ioc_objset_stats_impl+0x77/0xd1 [zfs]
[ 5385.896537]  [<ffffffffa13acd9b>] ? zfs_ioc_objset_stats+0x2a/0x49 [zfs]
[ 5385.896540]  [<ffffffffa0927108>] ? strdup+0x20/0x34 [spl]
[ 5385.896550]  [<ffffffffa13b1b22>] ? zfsdev_ioctl+0x33b/0x3f7 [zfs]
[ 5385.896551]  [<ffffffff8115232c>] ? fsnotify+0x237/0x273
[ 5385.896553]  [<ffffffff8112f354>] ? do_vfs_ioctl+0x406/0x44f
[ 5385.896555]  [<ffffffff81120fd1>] ? vfs_write+0x11a/0x176
[ 5385.896557]  [<ffffffff8112f3e6>] ? SyS_ioctl+0x49/0x77
[ 5385.896559]  [<ffffffff813c760e>] ? device_not_available+0x1e/0x30
[ 5385.896560]  [<ffffffff813c5f39>] ? system_call_fastpath+0x16/0x1b
[ 5385.896561] Code: 55 53 65 44 8b 24 25 1c b0 00 00 0f ae f0 66 66 90 48 83 3d b6 58 3b 00 00 75 02 eb 50 89 fd e8 ff 2f e2 ff 66 90 89 c3 0f ae f0 <66> 66 90 48 83 3d 99 58 3b 00 00 75 02 eb 33 e8 e4 2f e2 ff 66 
[ 5385.896577] NMI backtrace for cpu 4
[ 5385.896579] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P           O  3.15.0+tf.1 #2
[ 5385.896581] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5385.896582] task: ffff88041d958000 ti: ffff88041d990000 task.ti: ffff88041d990000
[ 5385.896584] RIP: 0010:[<ffffffff811f32d9>]  [<ffffffff811f32d9>] __delay+0x0/0x9
[ 5385.896588] RSP: 0018:ffff88041fd03e20  EFLAGS: 00000807
[ 5385.896589] RAX: 0000000008d0c600 RBX: 0000000000002710 RCX: 0000000000000007
[ 5385.896591] RDX: 00000000003a2286 RSI: 0000000000000002 RDI: 00000000003a2287
[ 5385.896592] RBP: ffff88041fd0f450 R08: 0000000000000002 R09: 00000000fffffffe
[ 5385.896593] R10: 00000000fffffffe R11: 0000000000000000 R12: 0000000000000001
[ 5385.896594] R13: ffffffff815ba000 R14: 0000000000000004 R15: ffffffff815bc000
[ 5385.896595] FS:  00007fd8e1d08700(0000) GS:ffff88041fd00000(0000) knlGS:0000000000000000
[ 5385.896596] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5385.896597] CR2: 00007f02a2e5c000 CR3: 00000000dc105000 CR4: 00000000000007e0
[ 5385.896598] Stack:
[ 5385.896599]  ffffffff81037ca5 ffffffff815ba000 ffffffff81092533 001dcd6500000000
[ 5385.896601]  000000000000095f 0000000000000082 ffffffff81618d58 ffff88041d958000
[ 5385.896603]  ffff88041d958000 0000000000000000 0000000000000004 ffff88041fd0eeb0
[ 5385.896605] Call Trace:
[ 5385.896606]  <IRQ> 
[ 5385.896606]  [<ffffffff81037ca5>] ? arch_trigger_all_cpu_backtrace+0x60/0x6b
[ 5385.896612]  [<ffffffff81092533>] ? rcu_check_callbacks+0x438/0x558
[ 5385.896615]  [<ffffffff8109a7dc>] ? tick_nohz_handler+0x87/0x87
[ 5385.896617]  [<ffffffff81055952>] ? update_process_times+0x31/0x56
[ 5385.896619]  [<ffffffff8109a668>] ? tick_sched_handle+0x3f/0x4c
[ 5385.896621]  [<ffffffff8109a80c>] ? tick_sched_timer+0x30/0x4c
[ 5385.896623]  [<ffffffff8106681f>] ? __run_hrtimer+0xdb/0x1a0
[ 5385.896624]  [<ffffffff81067040>] ? hrtimer_interrupt+0xe4/0x1d1
[ 5385.896627]  [<ffffffff81036564>] ? smp_apic_timer_interrupt+0x36/0x46
[ 5385.896629]  [<ffffffff813c6b1d>] ? apic_timer_interrupt+0x6d/0x80
[ 5385.896630]  <EOI> 
[ 5385.896631]  [<ffffffff8103ea63>] ? native_safe_halt+0x2/0x3
[ 5385.896635]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5385.896636]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5385.896639]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5385.896641]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5385.896642] Code: 90 90 48 89 f8 48 85 c0 74 1d eb 06 66 0f 1f 44 00 00 eb 0e 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 ff c8 75 fb 48 ff c8 c3 <48> 8b 05 58 71 3e 00 ff e0 65 48 8b 14 25 20 43 01 00 48 6b d2 
[ 5385.896659] NMI backtrace for cpu 1
[ 5385.896663] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O  3.15.0+tf.1 #2
[ 5385.896664] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5385.896666] task: ffff88041d953000 ti: ffff88041d980000 task.ti: ffff88041d980000
[ 5385.896667] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
[ 5385.896672] RSP: 0018:ffff88041d983eb0  EFLAGS: 00000282
[ 5385.896673] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fc4403e
[ 5385.896674] RDX: 000004e3ed75af00 RSI: 0000000000000082 RDI: 0000000000000082
[ 5385.896675] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 5385.896676] R10: ffff88041fc4eb40 R11: 0000000000000400 R12: 0000000000000000
[ 5385.896677] R13: ffff88041d983fd8 R14: ffff88041d980000 R15: ffff88041d983fd8
[ 5385.896679] FS:  00007f6f58ed4740(0000) GS:ffff88041fc40000(0000) knlGS:0000000000000000
[ 5385.896680] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5385.896681] CR2: 00007fcfe5937000 CR3: 000000040b1a1000 CR4: 00000000000007e0
[ 5385.896682] Stack:
[ 5385.896683]  ffffffff81017688 000000001fc4403e 0000000000000000 ffff88041d983fd8
[ 5385.896685]  ffffffff810177ad 0000000000000000 0000000100000000 ffff88041d983fd8
[ 5385.896686]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
[ 5385.896688] Call Trace:
[ 5385.896691]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5385.896694]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5385.896696]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5385.896699]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5385.896700] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
[ 5385.896718] NMI backtrace for cpu 3
[ 5385.896722] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P           O  3.15.0+tf.1 #2
[ 5385.896723] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5385.896725] task: ffff88041d957000 ti: ffff88041d98c000 task.ti: ffff88041d98c000
[ 5385.896726] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
[ 5385.896731] RSP: 0018:ffff88041d98feb0  EFLAGS: 00000282
[ 5385.896732] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fcc3a38
[ 5385.896733] RDX: 000004e3ed75af00 RSI: 0000000000000082 RDI: 0000000000000082
[ 5385.896734] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
[ 5385.896735] R10: ffff88041fcceb40 R11: 0000000000000400 R12: 0000000000000000
[ 5385.896736] R13: ffff88041d98ffd8 R14: ffff88041d98c000 R15: ffff88041d98ffd8
[ 5385.896738] FS:  00007fc02727a700(0000) GS:ffff88041fcc0000(0000) knlGS:0000000000000000
[ 5385.896739] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5385.896740] CR2: ffffffffff600400 CR3: 000000040795f000 CR4: 00000000000007e0
[ 5385.896741] Stack:
[ 5385.896742]  ffffffff81017688 000000001fcc3a38 0000000000000000 ffff88041d98ffd8
[ 5385.896743]  ffffffff810177ad 0000000000000000 0000000300000000 ffff88041d98ffd8
[ 5385.896745]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
[ 5385.896747] Call Trace:
[ 5385.896750]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5385.896752]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5385.896754]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5385.896757]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5385.896758] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
[ 5385.896776] NMI backtrace for cpu 2
[ 5385.896780] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O  3.15.0+tf.1 #2
[ 5385.896782] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5385.896784] task: ffff88041d956000 ti: ffff88041d984000 task.ti: ffff88041d984000
[ 5385.896785] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
[ 5385.896789] RSP: 0018:ffff88041d987eb0  EFLAGS: 00000282
[ 5385.896791] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fc83e3c
[ 5385.896792] RDX: 000004e3ed75af00 RSI: 0000000000000082 RDI: 0000000000000082
[ 5385.896793] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[ 5385.896794] R10: ffff88041fc8eb40 R11: 0000000000000400 R12: 0000000000000000
[ 5385.896795] R13: ffff88041d987fd8 R14: ffff88041d984000 R15: ffff88041d987fd8
[ 5385.896796] FS:  00007ff7bc75a720(0000) GS:ffff88041fc80000(0000) knlGS:0000000000000000
[ 5385.896798] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5385.896799] CR2: ffffffffff600400 CR3: 000000040b27e000 CR4: 00000000000007e0
[ 5385.896800] Stack:
[ 5385.896801]  ffffffff81017688 000000001fc83e3c 0000000000000000 ffff88041d987fd8
[ 5385.896802]  ffffffff810177ad 0000000000000000 0000000200000000 ffff88041d987fd8
[ 5385.896804]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
[ 5385.896806] Call Trace:
[ 5385.896809]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5385.896811]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5385.896814]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5385.896817]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5385.896818] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
[ 5385.896836] NMI backtrace for cpu 5
[ 5385.896840] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P           O  3.15.0+tf.1 #2
[ 5385.896841] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5385.896843] task: ffff88041d98a000 ti: ffff88041d9b4000 task.ti: ffff88041d9b4000
[ 5385.896844] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
[ 5385.896849] RSP: 0018:ffff88041d9b7eb0  EFLAGS: 00000282
[ 5385.896850] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
[ 5385.896852] RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000082
[ 5385.896853] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
[ 5385.896854] R10: ffff88041fd4eb40 R11: 0000000000000400 R12: 0000000000000000
[ 5385.896855] R13: ffff88041d9b7fd8 R14: ffff88041d9b4000 R15: ffff88041d9b7fd8
[ 5385.896856] FS:  00007fc02d179700(0000) GS:ffff88041fd40000(0000) knlGS:0000000000000000
[ 5385.896857] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5385.896858] CR2: ffffffffff600400 CR3: 000000040795f000 CR4: 00000000000007e0
[ 5385.896860] Stack:
[ 5385.896861]  ffffffff81017688 0000000000000020 0000000000000000 ffff88041d9b7fd8
[ 5385.896862]  ffffffff810177ad 0000000000000000 0000000500000000 ffff88041d9b7fd8
[ 5385.896864]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
[ 5385.896866] Call Trace:
[ 5385.896869]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
[ 5385.896871]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
[ 5385.896874]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
[ 5385.896876]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
[ 5385.896878] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
[ 5395.659828] BUG: spinlock lockup suspected on CPU#0, zfs/9150
[ 5395.659884] BUG: unable to handle kernel NULL pointer dereference at 00000000000003d5
[ 5395.659965] IP: [<ffffffff813b8c88>] spin_dump+0x51/0x88
[ 5395.660011] PGD 38b275067 PUD 3828a6067 PMD 0 
[ 5395.660057] Oops: 0000 [#1] SMP 
[ 5395.660099] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache ipv6 ext4 jbd2 crc16 sunrpc ext2 dm_crypt zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) zlib_deflate microcode snd_pcm kvm_amd i2c_piix4 snd_timer kvm snd soundcore i2c_core sp5100_tco k10temp edac_core pcspkr processor mxm_wmi evdev wmi thermal_sys button ext3 mbcache jbd dm_mod raid1 md_mod sg sd_mod crc_t10dif crct10dif_common usb_storage firewire_ohci firewire_core crc_itu_t r8169 mii mvsas ahci libsas ohci_pci ohci_hcd xhci_hcd libahci libata scsi_transport_sas scsi_mod ehci_pci ehci_hcd
[ 5395.660621] CPU: 0 PID: 9150 Comm: zfs Tainted: P           O  3.15.0+tf.1 #2
[ 5395.660671] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 5395.660756] task: ffff8803a91f2000 ti: ffff88036dc14000 task.ti: ffff88036dc14000
[ 5395.660832] RIP: 0010:[<ffffffff813b8c88>]  [<ffffffff813b8c88>] spin_dump+0x51/0x88
[ 5395.660910] RSP: 0018:ffff88036dc17b18  EFLAGS: 00010002
[ 5395.660954] RAX: 0000000000000031 RBX: 0000000000000001 RCX: 0000000000000007
[ 5395.661005] RDX: 0000000000000006 RSI: 0000000000000082 RDI: 0000000000000082
[ 5395.661055] RBP: ffff8800cddfa288 R08: 0000000000000000 R09: 0000000069c42340
[ 5395.661106] R10: 0000000000000000 R11: ffff88041d82d000 R12: 00000000e4eb3db0
[ 5395.661156] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 5395.661207] FS:  00007fc901b8cb40(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000
[ 5395.661285] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5395.661331] CR2: 00000000000003d5 CR3: 0000000397502000 CR4: 00000000000007f0
[ 5395.661381] Stack:
[ 5395.661414]  0000000000000075 ffff8800cddfa288 00000000e4eb3db0 ffffffff81080e23
[ 5395.661496]  ffff8800cddfa280 ffff8803a91f2000 0000000000000002 ffffffff813bea20
[ 5395.661607]  ffff88040b0b8000 0000000000000282 ffff880369c42340 ffff8802e7e2f048
[ 5395.661689] Call Trace:
[ 5395.661725]  [<ffffffff81080e23>] ? do_raw_spin_lock+0xce/0x130
[ 5395.661772]  [<ffffffff813bea20>] ? rwsem_down_read_failed+0x2c/0xd9
[ 5395.661821]  [<ffffffff811f43a4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 5395.661871]  [<ffffffff813be4e3>] ? down_read+0xe/0xf
[ 5395.661925]  [<ffffffffa139cdc0>] ? zap_get_leaf_byblk+0x1d3/0x2ad [zfs]
[ 5395.661975]  [<ffffffffa09251bb>] ? kmem_alloc_debug+0x10e/0x2c2 [spl]
[ 5395.662034]  [<ffffffffa139cef8>] ? zap_deref_leaf+0x5e/0x67 [zfs]
[ 5395.662091]  [<ffffffffa139ddb4>] ? fzap_lookup+0xe4/0x101 [zfs]
[ 5395.662148]  [<ffffffffa13a0568>] ? zap_hashbits+0x5/0x15 [zfs]
[ 5395.662204]  [<ffffffffa13a0657>] ? zap_hash+0x9b/0xb1 [zfs]
[ 5395.662260]  [<ffffffffa13a081a>] ? zap_name_alloc+0xba/0xcd [zfs]
[ 5395.662317]  [<ffffffffa13a038b>] ? mzap_addent+0xec/0xec [zfs]
[ 5395.662373]  [<ffffffffa13a22eb>] ? zap_lookup_norm+0x9c/0x160 [zfs]
[ 5395.662430]  [<ffffffffa13a241c>] ? zap_lookup+0x2a/0x2f [zfs]
[ 5395.662484]  [<ffffffffa13d3dd4>] ? zvol_get_stats+0x30/0xb5 [zfs]
[ 5395.662541]  [<ffffffffa1371395>] ? dsl_prop_nvlist_add_uint64+0x198/0x1a1 [zfs]
[ 5395.662625]  [<ffffffffa1356e6a>] ? dmu_objset_stats+0x2b/0x49 [zfs]
[ 5395.662701]  [<ffffffffa13acbcb>] ? zfs_ioc_objset_stats_impl+0x77/0xd1 [zfs]
[ 5395.662760]  [<ffffffffa13acd9b>] ? zfs_ioc_objset_stats+0x2a/0x49 [zfs]
[ 5395.662811]  [<ffffffffa0927108>] ? strdup+0x20/0x34 [spl]
[ 5395.662865]  [<ffffffffa13b1b22>] ? zfsdev_ioctl+0x33b/0x3f7 [zfs]
[ 5395.662913]  [<ffffffff8115232c>] ? fsnotify+0x237/0x273
[ 5395.662958]  [<ffffffff8112f354>] ? do_vfs_ioctl+0x406/0x44f
[ 5395.663004]  [<ffffffff81120fd1>] ? vfs_write+0x11a/0x176
[ 5395.663050]  [<ffffffff8112f3e6>] ? SyS_ioctl+0x49/0x77
[ 5395.663094]  [<ffffffff813c760e>] ? device_not_available+0x1e/0x30
[ 5395.663143]  [<ffffffff813c5f39>] ? system_call_fastpath+0x16/0x1b
[ 5395.663190] Code: 25 80 d6 00 00 44 8b 80 d4 03 00 00 48 8d 88 88 05 00 00 31 c0 65 8b 14 25 1c b0 00 00 e8 55 01 00 00 48 85 db 44 8b 4d 08 74 10 <44> 8b 83 d4 03 00 00 48 8d 8b 88 05 00 00 eb 0b 41 83 c8 ff 48 
[ 5395.663440] RIP  [<ffffffff813b8c88>] spin_dump+0x51/0x88
[ 5395.663486]  RSP <ffff88036dc17b18>
[ 5395.663524] CR2: 00000000000003d5
[ 5395.663854] ---[ end trace b0664fcadf4bf115 ]---

@behlendorf
Contributor

OK, given the stacks I can explain a small part of this.

The txg_sync thread is blocked on the dp_config_rwlock. This lock was taken by the zfs process, pid 9150, when it entered the zfs_ioc_objset_stats() function. Pid 9150 then went on to successfully acquire the zap_rwlock for the ZVOL_ZAP_OBJ, but is blocked waiting to acquire the lock for a particular leaf. Unfortunately, it's completely unclear from the stack why it can't also acquire the l_rwlock on the leaf.

One possibility is that this is being caused by some old compatibility code in the spl. Can you try reverting commit openzfs/spl@46aa7b3 and see if it resolves the issue.

If that doesn't do the trick then I would try rolling back to zfs commit 4dd1893 and see if you can reproduce this issue. I don't think this is a new issue we introduced, but changes after that commit may have made it more likely.

@behlendorf
Contributor

@FransUrbo Actually, give me a few hours and I'll have a patch for the SPL you can test. I just want to test it first.

@FransUrbo
Contributor Author

Sorry, typo. I meant #2612.

Ah, that's the same as #2480, which I had already prepared to use (I can't remember if I was actually running that PR when I did the final sync of the ZVOLs, but I think I was).

I'm doing it now and the import still gives a stack trace (no hard locks any more since I disabled the NMI watchdog).
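For anyone reproducing this: the NMI-watchdog toggle mentioned here is normally done with a kernel boot parameter. This is the generic interface, an assumption about how it was done on this machine, not the exact change:

```
# Kernel command-line fragment: disables the hard-lockup NMI watchdog so a
# suspected lockup is logged as a stack trace instead of hard-locking the box
nmi_watchdog=0
```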

Trying to go back to the TXG before that sync just gives:

cannot import 'share': one or more devices is currently unavailable

Tried with and without #2598. I'm waiting for your SPL patch to see if that allows me to import the pool so I don't have to resync everything again...
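For anyone attempting the same recovery, the txg rewind tried above amounts to something like the following. This is a sketch: the `rewind_import_cmd` helper is ours (not a ZFS tool), and the txg value is a placeholder, not one from this pool.

```shell
# Hypothetical wrapper around a read-only rewind import. The txg number must
# come from uberblock output (e.g. zdb -ul <vdev>); 1234567 is a placeholder.
rewind_import_cmd() {
    # -N: don't mount datasets; -o readonly=on: avoid writing while testing a txg
    printf 'zpool import -N -f -o readonly=on -T %s %s\n' "$1" "$2"
}

rewind_import_cmd 1234567 share
# -> zpool import -N -f -o readonly=on -T 1234567 share
```

Printing the command first makes it easy to double-check the txg before actually running a rewind, since rewinding discards everything written after that txg.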

@FransUrbo
Contributor Author

Using spl/master and going back to 4dd1893 still gives a stack trace at import, and using '-T...' still gives

cannot import 'share': one or more devices is currently unavailable

@behlendorf
Contributor

@FransUrbo Can you try openzfs/spl#383? It removes a lot of legacy compatibility code and the non-critical rw_tryupgrade() code. All of this code makes certain assumptions about how the rwlock is implemented. If any of those assumptions are subtly wrong, it could cause your lockup. Since you're running a modern kernel we can do away with them and see if they're part of the cause.

@FransUrbo
Contributor Author

@FransUrbo Can you try openzfs/spl#383.

Still a stack trace at import, and it's still not possible to use -T :(

@behlendorf
Contributor

Still down in zap_get_leaf_byblk, blocked on down_read?

@FransUrbo
Contributor Author

Looks different now. This is with openzfs/spl#383 and zfs reverted to 4dd1893.

Aug 19 22:44:25 celia kernel: [  464.204613] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 2, t=15002 jiffies, g=4183, c=4182, q=613)
Aug 19 22:44:25 celia kernel: [  464.204915] sending NMI to all CPUs:
Aug 19 22:44:25 celia kernel: [  464.204918] NMI backtrace for cpu 2
Aug 19 22:44:25 celia kernel: [  464.204921] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O  3.15.0+tf.1 #2
Aug 19 22:44:25 celia kernel: [  464.204922] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
Aug 19 22:44:25 celia kernel: [  464.204924] task: ffff88041d956000 ti: ffff88041d984000 task.ti: ffff88041d984000
Aug 19 22:44:25 celia kernel: [  464.204925] RIP: 0010:[<ffffffff811f32d9>]  [<ffffffff811f32d9>] __delay+0x0/0x9
Aug 19 22:44:25 celia kernel: [  464.204930] RSP: 0018:ffff88041fc83e20  EFLAGS: 00000807
Aug 19 22:44:25 celia kernel: [  464.204931] RAX: 00000000e022cb00 RBX: 0000000000002710 RCX: 0000000000000007
Aug 19 22:44:25 celia kernel: [  464.204933] RDX: 00000000003a2315 RSI: 0000000000000002 RDI: 00000000003a2316
Aug 19 22:44:25 celia kernel: [  464.204934] RBP: ffff88041fc8f450 R08: 0000000000000002 R09: 00000000fffffffe
Aug 19 22:44:25 celia kernel: [  464.204935] R10: 00000000fffffffe R11: 0000000000000000 R12: 0000000000000001
Aug 19 22:44:25 celia kernel: [  464.204936] R13: ffffffff815ba000 R14: 0000000000000002 R15: ffffffff815bc000
Aug 19 22:44:25 celia kernel: [  464.204937] FS:  00007fea3d74f700(0000) GS:ffff88041fc80000(0000) knlGS:0000000000000000
Aug 19 22:44:25 celia kernel: [  464.204938] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 19 22:44:25 celia kernel: [  464.204939] CR2: ffffffffff600400 CR3: 000000040a742000 CR4: 00000000000007e0
Aug 19 22:44:25 celia kernel: [  464.204940] Stack:
Aug 19 22:44:25 celia kernel: [  464.204941]  ffffffff81037ca5 ffffffff815ba000 ffffffff81092533 001dcd6500000000
Aug 19 22:44:25 celia kernel: [  464.204943]  0000000000000265 0000000000000082 ffffffff81618d58 ffff88041d956000
Aug 19 22:44:25 celia kernel: [  464.204945]  ffff88041d956000 0000000000000000 0000000000000002 ffff88041fc8eeb0
Aug 19 22:44:25 celia kernel: [  464.204947] Call Trace:
Aug 19 22:44:25 celia kernel: [  464.204948]  <IRQ> 
Aug 19 22:44:25 celia kernel: [  464.204949]  [<ffffffff81037ca5>] ? arch_trigger_all_cpu_backtrace+0x60/0x6b
Aug 19 22:44:25 celia kernel: [  464.204954]  [<ffffffff81092533>] ? rcu_check_callbacks+0x438/0x558
Aug 19 22:44:25 celia kernel: [  464.204957]  [<ffffffff8109a7dc>] ? tick_nohz_handler+0x87/0x87
Aug 19 22:44:25 celia kernel: [  464.204959]  [<ffffffff81055952>] ? update_process_times+0x31/0x56
Aug 19 22:44:25 celia kernel: [  464.204961]  [<ffffffff8109a668>] ? tick_sched_handle+0x3f/0x4c
Aug 19 22:44:25 celia kernel: [  464.204963]  [<ffffffff8109a80c>] ? tick_sched_timer+0x30/0x4c
Aug 19 22:44:25 celia kernel: [  464.204965]  [<ffffffff8106681f>] ? __run_hrtimer+0xdb/0x1a0
Aug 19 22:44:25 celia kernel: [  464.204967]  [<ffffffff81067040>] ? hrtimer_interrupt+0xe4/0x1d1
Aug 19 22:44:25 celia kernel: [  464.204969]  [<ffffffff81036564>] ? smp_apic_timer_interrupt+0x36/0x46
Aug 19 22:44:25 celia kernel: [  464.204972]  [<ffffffff813c6b1d>] ? apic_timer_interrupt+0x6d/0x80
Aug 19 22:44:25 celia kernel: [  464.204973]  <EOI> 
Aug 19 22:44:25 celia kernel: [  464.204973]  [<ffffffff8103ea63>] ? native_safe_halt+0x2/0x3
Aug 19 22:44:25 celia kernel: [  464.204977]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
Aug 19 22:44:25 celia kernel: [  464.204979]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
Aug 19 22:44:25 celia kernel: [  464.204982]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
Aug 19 22:44:25 celia kernel: [  464.204984]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
Aug 19 22:44:25 celia kernel: [  464.204985] Code: 90 90 48 89 f8 48 85 c0 74 1d eb 06 66 0f 1f 44 00 00 eb 0e 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 ff c8 75 fb 48 ff c8 c3 <48> 8b 05 58 71 3e 00 ff e0 65 48 8b 14 25 20 43 01 00 48 6b d2 
Aug 19 22:44:25 celia kernel: [  464.205002] NMI backtrace for cpu 4
Aug 19 22:44:25 celia kernel: [  464.205006] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P           O  3.15.0+tf.1 #2
Aug 19 22:44:25 celia kernel: [  464.205008] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
Aug 19 22:44:25 celia kernel: [  464.205010] task: ffff88041d958000 ti: ffff88041d990000 task.ti: ffff88041d990000
Aug 19 22:44:25 celia kernel: [  464.205011] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
Aug 19 22:44:25 celia kernel: [  464.205015] RSP: 0018:ffff88041d993eb0  EFLAGS: 00000282
Aug 19 22:44:25 celia kernel: [  464.205017] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fd08684
Aug 19 22:44:25 celia kernel: [  464.205018] RDX: 0000006bacc39700 RSI: 0000000000000082 RDI: 0000000000000082
Aug 19 22:44:25 celia kernel: [  464.205019] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205020] R10: ffff88041fd0eb40 R11: 0000000000000400 R12: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205021] R13: ffff88041d993fd8 R14: ffff88041d990000 R15: ffff88041d993fd8
Aug 19 22:44:25 celia kernel: [  464.205022] FS:  00007f97fff27700(0000) GS:ffff88041fd00000(0000) knlGS:0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205024] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 19 22:44:25 celia kernel: [  464.205025] CR2: ffffffffff600400 CR3: 0000000409c69000 CR4: 00000000000007e0
Aug 19 22:44:25 celia kernel: [  464.205026] Stack:
Aug 19 22:44:25 celia kernel: [  464.205027]  ffffffff81017688 000000001fd08684 0000000000000000 ffff88041d993fd8
Aug 19 22:44:25 celia kernel: [  464.205028]  ffffffff810177ad 0000000000000000 0000000400000000 ffff88041d993fd8
Aug 19 22:44:25 celia kernel: [  464.205030]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
Aug 19 22:44:25 celia kernel: [  464.205032] Call Trace:
Aug 19 22:44:25 celia kernel: [  464.205035]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
Aug 19 22:44:25 celia kernel: [  464.205038]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
Aug 19 22:44:25 celia kernel: [  464.205040]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
Aug 19 22:44:25 celia kernel: [  464.205043]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
Aug 19 22:44:25 celia kernel: [  464.205045] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
Aug 19 22:44:25 celia kernel: [  464.205063] NMI backtrace for cpu 0
Aug 19 22:44:25 celia kernel: [  464.205067] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O  3.15.0+tf.1 #2
Aug 19 22:44:25 celia kernel: [  464.205068] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
Aug 19 22:44:25 celia kernel: [  464.205070] task: ffffffff81594450 ti: ffffffff8156c000 task.ti: ffffffff8156c000
Aug 19 22:44:25 celia kernel: [  464.205072] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
Aug 19 22:44:25 celia kernel: [  464.205076] RSP: 0018:ffffffff8156fe90  EFLAGS: 00000282
Aug 19 22:44:25 celia kernel: [  464.205077] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fc07e7c
Aug 19 22:44:25 celia kernel: [  464.205078] RDX: 0000006bacc39700 RSI: 0000000000000082 RDI: 0000000000000082
Aug 19 22:44:25 celia kernel: [  464.205079] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205080] R10: ffff88041fc0eb40 R11: 0000000000000400 R12: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205081] R13: ffffffff8156ffd8 R14: ffffffff8156c000 R15: ffffffff8156ffd8
Aug 19 22:44:25 celia kernel: [  464.205083] FS:  00007fee0d5a3740(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205084] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 19 22:44:25 celia kernel: [  464.205085] CR2: ffffffffff600400 CR3: 00000000dc97a000 CR4: 00000000000007f0
Aug 19 22:44:25 celia kernel: [  464.205086] Stack:
Aug 19 22:44:25 celia kernel: [  464.205087]  ffffffff81017688 000000001fc07e7c 0000000000000000 ffffffff8156ffd8
Aug 19 22:44:25 celia kernel: [  464.205089]  ffffffff810177ad 0000000000000000 0000000000000000 ffffffff8156ffd8
Aug 19 22:44:25 celia kernel: [  464.205091]  ffffffff8107ba73 ffffffff816c8020 ffffffffffffffff ffffffff816c7010
Aug 19 22:44:25 celia kernel: [  464.205092] Call Trace:
Aug 19 22:44:25 celia kernel: [  464.205096]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
Aug 19 22:44:25 celia kernel: [  464.205098]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
Aug 19 22:44:25 celia kernel: [  464.205101]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
Aug 19 22:44:25 celia kernel: [  464.205103]  [<ffffffff81639e68>] ? start_kernel+0x440/0x44b
Aug 19 22:44:25 celia kernel: [  464.205105]  [<ffffffff8163983f>] ? repair_env_string+0x58/0x58
Aug 19 22:44:25 celia kernel: [  464.205108]  [<ffffffff81639117>] ? early_idt_handlers+0x117/0x120
Aug 19 22:44:25 celia kernel: [  464.205110]  [<ffffffff81639655>] ? x86_64_start_kernel+0x185/0x194
Aug 19 22:44:25 celia kernel: [  464.205110] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
Aug 19 22:44:25 celia kernel: [  464.205129] NMI backtrace for cpu 3
Aug 19 22:44:25 celia kernel: [  464.205133] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P           O  3.15.0+tf.1 #2
Aug 19 22:44:25 celia kernel: [  464.205134] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
Aug 19 22:44:25 celia kernel: [  464.205137] task: ffff88041d957000 ti: ffff88041d98c000 task.ti: ffff88041d98c000
Aug 19 22:44:25 celia kernel: [  464.205138] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
Aug 19 22:44:25 celia kernel: [  464.205143] RSP: 0018:ffff88041d98feb0  EFLAGS: 00000282
Aug 19 22:44:25 celia kernel: [  464.205144] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000001fcc8482
Aug 19 22:44:25 celia kernel: [  464.205145] RDX: 0000006bacc39700 RSI: 0000000000000082 RDI: 0000000000000082
Aug 19 22:44:25 celia kernel: [  464.205146] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205147] R10: ffff88041fcceb40 R11: 0000000000000400 R12: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205148] R13: ffff88041d98ffd8 R14: ffff88041d98c000 R15: ffff88041d98ffd8
Aug 19 22:44:25 celia kernel: [  464.205150] FS:  00007f9ca47a1700(0000) GS:ffff88041fcc0000(0000) knlGS:0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205151] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 19 22:44:25 celia kernel: [  464.205152] CR2: ffffffffff600400 CR3: 000000040bc88000 CR4: 00000000000007e0
Aug 19 22:44:25 celia kernel: [  464.205153] Stack:
Aug 19 22:44:25 celia kernel: [  464.205154]  ffffffff81017688 000000001fcc8482 0000000000000000 ffff88041d98ffd8
Aug 19 22:44:25 celia kernel: [  464.205156]  ffffffff810177ad 0000000000000000 0000000300000000 ffff88041d98ffd8
Aug 19 22:44:25 celia kernel: [  464.205158]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
Aug 19 22:44:25 celia kernel: [  464.205159] Call Trace:
Aug 19 22:44:25 celia kernel: [  464.205163]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
Aug 19 22:44:25 celia kernel: [  464.205165]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
Aug 19 22:44:25 celia kernel: [  464.205168]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
Aug 19 22:44:25 celia kernel: [  464.205171]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
Aug 19 22:44:25 celia kernel: [  464.205172] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 
Aug 19 22:44:25 celia kernel: [  464.205190] NMI backtrace for cpu 1
Aug 19 22:44:25 celia kernel: [  464.205195] CPU: 1 PID: 6202 Comm: zpool Tainted: P           O  3.15.0+tf.1 #2
Aug 19 22:44:25 celia kernel: [  464.205196] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
Aug 19 22:44:25 celia kernel: [  464.205198] task: ffff8804098fb000 ti: ffff8803f06e8000 task.ti: ffff8803f06e8000
Aug 19 22:44:25 celia kernel: [  464.205200] RIP: 0010:[<ffffffff81080e07>]  [<ffffffff81080e07>] do_raw_spin_lock+0xb2/0x130
Aug 19 22:44:25 celia kernel: [  464.205205] RSP: 0018:ffff8803f06ebb08  EFLAGS: 00000002
Aug 19 22:44:25 celia kernel: [  464.205206] RAX: 0000000000000000 RBX: ffff8803f0568a08 RCX: 0000000000000002
Aug 19 22:44:25 celia kernel: [  464.205207] RDX: 0000000000000002 RSI: ffffffff814d269f RDI: 0000000000000001
Aug 19 22:44:25 celia kernel: [  464.205208] RBP: 00000000346f018f R08: 00000000000000cd R09: e0525e8c9b98c0a6
Aug 19 22:44:25 celia kernel: [  464.205209] R10: ffff8803e37ca3b8 R11: 0000000000000001 R12: 00000000e4ed7418
Aug 19 22:44:25 celia kernel: [  464.205210] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205212] FS:  00007fcf45f4eb40(0000) GS:ffff88041fc40000(0000) knlGS:0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205213] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 19 22:44:25 celia kernel: [  464.205214] CR2: 0000000002077028 CR3: 00000003f0412000 CR4: 00000000000007e0
Aug 19 22:44:25 celia kernel: [  464.205215] Stack:
Aug 19 22:44:25 celia kernel: [  464.205216]  ffff8803f0568a00 ffff8804098fb000 0000000000000000 ffffffff813bea20
Aug 19 22:44:25 celia kernel: [  464.205218]  ffff8803f06ebbe0 0000000000000000 0000000000000000 ffffffffa0541402
Aug 19 22:44:25 celia kernel: [  464.205219]  ffff8804098fb000 0000000000000001 ffff8803f06ebbe0 ffff8803f0568a00
Aug 19 22:44:25 celia kernel: [  464.205221] Call Trace:
Aug 19 22:44:25 celia kernel: [  464.205225]  [<ffffffff813bea20>] ? rwsem_down_read_failed+0x2c/0xd9
Aug 19 22:44:25 celia kernel: [  464.205240]  [<ffffffffa0541402>] ? dmu_buf_hold_noread+0x88/0xc5 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205244]  [<ffffffff811f43a4>] ? call_rwsem_down_read_failed+0x14/0x30
Aug 19 22:44:25 celia kernel: [  464.205245]  [<ffffffff813be4e3>] ? down_read+0xe/0xf
Aug 19 22:44:25 celia kernel: [  464.205257]  [<ffffffffa05879f3>] ? zap_get_leaf_byblk+0x132/0x1c3 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205268]  [<ffffffffa0587ae2>] ? zap_deref_leaf+0x5e/0x67 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205279]  [<ffffffffa058882a>] ? fzap_lookup+0xe4/0x101 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205290]  [<ffffffffa058ae17>] ? zap_hashbits+0x5/0x15 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205301]  [<ffffffffa058af06>] ? zap_hash+0x9b/0xb1 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205312]  [<ffffffffa058b0c9>] ? zap_name_alloc+0xba/0xcd [zfs]
Aug 19 22:44:25 celia kernel: [  464.205323]  [<ffffffffa058acbf>] ? mzap_addent+0xec/0xec [zfs]
Aug 19 22:44:25 celia kernel: [  464.205333]  [<ffffffffa058c71b>] ? zap_lookup_norm+0x9c/0x160 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205344]  [<ffffffffa058c84c>] ? zap_lookup+0x2a/0x2f [zfs]
Aug 19 22:44:25 celia kernel: [  464.205353]  [<ffffffffa05bc5ec>] ? zvol_get_stats+0x30/0xb5 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205363]  [<ffffffffa055c151>] ? dsl_prop_nvlist_add_uint64+0x198/0x1a1 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205372]  [<ffffffffa054308e>] ? dmu_objset_stats+0x2b/0x49 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205382]  [<ffffffffa0596863>] ? zfs_ioc_objset_stats_impl+0x77/0xd1 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205391]  [<ffffffffa0596a33>] ? zfs_ioc_objset_stats+0x2a/0x49 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205402]  [<ffffffffa055bd9d>] ? dsl_pool_rele+0xe/0x1c [zfs]
Aug 19 22:44:25 celia kernel: [  464.205412]  [<ffffffffa0599dc3>] ? zfs_ioc_dataset_list_next+0xfa/0x11a [zfs]
Aug 19 22:44:25 celia kernel: [  464.205421]  [<ffffffffa059b7ba>] ? zfsdev_ioctl+0x33b/0x3f7 [zfs]
Aug 19 22:44:25 celia kernel: [  464.205424]  [<ffffffff8112f354>] ? do_vfs_ioctl+0x406/0x44f
Aug 19 22:44:25 celia kernel: [  464.205426]  [<ffffffff8112f3e6>] ? SyS_ioctl+0x49/0x77
Aug 19 22:44:25 celia kernel: [  464.205428]  [<ffffffff813bf788>] ? page_fault+0x28/0x30
Aug 19 22:44:25 celia kernel: [  464.205430]  [<ffffffff813c5f39>] ? system_call_fastpath+0x16/0x1b
Aug 19 22:44:25 celia kernel: [  464.205431] Code: 31 ed eb 2f 66 8b 13 0f b6 c6 0f b6 ca 83 e0 fe 39 c1 75 12 8d 8a 00 02 00 00 89 d0 f0 66 0f b1 0b 66 39 d0 74 66 bf 01 00 00 00 <48> ff c5 e8 ca 24 17 00 4c 39 e5 75 cc 48 c7 c6 b7 26 4d 81 48 
Aug 19 22:44:25 celia kernel: [  464.205448] NMI backtrace for cpu 5
Aug 19 22:44:25 celia kernel: [  464.205452] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P           O  3.15.0+tf.1 #2
Aug 19 22:44:25 celia kernel: [  464.205453] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
Aug 19 22:44:25 celia kernel: [  464.205455] task: ffff88041d98a000 ti: ffff88041d9b4000 task.ti: ffff88041d9b4000
Aug 19 22:44:25 celia kernel: [  464.205456] RIP: 0010:[<ffffffff8103ea63>]  [<ffffffff8103ea63>] native_safe_halt+0x2/0x3
Aug 19 22:44:25 celia kernel: [  464.205460] RSP: 0018:ffff88041d9b7eb0  EFLAGS: 00000282
Aug 19 22:44:25 celia kernel: [  464.205461] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
Aug 19 22:44:25 celia kernel: [  464.205462] RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000082
Aug 19 22:44:25 celia kernel: [  464.205463] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205464] R10: ffff88041fd4eb40 R11: 0000000000000400 R12: 0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205466] R13: ffff88041d9b7fd8 R14: ffff88041d9b4000 R15: ffff88041d9b7fd8
Aug 19 22:44:25 celia kernel: [  464.205467] FS:  00007f984ad9f700(0000) GS:ffff88041fd40000(0000) knlGS:0000000000000000
Aug 19 22:44:25 celia kernel: [  464.205468] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 19 22:44:25 celia kernel: [  464.205469] CR2: ffffffffff600400 CR3: 00000000dd170000 CR4: 00000000000007e0
Aug 19 22:44:25 celia kernel: [  464.205470] Stack:
Aug 19 22:44:25 celia kernel: [  464.205471]  ffffffff81017688 0000000000000020 0000000000000000 ffff88041d9b7fd8
Aug 19 22:44:25 celia kernel: [  464.205473]  ffffffff810177ad 0000000000000000 0000000500000000 ffff88041d9b7fd8
Aug 19 22:44:25 celia kernel: [  464.205474]  ffffffff8107ba73 ffffffff813bf01a 0000000000000010 0000000000000282
Aug 19 22:44:25 celia kernel: [  464.205476] Call Trace:
Aug 19 22:44:25 celia kernel: [  464.205479]  [<ffffffff81017688>] ? default_idle+0x44/0xa1
Aug 19 22:44:25 celia kernel: [  464.205481]  [<ffffffff810177ad>] ? amd_e400_idle+0xc8/0x109
Aug 19 22:44:25 celia kernel: [  464.205483]  [<ffffffff8107ba73>] ? cpu_startup_entry+0x3ad/0x3c6
Aug 19 22:44:25 celia kernel: [  464.205485]  [<ffffffff813bf01a>] ? _raw_spin_unlock_irqrestore+0x1a/0x21
Aug 19 22:44:25 celia kernel: [  464.205486] Code: d8 c3 0f 22 df c3 0f 20 e0 c3 0f 20 e0 c3 0f 22 e7 c3 44 0f 20 c0 c3 44 0f 22 c7 c3 0f 09 c3 9c 58 c3 57 9d c3 fa c3 fb c3 fb f4 <c3> f4 c3 8b 07 53 49 89 ca 49 89 d1 8b 0a 0f a2 89 07 89 1e 41 

@FransUrbo
Contributor Author

With zfs/master (and still with openzfs/spl#383), we're back to the previous problem:

[...]
Aug 19 22:58:25 celia kernel: [  549.453514] Call Trace:
Aug 19 22:58:25 celia kernel: [  549.453519]  [<ffffffff81080e0f>] ? do_raw_spin_lock+0xba/0x130
Aug 19 22:58:25 celia kernel: [  549.453522]  [<ffffffff813bea20>] ? rwsem_down_read_failed+0x2c/0xd9
Aug 19 22:58:25 celia kernel: [  549.453538]  [<ffffffffa05d6402>] ? dmu_buf_hold_noread+0x88/0xc5 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453541]  [<ffffffff811f43a4>] ? call_rwsem_down_read_failed+0x14/0x30
Aug 19 22:58:25 celia kernel: [  549.453542]  [<ffffffff813be4e3>] ? down_read+0xe/0xf
Aug 19 22:58:25 celia kernel: [  549.453554]  [<ffffffffa061d1f3>] ? zap_get_leaf_byblk+0x132/0x1c3 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453565]  [<ffffffffa061d2e2>] ? zap_deref_leaf+0x5e/0x67 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453577]  [<ffffffffa061e02a>] ? fzap_lookup+0xe4/0x101 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453587]  [<ffffffffa0620617>] ? zap_hashbits+0x5/0x15 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453598]  [<ffffffffa0620706>] ? zap_hash+0x9b/0xb1 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453609]  [<ffffffffa06208c9>] ? zap_name_alloc+0xba/0xcd [zfs]
Aug 19 22:58:25 celia kernel: [  549.453620]  [<ffffffffa06204bf>] ? mzap_addent+0xec/0xec [zfs]
Aug 19 22:58:25 celia kernel: [  549.453630]  [<ffffffffa0621f1b>] ? zap_lookup_norm+0x9c/0x160 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453641]  [<ffffffffa062204c>] ? zap_lookup+0x2a/0x2f [zfs]
Aug 19 22:58:25 celia kernel: [  549.453650]  [<ffffffffa0651d44>] ? zvol_get_stats+0x30/0xb5 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453660]  [<ffffffffa05f1151>] ? dsl_prop_nvlist_add_uint64+0x198/0x1a1 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453669]  [<ffffffffa05d808e>] ? dmu_objset_stats+0x2b/0x49 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453679]  [<ffffffffa062bfff>] ? zfs_ioc_objset_stats_impl+0x77/0xd1 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453689]  [<ffffffffa062c1cf>] ? zfs_ioc_objset_stats+0x2a/0x49 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453699]  [<ffffffffa05f0d9d>] ? dsl_pool_rele+0xe/0x1c [zfs]
Aug 19 22:58:25 celia kernel: [  549.453709]  [<ffffffffa062f55f>] ? zfs_ioc_dataset_list_next+0xfa/0x11a [zfs]
Aug 19 22:58:25 celia kernel: [  549.453719]  [<ffffffffa0630f56>] ? zfsdev_ioctl+0x33b/0x3f7 [zfs]
Aug 19 22:58:25 celia kernel: [  549.453721]  [<ffffffff8112f354>] ? do_vfs_ioctl+0x406/0x44f
Aug 19 22:58:25 celia kernel: [  549.453723]  [<ffffffff8112f3e6>] ? SyS_ioctl+0x49/0x77
Aug 19 22:58:25 celia kernel: [  549.453725]  [<ffffffff813bf788>] ? page_fault+0x28/0x30
Aug 19 22:58:25 celia kernel: [  549.453728]  [<ffffffff813c5f39>] ? system_call_fastpath+0x16/0x1b
[...]

@FransUrbo
Contributor Author

With additional, undocumented options to -T, the pool seems to have been imported, even though I got the usual stack trace.

@FransUrbo
Contributor Author

And as soon as I run zfs list I get:

Message from syslogd@celia at Aug 19 23:41:10 ...
 kernel:[  462.330024] BUG: spinlock lockup suspected on CPU#2, lt-zpool/5819

@FransUrbo
Contributor Author

Running zpool status gives correct output for the pool, with no errors and all devices marked ONLINE.

@behlendorf
Contributor

Ok. The issue looks the same in all those stacks. I'll post a patch tomorrow to dump the pid of the task holding the offending leaf lock. That should shed some light on this.

@FransUrbo
Contributor Author

@behlendorf Regarding perf record... Never used it before, what exact command do you want me to run?

@FransUrbo
Contributor Author

No idea what I'm doing, but running perf record zpool status after a zpool import has crashed and created a stack trace gives the following from perf report --header:

# ========
# captured on: Wed Aug 20 20:45:54 2014
# hostname : celia
# os release : 3.15.0+tf.1
# perf version : 3.15.0
# arch : x86_64
# nrcpus online : 6
# nrcpus avail : 6
# cpudesc : AMD Phenom(tm) II X6 1100T Processor
# cpuid : AuthenticAMD,16,10,0
# total memory : 16407472 kB
# cmdline : /usr/local/sbin/perf_3.15.0+tf record zpool status 
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, att
# HEADER_CPU_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, tracepoint = 2, ibs_fetch = 6, ibs_op = 7, breakpoint = 5
# ========
#
# Samples: 29  of event 'cycles'
# Event count (approx.): 16575342
#
# Overhead  Command       Shared Object                      Symbol
# ........  .......  ..................  ..........................
#
    43.06%      net  ld-2.13.so          [.] 0x0000000000009afa    
    36.29%      net  ld-2.13.so          [.] 0x000000000000977e    
     8.78%      net  [kernel.kallsyms]   [k] do_raw_spin_unlock    
     5.72%    zpool  libnvpair.so.1.0.1  [.] nvs_operation         
     5.47%    zpool  [kernel.kallsyms]   [k] get_page_from_freelist
     0.20%    zpool  [kernel.kallsyms]   [k] strncpy_from_user     
     0.18%    zpool  [ext3]              [k] ext3_xattr_get        
     0.15%    zpool  [kernel.kallsyms]   [k] run_timer_softirq     
     0.12%    zpool  [kernel.kallsyms]   [k] flush_tlb_mm_range    
     0.01%    zpool  [kernel.kallsyms]   [k] native_write_msr_safe 


#
# (For a higher level overview, try: perf report --sort comm,dso)
#

Running the suggested perf report --sort comm,dso:

# Samples: 29  of event 'cycles'
# Event count (approx.): 16575342
#
# Overhead  Command       Shared Object
# ........  .......  ..................
#
    79.36%      net  ld-2.13.so        
     8.78%      net  [kernel.kallsyms] 
     5.95%    zpool  [kernel.kallsyms] 
     5.72%    zpool  libnvpair.so.1.0.1
     0.18%    zpool  [ext3]            

No idea where the [ext3] comes from (other than that it's the FS on root).

@behlendorf
Contributor

@FransUrbo After enabling debugging I hit the following assert. I haven't had a chance to run down why yet. This hit before we got to the code for the debug patch I was intending to test, which can be found at behlendo/zfs@f4fd052e631637d13b3e5c26df2b150a364c9548. It should return the pid of the process holding the lock, if there is one.

[ 8351.079087] SPLError: 22946:0:(zap_micro.c:80:zap_hash()) ASSERTION(h != 0) failed
[ 8351.079167] SPLError: 22946:0:(zap_micro.c:80:zap_hash()) SPL PANIC
[ 8351.079215] SPL: Showing stack for process 22946
[ 8351.079218] CPU: 3 PID: 22946 Comm: zpool Tainted: P           O  3.15.0+tf.1 #2
[ 8351.079219] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1604 10/16/2012
[ 8351.079220]  0000000000000000 ffffffffa06c1f70 ffffffff813baa5c 0000000000000000
[ 8351.079223]  ffffffffa0144607 ffffffffa01584bf 0000000000000050 ffff8803cb657668
[ 8351.079224]  0000000000000000 ffffffffa06efb40 ffffffffa0150914 ffffffffa015a20c
[ 8351.079226] Call Trace:
[ 8351.079231]  [<ffffffff813baa5c>] ? dump_stack+0x41/0x51
[ 8351.079235]  [<ffffffffa0144607>] ? spl_debug_bug+0x86/0xcf [spl]
[ 8351.079239]  [<ffffffffa0150914>] ? spl_PANIC+0xa4/0xb7 [spl]
[ 8351.079242]  [<ffffffff81116727>] ? __kmalloc+0x108/0x1a4
[ 8351.079245]  [<ffffffffa0147828>] ? kmem_alloc_debug+0x13e/0x2f2 [spl]
[ 8351.079248]  [<ffffffffa0147828>] ? kmem_alloc_debug+0x13e/0x2f2 [spl]
[ 8351.079260]  [<ffffffffa05f7a75>] ? dbuf_update_data+0x5a/0x96 [zfs]
[ 8351.079268]  [<ffffffffa05f7bae>] ? dmu_buf_update_user+0xfd/0x10f [zfs]
[ 8351.079270]  [<ffffffff81116727>] ? __kmalloc+0x108/0x1a4
[ 8351.079272]  [<ffffffffa014784c>] ? kmem_alloc_debug+0x162/0x2f2 [spl]
[ 8351.079275]  [<ffffffffa014784c>] ? kmem_alloc_debug+0x162/0x2f2 [spl]
[ 8351.079287]  [<ffffffffa066f03c>] ? zap_hash+0x84/0x1bb [zfs]
[ 8351.079297]  [<ffffffffa066f8cb>] ? zap_name_alloc+0xbd/0xd1 [zfs]
[ 8351.079308]  [<ffffffffa0671ae7>] ? zap_lookup_norm+0x5b/0x160 [zfs]
[ 8351.079320]  [<ffffffffa063efc2>] ? refcount_remove_many+0x20b/0x275 [zfs]
[ 8351.079329]  [<ffffffffa06b23cb>] ? zvol_create_minor+0x71/0x71 [zfs]
[ 8351.079340]  [<ffffffffa0671c59>] ? zap_lookup+0x2a/0x2f [zfs]
[ 8351.079348]  [<ffffffffa06b095a>] ? __zvol_create_minor+0x18b/0x6eb [zfs]
[ 8351.079357]  [<ffffffffa06b23cb>] ? zvol_create_minor+0x71/0x71 [zfs]
[ 8351.079365]  [<ffffffffa06b23b1>] ? zvol_create_minor+0x57/0x71 [zfs]
[ 8351.079367]  [<ffffffff813bdfcc>] ? __mutex_unlock_slowpath+0x18/0x39
[ 8351.079375]  [<ffffffffa06b23d2>] ? zvol_create_minors_cb+0x7/0xc [zfs]
[ 8351.079384]  [<ffffffffa06095a2>] ? dmu_objset_find_impl+0x39c/0x3b4 [zfs]
[ 8351.079393]  [<ffffffffa06b23cb>] ? zvol_create_minor+0x71/0x71 [zfs]
[ 8351.079402]  [<ffffffffa060939b>] ? dmu_objset_find_impl+0x195/0x3b4 [zfs]
[ 8351.079410]  [<ffffffffa06b23cb>] ? zvol_create_minor+0x71/0x71 [zfs]
[ 8351.079419]  [<ffffffffa060939b>] ? dmu_objset_find_impl+0x195/0x3b4 [zfs]
[ 8351.079427]  [<ffffffffa06b23cb>] ? zvol_create_minor+0x71/0x71 [zfs]
[ 8351.079436]  [<ffffffffa060939b>] ? dmu_objset_find_impl+0x195/0x3b4 [zfs]
[ 8351.079445]  [<ffffffffa06b23cb>] ? zvol_create_minor+0x71/0x71 [zfs]
[ 8351.079453]  [<ffffffffa060939b>] ? dmu_objset_find_impl+0x195/0x3b4 [zfs]
[ 8351.079462]  [<ffffffffa06b23cb>] ? zvol_create_minor+0x71/0x71 [zfs]
[ 8351.079470]  [<ffffffffa06095fc>] ? dmu_objset_find+0x42/0x64 [zfs]
[ 8351.079483]  [<ffffffffa064cdab>] ? spa_open_common+0x2bb/0x2d3 [zfs]
[ 8351.079495]  [<ffffffffa064ce07>] ? spa_get_stats+0x35/0x26b [zfs]
[ 8351.079497]  [<ffffffff81116727>] ? __kmalloc+0x108/0x1a4
[ 8351.079506]  [<ffffffffa067fb3b>] ? zfs_ioc_pool_stats+0x1f/0x58 [zfs]
[ 8351.079516]  [<ffffffffa06843be>] ? zfsdev_ioctl+0x36e/0x42a [zfs]
[ 8351.079519]  [<ffffffff8112f354>] ? do_vfs_ioctl+0x406/0x44f
[ 8351.079521]  [<ffffffff8112f3e6>] ? SyS_ioctl+0x49/0x77
[ 8351.079523]  [<ffffffff813bf788>] ? page_fault+0x28/0x30
[ 8351.079525]  [<ffffffff813c5f39>] ? system_call_fastpath+0x16/0x1b
[ 8533.615248] INFO: task zpool:22946 blocked for more than 120 seconds.
[ 8533.615300]       Tainted: P           O  3.15.0+tf.1 #2

@FransUrbo
Contributor Author

After trying again - recreating the pool and pushing all filesystems (which worked just fine) - I then used my script to send individual (the last) snapshots using -p only.

Woke up this morning with a crashed kernel again :(

Aug 29 09:52:46 celia kernel: [  420.512632] INFO: rcu_sched detected stalls on CPUs/tasks: { 4} (detected by 3, t=15002 jiffies, g=7911, c=7910, q=345)
Aug 29 09:52:46 celia kernel: [  420.512934] Task dump for CPU 4:
Aug 29 09:52:46 celia kernel: [  420.512936] zpool           R  running task        0  5757   5471 0x00000008
Aug 29 09:52:46 celia kernel: [  420.512938]  0000000000000000 ffffffff81079a5b ffff8800c5b1d108 ffffffffa0920070
Aug 29 09:52:46 celia kernel: [  420.512940]  ffff880408097a40 ffffffff8107d6e7 ffff8800c5b1cdb8 ffffffffa1570bec
Aug 29 09:52:46 celia kernel: [  420.512942]  ffff88040c53ca00 ffffffffa09184fd ffff8803c6e36000 ffffffff81079a5b
Aug 29 09:52:46 celia kernel: [  420.512944] Call Trace:
Aug 29 09:52:46 celia kernel: [  420.512950]  [<ffffffff81079a5b>] ? __init_waitqueue_head+0x17/0x25
Aug 29 09:52:46 celia kernel: [  420.512957]  [<ffffffffa0920070>] ? __cv_init+0x7d/0xda [spl]
Aug 29 09:52:46 celia kernel: [  420.512960]  [<ffffffff8107d6e7>] ? __mutex_init+0x21/0x3e
Aug 29 09:52:46 celia kernel: [  420.512975]  [<ffffffffa1570bec>] ? zio_cons+0x39/0x93 [zfs]
Aug 29 09:52:46 celia kernel: [  420.512979]  [<ffffffffa09184fd>] ? spl_kmem_cache_alloc+0xbb/0xc36 [spl]
Aug 29 09:52:46 celia kernel: [  420.512981]  [<ffffffff81079a5b>] ? __init_waitqueue_head+0x17/0x25
Aug 29 09:52:46 celia kernel: [  420.512984]  [<ffffffffa0920070>] ? __cv_init+0x7d/0xda [spl]
Aug 29 09:52:46 celia kernel: [  420.512985]  [<ffffffff8107d6e7>] ? __mutex_init+0x21/0x3e
Aug 29 09:52:46 celia kernel: [  420.512993]  [<ffffffffa14f83ee>] ? dbuf_rele_and_unlock+0x1fc/0x206 [zfs]
Aug 29 09:52:46 celia kernel: [  420.512996]  [<ffffffffa09184fd>] ? spl_kmem_cache_alloc+0xbb/0xc36 [spl]
Aug 29 09:52:46 celia kernel: [  420.512999]  [<ffffffffa09159bc>] ? spl_kmem_cache_free+0x8a/0x359 [spl]
Aug 29 09:52:46 celia kernel: [  420.513001]  [<ffffffff81079a5b>] ? __init_waitqueue_head+0x17/0x25
Aug 29 09:52:46 celia kernel: [  420.513004]  [<ffffffffa0920070>] ? __cv_init+0x7d/0xda [spl]
Aug 29 09:52:46 celia kernel: [  420.513007]  [<ffffffff81123fea>] ? __kmalloc+0x108/0x1a4
Aug 29 09:52:46 celia kernel: [  420.513010]  [<ffffffffa09163ba>] ? kmem_alloc_debug+0x20e/0x43e [spl]
Aug 29 09:52:46 celia kernel: [  420.513018]  [<ffffffffa14f6f1e>] ? dbuf_find+0xac/0xbe [zfs]
Aug 29 09:52:46 celia kernel: [  420.513020]  [<ffffffff81123fea>] ? __kmalloc+0x108/0x1a4
Aug 29 09:52:46 celia kernel: [  420.513023]  [<ffffffffa09163ba>] ? kmem_alloc_debug+0x20e/0x43e [spl]
Aug 29 09:52:46 celia kernel: [  420.513030]  [<ffffffffa14f6f1e>] ? dbuf_find+0xac/0xbe [zfs]
Aug 29 09:52:46 celia kernel: [  420.513033]  [<ffffffffa0915d50>] ? kmem_free_debug+0xc5/0x10d [spl]
Aug 29 09:52:46 celia kernel: [  420.513041]  [<ffffffffa14f8c4a>] ? __dbuf_hold_impl+0x387/0x3b1 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513048]  [<ffffffffa14f7a86>] ? dbuf_read+0x10c/0x553 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513051]  [<ffffffff81206c05>] ? delay_tsc+0x28/0x88
Aug 29 09:52:46 celia kernel: [  420.513053]  [<ffffffff8107ebe9>] ? do_raw_spin_lock+0xba/0x135
Aug 29 09:52:46 celia kernel: [  420.513055]  [<ffffffff813dc705>] ? rwsem_down_read_failed+0x2c/0xd9
Aug 29 09:52:46 celia kernel: [  420.513064]  [<ffffffffa14ff3cc>] ? dmu_buf_hold_noread+0x88/0xc5 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513066]  [<ffffffff81207c14>] ? call_rwsem_down_read_failed+0x14/0x30
Aug 29 09:52:46 celia kernel: [  420.513068]  [<ffffffff813dc326>] ? down_read+0xe/0xf
Aug 29 09:52:46 celia kernel: [  420.513080]  [<ffffffffa154645f>] ? zap_get_leaf_byblk+0x132/0x1c3 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513092]  [<ffffffffa154654e>] ? zap_deref_leaf+0x5e/0x67 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513103]  [<ffffffffa1547296>] ? fzap_lookup+0xe4/0x101 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513115]  [<ffffffffa154988c>] ? zap_hashbits+0x5/0x15 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513126]  [<ffffffffa154997b>] ? zap_hash+0x9b/0xb1 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513138]  [<ffffffffa1549b3e>] ? zap_name_alloc+0xba/0xcd [zfs]
Aug 29 09:52:46 celia kernel: [  420.513149]  [<ffffffffa1549734>] ? mzap_addent+0xef/0xef [zfs]
Aug 29 09:52:46 celia kernel: [  420.513161]  [<ffffffffa154b196>] ? zap_lookup_norm+0x9c/0x160 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513172]  [<ffffffffa154b2c7>] ? zap_lookup+0x2a/0x2f [zfs]
Aug 29 09:52:46 celia kernel: [  420.513182]  [<ffffffffa157b0f3>] ? zvol_get_stats+0x30/0xb5 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513193]  [<ffffffffa151a271>] ? dsl_prop_nvlist_add_uint64+0x198/0x1a1 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513202]  [<ffffffffa1501086>] ? dmu_objset_stats+0x2b/0x49 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513212]  [<ffffffffa1555263>] ? zfs_ioc_objset_stats_impl+0x77/0xd1 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513223]  [<ffffffffa1555433>] ? zfs_ioc_objset_stats+0x2a/0x49 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513234]  [<ffffffffa1519ebd>] ? dsl_pool_rele+0xe/0x1c [zfs]
Aug 29 09:52:46 celia kernel: [  420.513244]  [<ffffffffa15587a9>] ? zfs_ioc_dataset_list_next+0xfa/0x11a [zfs]
Aug 29 09:52:46 celia kernel: [  420.513255]  [<ffffffffa155a1db>] ? zfsdev_ioctl+0x33b/0x3f7 [zfs]
Aug 29 09:52:46 celia kernel: [  420.513258]  [<ffffffff8113cecc>] ? do_vfs_ioctl+0x406/0x44f
Aug 29 09:52:46 celia kernel: [  420.513260]  [<ffffffff81063fed>] ? task_work_run+0x78/0x87
Aug 29 09:52:46 celia kernel: [  420.513262]  [<ffffffff8113cf5e>] ? SyS_ioctl+0x49/0x77
Aug 29 09:52:46 celia kernel: [  420.513264]  [<ffffffff813df7c8>] ? page_fault+0x28/0x30
Aug 29 09:52:46 celia kernel: [  420.513266]  [<ffffffff813dd929>] ? system_call_fastpath+0x16/0x1b

Script basically only do

NC_OPT_L="-l -p $PORT $SSH_HOST"
NC_OPT_R="-n"
            [...]
            (ssh -n $SSH_HOST nc $NC_OPT_L \| zfs receive -uvF \"$dset\" 2>&1 | tee /tmp/rcv.$$-1) &
            [...]
            (zfs send -p "$local_snap" | nc $NC_OPT_R $SSH_HOST $PORT 2>&1 | tee /tmp/rcv.$$-2)
            [...]
            grep -Eq 'invalid backup stream|Invalid exchange|invalid name' /tmp/rcv.$$* && [run this again]

So it seems even -p messes with the pool - it is no longer importable :(. I'm going to try this yet again, this time without -p...

@FransUrbo
Contributor Author

Adding a dprintf() to get h just before the ASSERT() that is triggered:

http://bayour.com/misc/spl-log.1410725511.550615.txt

@FransUrbo
Contributor Author

I think I found the problem...

01000000:00000080:3:1410726240.962462:0:6132:0:(dbuf.c:1827:dbuf_create()) ds=share/VirtualMachines/Debian/Sid/ZFSTest-SCST obj=1 lvl=0 blkid=157 db=ffff8803f98a37a8
01000000:00000080:0:1410726240.963228:0:5474:0:(dbuf.c:1827:dbuf_create()) ds=share/VirtualMachines/Debian/Wheezy/64 obj=2 lvl=0 blkid=0 db=ffff8803f2e37b78
00001000:00000010:0:1410726240.963239:0:5474:0:(zap_micro.c:82:zap_hash()) ASSERTION(h != 0) failed
00001000:00000010:0:1410726240.963357:0:5474:0:(zap_micro.c:82:zap_hash()) SPL PANIC
01000000:00000080:5:1410726240.963821:0:6144:0:(dbuf.c:1827:dbuf_create()) ds=share/VirtualMachines/Debian/Sid/Devel obj=1 lvl=0 blkid=158 db=ffff8803f2ed6048

This dataset (share/VirtualMachines/Debian/Wheezy/64) is ... "broken" somehow. It is (was) of 0 size. I deleted it a few days ago on the FreeNAS9 machine (because I was getting messages on the console about it).

And I don't have a 'backup' of the metadata for it. I just remember seeing that it was 0. I think it was the volsize that was 0 (it's supposed to be 15GB)...

So apparently the FreeNAS9 machine happily sent the ZVOL to the ZoL machine, which crashed after receiving it. And now, when it tries to import the pool, it sees this ZVOL and borks...

@FransUrbo
Contributor Author

Thinking even harder (and looking/grep'ing at the metadata I've kept about this whole thing, ever since the original crash back in January) I'm starting to remember that all the snapshots for that ZVOL were also broken.

How I got it over to the FreeNAS9 machine is still a mystery, but I think I had to snapshot it there (did a recursive snap from the top) and then push that back...

@FransUrbo
Contributor Author

And getting the extra dprintf() lines correct (doh blush) - http://bayour.com/misc/spl-log.1410728040.29116.txt:

01000000:00000080:4:1410728032.624322:0:28481:0:(dbuf.c:1827:dbuf_create()) ds=share/VirtualMachines/Debian/Wheezy/64 obj=2 lvl=0 blkid=0 db=ffff880401ea27e8
01000000:00000080:4:1410728032.624333:0:28481:0:(zap_micro.c:81:zap_hash()) => 0 != 0
00001000:00000010:4:1410728032.624334:0:28481:0:(zap_micro.c:82:zap_hash()) ASSERTION(h != 0) failed

@FransUrbo
Contributor Author

Commenting out the ASSERT gives:

# zpool import -d /dev/disk/by-id share

Message from syslogd@celia at Sep 14 21:06:51 ...
 kernel:[  463.585538] SPLError: 28153:0:(zap.c:580:zap_deref_leaf()) VERIFY3(zap->zap_u.zap_fat.zap_phys->zap_magic == 0x2F52AB2ABULL) failed (0 == 12703150763)

Message from syslogd@celia at Sep 14 21:06:51 ...
 kernel:[  463.585673] SPLError: 28153:0:(zap.c:580:zap_deref_leaf()) SPL PANIC

@behlendorf
Contributor

@FransUrbo That dataset/zvol is clearly damaged on disk. You could probably patch zap_hash() to return an error instead of hitting the ASSERT. Then just make sure an error gets returned all the way to the ioctl(). This might allow you to import the rest of the pool so you can cleanly destroy this broken dataset.

@FransUrbo
Contributor Author

I tried making it return 0, but that's obviously not enough. Thanks for the tip on "patch zap_hash() to return an error ... all the way to the ioctl()". That's probably something I could do. Doesn't seem too difficult.

But shouldn't we investigate this further? A broken ZVOL that causes this much trouble...? It's obviously not that big of a problem on *BSD...

Any information or something else you want me to take before I go on trying to destroy it?

@FransUrbo
Contributor Author

Dang! It seems there was a downside to doing that... The ZVOL is not in the list, and both a get all and a destroy -r give:

# zfs get all share/VirtualMachines/Debian/Wheezy/64
cannot open 'share/VirtualMachines/Debian/Wheezy/64': unsupported version or feature
# zfs destroy -r share/VirtualMachines/Debian/Wheezy/64
cannot open 'share/VirtualMachines/Debian/Wheezy/64': unsupported version or feature

@FransUrbo
Contributor Author

Not sure if that's because of the way I did it:

diff --git a/module/zfs/zap_micro.c b/module/zfs/zap_micro.c
index 5d812bd..fd0da2e 100644
--- a/module/zfs/zap_micro.c
+++ b/module/zfs/zap_micro.c
@@ -77,7 +77,8 @@ zap_hash(zap_name_t *zn)
                h = *(uint64_t *)zn->zn_key_orig;
        } else {
                h = zap->zap_salt;
-               ASSERT(h != 0);
+               if (h == 0)
+                       return(-1);
                ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);

                if (zap_getflags(zap) & ZAP_FLAG_UINT64_KEY) {
@@ -196,6 +197,8 @@ zap_name_alloc(zap_t *zap, const char *key, matchtype_t mt)
        }

        zn->zn_hash = zap_hash(zn);
+       if (zn->zn_hash == -1)
+               return (NULL);
        return (zn);
 }
diff --git a/module/zfs/zfs_ioctl.c b/module/zfs/zfs_ioctl.c
index ddd5b0d..906f2bb 100644
--- a/module/zfs/zfs_ioctl.c
+++ b/module/zfs/zfs_ioctl.c
@@ -2013,7 +2013,7 @@ zfs_ioc_objset_stats_impl(zfs_cmd_t *zc, objset_t *os)
                if (!zc->zc_objset_stats.dds_inconsistent &&
                    dmu_objset_type(os) == DMU_OST_ZVOL) {
                        error = zvol_get_stats(os, nv);
-                       if (error == EIO)
+                       if (error == EIO || error == ENOTSUP)
                                return (error);
                        VERIFY0(error);
                }

ENOTSUP because zvol_get_stats() calls zap_lookup(), which calls zap_name_alloc(); that's where I return NULL, which leads zap_lookup() to return (SET_ERROR(ENOTSUP))...

This gives:

# zpool import -d /dev/disk/by-id share
cannot iterate filesystems: unsupported version or feature
*** glibc detected *** zpool: free(): invalid pointer: 0x00000000009bc7c0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76d76)[0x7f9d4d9ddd76]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x6c)[0x7f9d4d9e2b1c]
/lib64/libnvpair.so.1(nvlist_free+0x68)[0x7f9d4ef56978]
/lib64/libzfs.so.2(zfs_close+0x21)[0x7f9d4e765561]
/lib64/libzfs.so.2(zpool_enable_datasets+0x198)[0x7f9d4e773298]
zpool[0x407965]
zpool[0x40dbb9]
zpool[0x40566c]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f9d4d985ead]
zpool[0x405791]
[....]

@FransUrbo
Contributor Author

The pool got imported in any case and most (625 out of 640) filesystems are mountable:

# zfs mount -a
cannot iterate filesystems: unsupported version or feature

So returning ENOTSUP from the ioctl is obviously not correct...

@FransUrbo
Contributor Author

Crud! All this messing about eventually led to The pool metadata is corrupted. Ah, well. This time the broken ZVOL isn't there, so a complete resend should be good...

@FransUrbo
Contributor Author

After starting from scratch and initiating the send, I decided to take a nap. Four hours later, I woke up to a machine with a locked CPU. A hard lock, if not for nmi_watchdog=0....

Importing the pool went mostly ok, although I got

Sep 15 12:58:07 celia kernel: [ 1205.770239] SPLError: 8305:0:(spl-err.c:88:vcmn_err()) WARNING: can't open objset for share/VirtualMachines/MOSIX/Disk1

It's not in the list of filesystems/volumes...

@FransUrbo
Contributor Author

Starting the send again, skipping (with my -s option to receive) the ones I already have, it receives 5 ZVOLs and then

cannot receive incremental stream: dataset is busy

@FransUrbo
Contributor Author

Looks like that part is related to #1761.

@FransUrbo
Contributor Author

After a couple of crashes, reboots and imports (these now work without any problem) I've started to sync the filesystems.

I'm now up to my second crash/hard lock (nmi_watchdog=1 doesn't help any longer - the ssh session dies and I can't log in on the terminals).

Memory (usage) problems?

                              zfsARCHits zfsARCMisses zfsARCDemandDataHits zfsARCDemandDataMisses zfsARCDemandMetaHits zfsARCDemandMetaMisses zfsARCPrefetchDataHits zfsARCPrefetchDataMisses zfsARCPrefetchMetaHits zfsARCPrefetchMetaMisses zfsARCL2Hits zfsARCL2Misses
Tue Sep 16 23:50:01 CEST 2014    9018235      8783262                11086                   3021              7758258                7620180                   2682                     8741                1246209                  1151320            0              0
Tue Sep 16 23:55:01 CEST 2014    9018235      8783262                11086                   3021              7758258                7620180                   2682                     8741                1246209                  1151320            0              0
Wed Sep 17 00:00:01 CEST 2014    9018235      8783262                11086                   3021              7758258                7620180                   2682                     8741                1246209                  1151320            0              0
Wed Sep 17 00:05:01 CEST 2014    9164504      8960353                11086                   3021              7904527                7797271                   2682                     8741                1246209                  1151320            0              0
Wed Sep 17 00:10:01 CEST 2014    9164504      8960353                11086                   3021              7904527                7797271                   2682                     8741                1246209                  1151320            0              0
Wed Sep 17 00:15:01 CEST 2014    9308369      9135334                11086                   3021              8048390                7972242                   2682                     8741                1246211                  1151330            0              0
Wed Sep 17 00:20:01 CEST 2014    9308369      9135334                11086                   3021              8048390                7972242                   2682                     8741                1246211                  1151330            0              0
Wed Sep 17 00:25:01 CEST 2014    9449746      9308571                11086                   3021              8189767                8145461                   2682                     8741                1246211                  1151348            0              0
Wed Sep 17 00:30:01 CEST 2014    9449746      9308571                11086                   3021              8189767                8145461                   2682                     8741                1246211                  1151348            0              0
Wed Sep 17 00:35:02 CEST 2014    9595614      9485911                11086                   3021              8335635                8322797                   2682                     8741                1246211                  1151352            0              0
Wed Sep 17 00:40:01 CEST 2014    9595614      9485911                11086                   3021              8335635                8322797                   2682                     8741                1246211                  1151352            0              0
Wed Sep 17 00:45:01 CEST 2014    9738269      9659272                11086                   3021              8478290                8496140                   2682                     8741                1246211                  1151370            0              0
Wed Sep 17 00:50:01 CEST 2014    9738269      9659272                11086                   3021              8478290                8496140                   2682                     8741                1246211                  1151370            0              0
Wed Sep 17 00:55:01 CEST 2014
                              zfsARCMetaMax zfsARCMetaUsed zfsARCMetaLimit zfsARCCMax zfsARCSize
Tue Sep 16 23:50:01 CEST 2014    2006069880     1908940640      2005423616 4105553920 4105576288
Tue Sep 16 23:55:01 CEST 2014    2006069880     1908940640      2005423616 4105553920 4105576288
Wed Sep 17 00:00:01 CEST 2014    2006069880     1908940640      2005423616 4105553920 4105576288
Wed Sep 17 00:05:01 CEST 2014    2006069880     1970166760      2005423616 4105553920 4105657832
Wed Sep 17 00:10:01 CEST 2014    2006069880     1970166760      2005423616 4105553920 4105657832
Wed Sep 17 00:15:01 CEST 2014    2006069880     2005325224      2005423616 4105553920 4105622952
Wed Sep 17 00:20:01 CEST 2014    2006069880     2005325224      2005423616 4105553920 4105622952
Wed Sep 17 00:25:01 CEST 2014    2006069880     2005402112      2005423616 4105553920 4105568312
Wed Sep 17 00:30:01 CEST 2014    2006069880     2005402112      2005423616 4105553920 4105568312
Wed Sep 17 00:35:02 CEST 2014    2006069880     2005306904      2005423616 4105553920 4105473560
Wed Sep 17 00:40:01 CEST 2014    2006069880     2005306904      2005423616 4105553920 4105473560
Wed Sep 17 00:45:01 CEST 2014    2006069880     2005287376      2005423616 4105553920 4105585104
Wed Sep 17 00:50:01 CEST 2014    2006069880     2005287376      2005423616 4105553920 4105585104
Wed Sep 17 00:55:01 CEST 2014
                              zfsZILCommitCnt zfsZILCommitWriterCnt zfsZILITXCnt zfsZILITXIndirectCnt zfsZILITXIndirectBytes zfsZILITXCopiedCnt zfsZILITXCopiedBytes zfsZILITXNeedcopyCnt zfsZILITXNeedcopyBytes zfsZILITXMetaslabNormalCnt zfsZILITXMetaslabNormalBytes zfsZILITXMetaslabSLOGCnt zfsZILITXMetaslabSLOGBytes
Tue Sep 16 23:50:01 CEST 2014             481                   481            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Tue Sep 16 23:55:01 CEST 2014             481                   481            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:00:01 CEST 2014             481                   481            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:05:01 CEST 2014             485                   485            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:10:01 CEST 2014             485                   485            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:15:01 CEST 2014             487                   487            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:20:01 CEST 2014             487                   487            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:25:01 CEST 2014             487                   487            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:30:01 CEST 2014             487                   487            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:35:02 CEST 2014             491                   491            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:40:01 CEST 2014             491                   491            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:45:01 CEST 2014             491                   491            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:50:01 CEST 2014             491                   491            0                    0                      0                  0                    0                    0                      0                          0                            0                        0                          0
Wed Sep 17 00:55:01 CEST 2014

@FransUrbo
Contributor Author

Yet another crash. This time I got the advice to use netconsole. But of course, I wasn't running debugging modules :(. I'll install those and try again. In the meantime, here's the output on the remote host:

http://bayour.com/misc/screenlog.0.txt

@FransUrbo
Contributor Author

Running with debugging modules:

http://bayour.com/misc/screenlog.2.txt

This time, only 9 snapshots...

freenas# grep ^received /share/x
received 34.1GB stream in 390 seconds (89.5MB/sec)
received 312B stream in 1 seconds (312B/sec)
received 122KB stream in 1 seconds (122KB/sec)
received 33.9GB stream in 385 seconds (90.2MB/sec)
received 312B stream in 1 seconds (312B/sec)
received 312B stream in 1 seconds (312B/sec)
received 16.1GB stream in 187 seconds (88.2MB/sec)
received 312B stream in 1 seconds (312B/sec)
received 312B stream in 1 seconds (312B/sec)
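As an aside, the per-transfer figures in a receive log like the one above can be totalled with a small shell helper. This is just a sketch: the exact `received ... stream in ... seconds` wording is assumed from the output shown, and `sum_received` is a made-up name.

```shell
# Sketch: total the sizes and times from "received ... stream" log lines.
# Reads log lines on stdin; the line format is assumed from the output above.
sum_received() {
    awk '/^received/ {
        size = $2; secs = $5
        # Normalize the size suffix (B/KB/MB/GB) to bytes.
        mult = 1
        if (size ~ /KB$/)      mult = 1024
        else if (size ~ /MB$/) mult = 1048576
        else if (size ~ /GB$/) mult = 1073741824
        sub(/[KMG]?B$/, "", size)
        bytes += size * mult
        total += secs
    }
    END { printf "%.1f GB in %d seconds\n", bytes / 1073741824, total }'
}
```

Running it over the log above gives the aggregate size and wall time of the whole send/receive session.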

@FransUrbo
Copy link
Contributor Author

Using MD+LVM+XFS in the same type of layout as the previous zpool, I managed to copy ALL data without so much as a hiccup or warning. Not a peep from the system!

Going back to a zpool, this time I decided to use find | cpio from an SMB share. This gave a stacktrace (EDIT: updated url - http://bayour.com/misc/screenlog.7.txt) almost immediately, but the kernel, the machine and the zpool all continued to work, so I did nothing about it.

It's nowhere near finished (~2TB of ~22TB). It's going to take ages, but it seems to work.

@behlendorf
Copy link
Contributor

@FransUrbo you must be running with #2387, which clearly introduced this NULL dereference. The function in question, vdev_disk_time_out_handler(), was added by that patch. I'd suggest reverting that patch.

@FransUrbo
Copy link
Contributor Author

@behlendorf You mean for the find|cpio problem or the whole thing?

@behlendorf
Copy link
Contributor

@FransUrbo the kernel NULL dereference basically means you've halted that process. If it's holding any resources or locks that other processes need, your system may deadlock. You can certainly continue to limp along, just be aware of that. Reverting the vdev timeout patch will resolve the NULL dereference you hit.
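Backing out just the vdev timeout patch amounts to a single revert in the local build tree before rebuilding the modules. A minimal sketch, where the commit id argument is whichever commit introduced #2387 in your checkout and `revert_patch` is a made-up helper:

```shell
# Sketch: back out one patch from a local source tree, leaving a
# revert commit behind. "$1" is the offending commit id.
revert_patch() {
    git -c user.name=build -c user.email=build@localhost \
        revert --no-edit "$1"
}
```

After the revert, rebuild and reload the modules as usual for your setup.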

@FransUrbo
Copy link
Contributor Author

Btw, I see that I never changed the url in #2602 (comment). It should have been http://bayour.com/misc/screenlog.7.txt.

Removing #2387 and trying again gives http://bayour.com/misc/screenlog.8.txt after a while.

@FransUrbo
Copy link
Contributor Author

The conclusion must be that this was a hardware issue at heart...

Borrowed a second machine (a dedicated server) from my friend, but with my disks. The transfer was successful. It only crashed twice, both times completely unrelated to ZoL.

@behlendorf
Copy link
Contributor

I'm glad you got it sorted out, although I was hoping for a more satisfying answer...

ryao pushed a commit to ryao/zfs that referenced this issue Nov 29, 2014
If a non-ZAP object is passed to zap_lockdir() it will be treated
as a valid ZAP object.  This can result in zap_lockdir() attempting
to read what it believes are leaf blocks from invalid disk locations.
The SCSI layer will eventually generate errors for these bogus IOs
but the caller will hang in zap_get_leaf_byblk().

The good news is that this is a situation which cannot occur
unless the pool has been damaged.  The bad news is that there are
reports from both FreeBSD and Solaris of damaged pools.  Specifically,
there are normal files in the filesystem which reference another
normal file as their parent.

Since pools like this are known to exist, the zap_lockdir() function
has been updated to verify the type of the object.  If a non-ZAP
object has been passed, EINVAL will be returned immediately.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#2597
Issue openzfs#2602