Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

txg_sync hung task and deadlock when trying to shrink ARC #4319

Closed
perfinion opened this issue Feb 8, 2016 · 23 comments
Closed

txg_sync hung task and deadlock when trying to shrink ARC #4319

perfinion opened this issue Feb 8, 2016 · 23 comments
Labels
Component: Memory Management kernel memory management Status: Inactive Not being actively updated Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@perfinion
Copy link
Contributor

I have intermittently been hitting a deadlock and finally setup a VM to reproduce it and get real stacktraces. This is done on master for SPL and ZFS over the weekend so it is completely up to date. The VM has 1GB ram and the root drive is on ext4 and set up a mirror zpool on two 1GB disks and a couple random datasets in it. I then rsync'd the portage tree (lots of little files) into /tank/.

The rough steps to trigger it are

  1. make sure the ARC is as big as possible (find /tank -exec cat {} + > /dev/null 2>&1 did that nicely).
  2. having a scrub run helped too.
  3. arcstat.py shows the arc is using lots of space and htop shows ram as almost completely used up.
  4. I wrote a little C program to malloc() a lot of ram and mlockall() then just while(1) sleep().
  5. kswapd starts using CPU time and ARC starts to shrink a bit and if run for long enough everything will completely lock up then the kernel's hung task detection kicks in and reboots the machine. The OOM killer should kill a few of the eatram processes but never gets there.

I have been following #4106 and #4166 but am still hitting this :(.

I have the gentoo hardened 4.3.3-r4 kernel but I dont think the kernel version matters since I have been hitting it for a while.

I think this is the most relevant stacktrace:

[ 3688.586051] txg_sync        D ffff88003d9fc630  9784 20739      2 0x00000000
[ 3688.586051]  ffffc90000413a68 0000000000000046 0000000000000000 ffff880003b09740
[ 3688.586051]  ffffc90000413a70 ffff88003e10c080 ffff88003d9fc080 ffffffffc0478a80
[ 3688.586051]  ffffffffa28224ea e103ae4808b3649e ffff88003d9fcaf8 7fffffffffffffff
[ 3688.586051] Call Trace:
[ 3688.586051]  [<ffffffffc0478a80>] ? zio_taskq_member.isra.4.constprop.11+0x70/0x70 [zfs]
[ 3688.586051]  [<ffffffffa28224ea>] ? schedule_timeout+0x25a/0x310
[ 3688.586051]  [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.586051]  [<ffffffffa28224ea>] schedule_timeout+0x25a/0x310
[ 3688.586051]  [<ffffffffa2100ea4>] ? preempt_count_add+0x54/0xa0
[ 3688.586051]  [<ffffffffa2048383>] ? kvm_clock_read+0x23/0x50
[ 3688.586051]  [<ffffffffa20483b9>] ? kvm_clock_get_cycles+0x9/0x20
[ 3688.586051]  [<ffffffffa2140a2d>] ? ktime_get+0x4d/0xd0
[ 3688.586051]  [<ffffffffa281cff1>] io_schedule_timeout+0xb1/0x130
[ 3688.586051]  [<ffffffffc000c9a7>] cv_wait_common+0x117/0x2f0 [spl]
[ 3688.586051]  [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.586051]  [<ffffffffc000cc07>] __cv_wait_io+0x27/0x40 [spl]
[ 3688.586051]  [<ffffffffc047c6a1>] zio_wait+0x181/0x360 [zfs]
[ 3688.586051]  [<ffffffffc03e3458>] dsl_pool_sync+0x118/0x5f0 [zfs]
[ 3688.586051]  [<ffffffffc040647d>] spa_sync+0x36d/0xe10 [zfs]
[ 3688.586051]  [<ffffffffa2048383>] ? kvm_clock_read+0x23/0x50
[ 3688.586051]  [<ffffffffc041dad5>] txg_sync_thread+0x3c5/0x6a0 [zfs]
[ 3688.586051]  [<ffffffffc041d710>] ? txg_fini+0x2e0/0x2e0 [zfs]
[ 3688.586051]  [<ffffffffc00051d5>] thread_generic_wrapper+0x95/0xe0 [spl]
[ 3688.586051]  [<ffffffffc0005140>] ? __thread_exit+0x20/0x20 [spl]
[ 3688.586051]  [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3688.586051]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.586051]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.586051]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006] kswapd0         D ffffffffa2107ef0 12192   129      2 0x00000000
[ 3688.132006]  ffffc9000041b888 0000000000000046 ffffffffa2100e10 ffffc9000041b8c8
[ 3688.132006]  ffffc900000135d8 ffffffffa2c04980 ffff88003da38ac0 ffffc9000041b868
[ 3688.132006]  ffffffffc000ca27 72e0da80ceb50efd ffff88003da39538 ffff88003d32c290
[ 3688.132006] Call Trace:
[ 3688.132006]  [<ffffffffa2100e10>] ? get_parent_ip+0x10/0x50
[ 3688.132006]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006]  [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.132006]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006]  [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.132006]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 3688.132006]  [<ffffffffc041d1d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 3688.132006]  [<ffffffffc03c2fe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 3688.132006]  [<ffffffffc03c30be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 3688.132006]  [<ffffffffc0464fab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 3688.132006]  [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3688.132006]  [<ffffffffc0488c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 3688.132006]  [<ffffffffc04a4580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 3688.132006]  [<ffffffffa226938f>] evict+0xbf/0x1a0
[ 3688.132006]  [<ffffffffa22694a8>] dispose_list+0x38/0x70
[ 3688.132006]  [<ffffffffa226a961>] prune_icache_sb+0x61/0x90
[ 3688.132006]  [<ffffffffa224a398>] super_cache_scan+0x278/0x450
[ 3688.132006]  [<ffffffffa21f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 3688.132006]  [<ffffffffa21de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 3688.132006]  [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3688.132006]  [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3688.132006]  [<ffffffffa21e1f5f>] kswapd+0x53f/0x9a0
[ 3688.132006]  [<ffffffffa21e1a20>] ? mem_cgroup_shrink_node_zone+0x200/0x200
[ 3688.132006]  [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3688.132006]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.132006]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006] spl_dynamic_tas D ffff88002aec85b0 14192 20666      2 0x00000000
[ 3688.132006]  ffffc90005313b38 0000000000000046 ffffffffa210e184 0000000000000000
[ 3688.132006]  ffff88003fd12c40 ffff88003e108ac0 ffff88002aec8000 0000000000000000
[ 3688.132006]  ffffffffa28224ea 9d481ca78e65b67c ffff88002aec8a78 7fffffffffffffff
[ 3688.132006] Call Trace:
[ 3688.132006]  [<ffffffffa210e184>] ? enqueue_entity+0x4f4/0xc40
[ 3688.132006]  [<ffffffffa28224ea>] ? schedule_timeout+0x25a/0x310
[ 3688.132006]  [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.132006]  [<ffffffffa28224ea>] schedule_timeout+0x25a/0x310
[ 3688.132006]  [<ffffffffa281ef37>] ? wait_for_completion_killable+0x47/0x1f0
[ 3688.132006]  [<ffffffffa281f007>] wait_for_completion_killable+0x117/0x1f0
[ 3688.132006]  [<ffffffffa2822290>] ? usleep_range+0x90/0x90
[ 3688.132006]  [<ffffffffa20ffb10>] ? wake_up_q+0x80/0x80
[ 3688.132006]  [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 3688.132006]  [<ffffffffc0013882>] ? __FUNCTION__.26687+0x39a1/0x49a7 [spl]
[ 3688.132006]  [<ffffffffa20f4836>] kthread_create_on_node+0xf6/0x1c0
[ 3688.132006]  [<ffffffffa2463c57>] ? string.isra.4+0x47/0xe0
[ 3688.132006]  [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 3688.586051]  [<ffffffffc00052bb>] spl_kthread_create+0x9b/0xf0 [spl]
[ 3688.586051]  [<ffffffffc0007639>] taskq_thread_create+0x69/0x110 [spl]
[ 3688.586051]  [<ffffffffc00076f5>] taskq_thread_spawn_task+0x15/0x40 [spl]
[ 3688.586051]  [<ffffffffc0006bb2>] taskq_thread+0x2a2/0x5c0 [spl]
[ 3688.586051]  [<ffffffffa20ffb10>] ? wake_up_q+0x80/0x80
[ 3688.586051]  [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 3688.586051]  [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3688.586051]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.586051]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.586051]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0

[ 3688.132006] khugepaged      D ffff88003fd12140 11856    37      2 0x00000000
[ 3688.132006]  ffffc9000013b5d8 0000000000000046 ffffffffa2100e10 0000000000000000
[ 3688.132006]  0000000000000002 ffff88003d98eb80 ffff88003e2bb5c0 0000000000000001
[ 3688.132006]  ffffffffc000ca27 02d1924d7a783fd3 ffff88003e2bc038 ffff88003d32c290
[ 3688.132006] Call Trace:
[ 3688.132006]  [<ffffffffa2100e10>] ? get_parent_ip+0x10/0x50
[ 3688.132006]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006]  [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.132006]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006]  [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.132006]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 3688.132006]  [<ffffffffc041d1d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 3688.132006]  [<ffffffffc03c2fe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 3688.132006]  [<ffffffffc03c30be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 3688.132006]  [<ffffffffc0464fab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 3688.132006]  [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3688.132006]  [<ffffffffc0488c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 3688.132006]  [<ffffffffc04a4580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 3688.132006]  [<ffffffffa226938f>] evict+0xbf/0x1a0
[ 3688.132006]  [<ffffffffa22694a8>] dispose_list+0x38/0x70
[ 3688.132006]  [<ffffffffa226a961>] prune_icache_sb+0x61/0x90
[ 3688.132006]  [<ffffffffa224a398>] super_cache_scan+0x278/0x450
[ 3688.132006]  [<ffffffffa21f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 3688.132006]  [<ffffffffa21de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 3688.132006]  [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3688.132006]  [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3688.132006]  [<ffffffffa21e118b>] do_try_to_free_pages+0x16b/0x490
[ 3688.132006]  [<ffffffffa21e1591>] try_to_free_pages+0xe1/0x1c0
[ 3688.132006]  [<ffffffffa21d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 3688.132006]  [<ffffffffa222eed5>] khugepaged+0x475/0x16a0
[ 3688.132006]  [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.132006]  [<ffffffffa222ea60>] ? maybe_pmd_mkwrite+0x40/0x40
[ 3688.132006]  [<ffffffffa222ea60>] ? maybe_pmd_mkwrite+0x40/0x40
[ 3688.132006]  [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3688.132006]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.132006]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0

Most of the other processes are like this (kthreadd, init, python, find, htop ...):

[ 3688.586051] python3.4       D ffff88003cd45bb0     8  3210  20765 0x00000000
[ 3688.586051]  ffffc900005bb3e8 0000000000000086 ffffffffa2100e10 ffffc900005bb428
[ 3688.586051]  ffffc900007e33c8 ffff88003e22cb40 ffff88003cd45600 ffffc900005bb3c8
[ 3688.586051]  ffffffffc000ca27 87e75cd14186f159 ffff88003cd46078 ffff88003d32c290
[ 3688.586051] Call Trace:
[ 3688.586051]  [<ffffffffa2100e10>] ? get_parent_ip+0x10/0x50
[ 3688.586051]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 3688.586051]  [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.586051]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 3688.586051]  [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.586051]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 3688.586051]  [<ffffffffc041d1d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 3688.586051]  [<ffffffffc03c2fe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 3688.586051]  [<ffffffffc03c30be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 3688.586051]  [<ffffffffc0464fab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 3688.586051]  [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3688.586051]  [<ffffffffc0488c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 3688.586051]  [<ffffffffc04a4580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 3688.586051]  [<ffffffffa226938f>] evict+0xbf/0x1a0
[ 3688.586051]  [<ffffffffa22694a8>] dispose_list+0x38/0x70
[ 3688.586051]  [<ffffffffa226a961>] prune_icache_sb+0x61/0x90
[ 3688.586051]  [<ffffffffa224a398>] super_cache_scan+0x278/0x450
[ 3688.586051]  [<ffffffffa21f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 3688.586051]  [<ffffffffa21de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 3688.586051]  [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3688.586051]  [<ffffffffa251128d>] ? alloc_indirect.isra.3+0x2d/0x70
[ 3688.586051]  [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3688.586051]  [<ffffffffa21e118b>] do_try_to_free_pages+0x16b/0x490
[ 3688.586051]  [<ffffffffa21dc60c>] ? pfmemalloc_watermark_ok+0xbc/0xf0
[ 3688.586051]  [<ffffffffa21e1591>] try_to_free_pages+0xe1/0x1c0
[ 3688.586051]  [<ffffffffa21d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 3688.586051]  [<ffffffffa282311e>] ? _raw_spin_unlock_irqrestore+0x2e/0x50
[ 3688.586051]  [<ffffffffa21d7940>] __do_page_cache_readahead+0x130/0x290
[ 3688.586051]  [<ffffffffa21c91dc>] ? find_get_entry+0x6c/0xa0
[ 3688.586051]  [<ffffffffa21c93ad>] ? pagecache_get_page+0x2d/0x1c0
[ 3688.586051]  [<ffffffffa21cbc53>] filemap_fault+0x3a3/0x460
[ 3688.586051]  [<ffffffffa21fb41f>] __do_fault+0x7f/0x130
[ 3688.586051]  [<ffffffffa21fff05>] handle_mm_fault+0xb75/0x1750
[ 3688.586051]  [<ffffffffa204d930>] __do_page_fault+0x220/0x6d0
[ 3688.586051]  [<ffffffffa204de99>] trace_do_page_fault+0x49/0x150
[ 3688.586051]  [<ffffffffa2047dac>] do_async_page_fault+0x2c/0xa0
[ 3688.586051]  [<ffffffffa2825c68>] async_page_fault+0x28/0x30

How can I help debugging?
-- Jason

@behlendorf
Copy link
Contributor

@perfinion it appears that the I/O pipeline is stalled waiting on a kernel thread to start in order to handle an outstanding I/O/. Presumably because you've done an excellent job exhausting memory on your system, although that's not 100% clear from the stack traces. Here are the key bits:

txg_sync -> waiting in zio_wait() for an I/O to complete, normally handled by a z_* pipeline thread.
spl_dynamic_task -> waiting in kthread_create_on_node() for confirmation that a thread it requested be started is now running.

If you still have the system in this start it would be useful to see what kthreadd is doing. It's responsible for signalling when the new thread has been successfully started, so it would be useful to know why exactly it hasn't.

Setting spl_taskq_thread_dynamic=0 as a work around should prevent this deadlock by leaving those threads running all the time. That way there's no need to start and stop them. We'll need to think of some way to harden the dynamic behavior further. @tuxoko or @dweeezil may have some ideas here.

@perfinion
Copy link
Contributor Author

I have the VM unchanged so I can get any more info that would help. Once it locks up the machine is completely unusable, not only things that hit disk, but absolutely everything hangs and then the hung tasks panic reboots the machine.

I am also attaching the full dmesg output from the hung_tasks_timeout because it has a z_* thread too. The earlier stacks were mostly from sysrq+W to print stacks for all processes in the D state. In future I could do sysrq+t to dump everything if that helps. Other than magic sysreq, I can only really do things from outside of qemu cuz nothing inside works once it starts dying.

I will try and trigger it later with spl_taskq_thread_dynamic=0 to verify that fixes it. right now I have absolutely nothing set so everything is default.

kthreadd's stacktrace:

[ 3688.132006] kthreadd        D ffff88003fd12140 11200     2      0 0x00000000
[ 3688.132006]  ffffc90000013598 0000000000000046 ffffffffa2100e10 0000000000000000
[ 3688.132006]  0000000000000002 ffff880029622b00 ffff88003e108ac0 0000000000000001
[ 3688.132006]  ffffffffc000ca27 d9f62eed1d5aee64 ffff88003e109538 ffff88003d32c290
[ 3688.132006] Call Trace:
[ 3688.132006]  [<ffffffffa2100e10>] ? get_parent_ip+0x10/0x50
[ 3688.132006]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006]  [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.132006]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006]  [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.132006]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 3688.132006]  [<ffffffffc041d1d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 3688.132006]  [<ffffffffc03c2fe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 3688.132006]  [<ffffffffc03c30be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 3688.132006]  [<ffffffffc0464fab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 3688.132006]  [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3688.132006]  [<ffffffffc0488c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 3688.132006]  [<ffffffffc04a4580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 3688.132006]  [<ffffffffa226938f>] evict+0xbf/0x1a0
[ 3688.132006]  [<ffffffffa22694a8>] dispose_list+0x38/0x70
[ 3688.132006]  [<ffffffffa226a961>] prune_icache_sb+0x61/0x90
[ 3688.132006]  [<ffffffffa224a398>] super_cache_scan+0x278/0x450
[ 3688.132006]  [<ffffffffa21f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 3688.132006]  [<ffffffffa21de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 3688.132006]  [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3688.132006]  [<ffffffffa21d9cea>] ? add_page_to_unevictable_list+0xaa/0xc0
[ 3688.132006]  [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3688.132006]  [<ffffffffa21e118b>] do_try_to_free_pages+0x16b/0x490
[ 3688.132006]  [<ffffffffa21d191a>] ? get_page_from_freelist+0xea/0x6d0
[ 3688.132006]  [<ffffffffa21e1591>] try_to_free_pages+0xe1/0x1c0
[ 3688.132006]  [<ffffffffa21d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 3688.132006]  [<ffffffffa2105e12>] ? sched_move_task+0x32/0x250
[ 3688.132006]  [<ffffffffa21d2db4>] alloc_kmem_pages_node+0x44/0xf0
[ 3688.132006]  [<ffffffffa20ca131>] copy_process+0x261/0x2610
[ 3688.132006]  [<ffffffffa2824870>] ? retint_kernel+0x22/0x2c
[ 3688.132006]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006]  [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3688.132006]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006]  [<ffffffffa20cc688>] _do_fork+0xa8/0x3e0
[ 3688.132006]  [<ffffffffa281d613>] ? __schedule+0x5a3/0xc10
[ 3688.132006]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006]  [<ffffffffa20cc9f4>] kernel_thread+0x34/0x50
[ 3688.132006]  [<ffffffffa20f5445>] kthreadd+0x145/0x190
[ 3688.132006]  [<ffffffffa20f5300>] ? kthread_create_on_cpu+0x70/0x70
[ 3688.132006]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.132006]  [<ffffffffa20f5300>] ? kthread_create_on_cpu+0x70/0x70

This is the full output from the point when the hung_tasks_timeout kicks in:

[ 3840.183025] INFO: task init:1 blocked for more than 120 seconds.
[ 3840.186070]       Tainted: P           O    4.3.3-hardened-r4 #5
[ 3840.188901] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3840.192149] init            D ffff88003e1085b0     8     1      0 0x00000000
[ 3840.194532]  ffffc9000000b3d8 0000000000000086 ffffffffa2100e10 ffffc9000000b418
[ 3840.196355]  ffffc9000041b8c8 ffff88003d9fcb40 ffff88003e108000 ffffc9000000b3b8
[ 3840.198445]  ffffffffc000ca27 725cb22d39b5ad1c ffff88003e108a78 ffff88003d32c290
[ 3840.201775] Call Trace:
[ 3840.203391]  [<ffffffffa2100e10>] ? get_parent_ip+0x10/0x50
[ 3840.205474]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 3840.207856]  [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3840.209675]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 3840.211237]  [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3840.212601]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 3840.214369]  [<ffffffffc041d1d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 3840.215928]  [<ffffffffc03c2fe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 3840.217440]  [<ffffffffc03c30be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 3840.218978]  [<ffffffffc0464fab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 3840.221112]  [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3840.223576]  [<ffffffffc0488c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 3840.225914]  [<ffffffffc04a4580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 3840.228597]  [<ffffffffa226938f>] evict+0xbf/0x1a0
[ 3840.230685]  [<ffffffffa22694a8>] dispose_list+0x38/0x70
[ 3840.232765]  [<ffffffffa226a961>] prune_icache_sb+0x61/0x90
[ 3840.234887]  [<ffffffffa224a398>] super_cache_scan+0x278/0x450
[ 3840.237144]  [<ffffffffa21f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 3840.239768]  [<ffffffffa21de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 3840.241881]  [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3840.244051]  [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3840.246096]  [<ffffffffa21e118b>] do_try_to_free_pages+0x16b/0x490
[ 3840.248417]  [<ffffffffa21dc60c>] ? pfmemalloc_watermark_ok+0xbc/0xf0
[ 3840.250670]  [<ffffffffa21e1591>] try_to_free_pages+0xe1/0x1c0
[ 3840.252953]  [<ffffffffa21d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 3840.254742]  [<ffffffffa21d7940>] __do_page_cache_readahead+0x130/0x290
[ 3840.257130]  [<ffffffffa21c91dc>] ? find_get_entry+0x6c/0xa0
[ 3840.258995]  [<ffffffffa21c93ad>] ? pagecache_get_page+0x2d/0x1c0
[ 3840.261129]  [<ffffffffa21cbc53>] filemap_fault+0x3a3/0x460
[ 3840.263339]  [<ffffffffa21fb41f>] __do_fault+0x7f/0x130
[ 3840.265459]  [<ffffffffa21fff05>] handle_mm_fault+0xb75/0x1750
[ 3840.267751]  [<ffffffffa204d930>] __do_page_fault+0x220/0x6d0
[ 3840.269739]  [<ffffffffa204de99>] trace_do_page_fault+0x49/0x150
[ 3840.271995]  [<ffffffffa2047dac>] do_async_page_fault+0x2c/0xa0
[ 3840.274271]  [<ffffffffa2825c68>] async_page_fault+0x28/0x30
[ 3840.276632] Sending NMI to all CPUs:
[ 3840.279229] NMI backtrace for cpu 0
[ 3840.280730] CPU: 0 PID: 20708 Comm: z_wr_int_1 Tainted: P           O    4.3.3-hardened-r4 #5
[ 3840.283954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 3840.288179] task: ffff88003c430000 ti: ffff88003c430a78 task.ti: ffff88003c430a78
[ 3840.291176] RIP: 0010:[<ffffffffa211d3fa>]  [<ffffffffa211d3fa>] do_raw_spin_lock+0x1a/0x130
[ 3840.294765] RSP: 0000:ffffc90000283380  EFLAGS: 00000246
[ 3840.297178] RAX: ffffffffa21f7bd6 RBX: ffff88003e00c880 RCX: 0000000000000000
[ 3840.300101] RDX: ffff88003c430000 RSI: 0000000000000000 RDI: ffff88003e00c880
[ 3840.302874] RBP: ffffc900002833a0 R08: 28f5c28f5c28f5c3 R09: 28f5c28f5c28f5c3
[ 3840.305239] R10: 0000000000000000 R11: 0000000000000400 R12: 0000000000000000
[ 3840.308015] R13: ffffffffffffffff R14: 0000000000000000 R15: 0000000000000000
[ 3840.310648] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 3840.313657] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3840.316010] CR2: 000003938ee1ab00 CR3: 0000000022834000 CR4: 00000000001606f0
[ 3840.319458] Stack:
[ 3840.320397]  ffff88003e00c880 0000000000000000 ffffffffffffffff 0000000000000000
[ 3840.323610]  ffffc900002833b8 ffffffffa282303f ffff88003e00c880 ffffc900002833e0
[ 3840.326440]  ffffffffa21f7bd6 0000000000000000 ffff88003e0195c0 0000000000000000
[ 3840.329039] Call Trace:
[ 3840.329949]  [<ffffffffa282303f>] _raw_spin_lock+0x1f/0x30
[ 3840.332366]  [<ffffffffa21f7bd6>] __list_lru_count_one.isra.2+0x26/0x80
[ 3840.334428]  [<ffffffffa21f7c63>] list_lru_count_one+0x33/0x40
[ 3840.336595]  [<ffffffffa2248db0>] super_cache_count+0x80/0xf0
[ 3840.338456]  [<ffffffffa21de21e>] shrink_slab.part.40+0x16e/0x610
[ 3840.340760]  [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3840.343015]  [<ffffffffa2162224>] ? css_next_descendant_pre+0x24/0x60
[ 3840.345145]  [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3840.347340]  [<ffffffffa21e118b>] do_try_to_free_pages+0x16b/0x490
[ 3840.349651]  [<ffffffffa21e1591>] try_to_free_pages+0xe1/0x1c0
[ 3840.352125]  [<ffffffffa21d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 3840.354710]  [<ffffffffa22267ed>] new_slab+0x31d/0x430
[ 3840.356673]  [<ffffffffa282311e>] ? _raw_spin_unlock_irqrestore+0x2e/0x50
[ 3840.359313]  [<ffffffffa222856c>] __slab_alloc.constprop.70+0x34c/0x460
[ 3840.361975]  [<ffffffffa2821278>] ? mutex_lock+0x28/0x40
[ 3840.363886]  [<ffffffffc0003749>] ? spl_kmem_cache_alloc+0xa9/0xc80 [spl]
[ 3840.366388]  [<ffffffffa2100ea4>] ? preempt_count_add+0x54/0xa0
[ 3840.368560]  [<ffffffffa28210ca>] ? __mutex_lock_slowpath+0x21a/0x3a0
[ 3840.371065]  [<ffffffffa2821278>] ? mutex_lock+0x28/0x40
[ 3840.373175]  [<ffffffffa2229c97>] kmem_cache_alloc+0x1e7/0x240
[ 3840.374868]  [<ffffffffc0003749>] ? spl_kmem_cache_alloc+0xa9/0xc80 [spl]
[ 3840.377382]  [<ffffffffc0003749>] spl_kmem_cache_alloc+0xa9/0xc80 [spl]
[ 3840.379891]  [<ffffffffa28230cf>] ? _raw_spin_unlock+0x1f/0x40
[ 3840.382420]  [<ffffffffc047a76a>] zio_create+0x8a/0x660 [zfs]
[ 3840.384796]  [<ffffffffc047b97b>] zio_free_sync+0x11b/0x210 [zfs]
[ 3840.386817]  [<ffffffffc047c913>] zio_free.part.7+0x93/0xf0 [zfs]
[ 3840.389197]  [<ffffffffc047c99b>] zio_free+0x2b/0x40 [zfs]
[ 3840.391378]  [<ffffffffc03ead15>] dsl_free+0x25/0x40 [zfs]
[ 3840.393404]  [<ffffffffc03ce71e>] dsl_dataset_block_kill+0x31e/0x8b0 [zfs]
[ 3840.395408]  [<ffffffffc038b4c4>] ? arc_change_state+0x3b4/0x840 [zfs]
[ 3840.397154]  [<ffffffffc03a77ae>] dbuf_write_done+0x43e/0x860 [zfs]
[ 3840.398851]  [<ffffffffc0390be6>] arc_write_done+0xd6/0x700 [zfs]
[ 3840.400470]  [<ffffffffc04802c5>] zio_done+0x475/0x18c0 [zfs]
[ 3840.402014]  [<ffffffffc00013c1>] ? spl_kmem_free+0x31/0x50 [spl]
[ 3840.403629]  [<ffffffffc042d1b4>] ? vdev_mirror_map_free+0x24/0x30 [zfs]
[ 3840.405370]  [<ffffffffc0480a7b>] zio_done+0xc2b/0x18c0 [zfs]
[ 3840.406949]  [<ffffffffc0480a7b>] zio_done+0xc2b/0x18c0 [zfs]
[ 3840.408481]  [<ffffffffc0478b6d>] zio_execute+0xed/0x2f0 [zfs]
[ 3840.410015]  [<ffffffffc0006bb2>] taskq_thread+0x2a2/0x5c0 [spl]
[ 3840.411591]  [<ffffffffa20ffb10>] ? wake_up_q+0x80/0x80
[ 3840.412991]  [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 3840.415567]  [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3840.417502]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3840.419842]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3840.421999]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3840.424294] Code: 25 bd ee 5d e8 6e a3 0a 00 e8 68 a9 33 00 eb bc 90 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 48 89 fb 81 7b 04 ad 4e ad de <0f> 85 bf 00 00 00 65 48 8b 04 25 80 9c 00 00 48 39 43 10 0f 84 
[ 3840.431241] NMI backtrace for cpu 1
[ 3840.432805] CPU: 1 PID: 34 Comm: khungtaskd Tainted: P           O    4.3.3-hardened-r4 #5
[ 3840.435788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 3840.439461] task: ffff88003e2b9580 ti: ffff88003e2b9ff8 task.ti: ffff88003e2b9ff8
[ 3840.442296] RIP: 0010:[<ffffffffa20435c9>]  [<ffffffffa20435c9>] flat_send_IPI_mask+0x79/0xc0
[ 3840.445148] RSP: 0000:ffffc90000123da0  EFLAGS: 00000046
[ 3840.447425] RAX: 0000000000000000 RBX: 0000000000000c00 RCX: 0000000000000000
[ 3840.449795] RDX: 0000000000000c00 RSI: 0000000000000002 RDI: 0000000000000286
[ 3840.452726] RBP: ffffc90000123dc0 R08: 0000000000000000 R09: 0000000000000000
[ 3840.455461] R10: ffffc90000008000 R11: ffff88003c4b6f00 R12: 0000000000000007
[ 3840.458220] R13: 0000000000000002 R14: 00000000000003ff R15: 00000000003fffff
[ 3840.460539] FS:  0000000000000000(0000) GS:ffff88003fc80000(0000) knlGS:0000000000000000
[ 3840.463425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3840.465811] CR2: 00000398d34db620 CR3: 0000000022836000 CR4: 00000000001606f0
[ 3840.468270] Stack:
[ 3840.468993]  0000000000000286 ffffffffa2cc8a20 0000000000000001 ffffffffa203fe80
[ 3840.471389]  ffffc90000123dd8 ffffffffa203feaf 000000000000f100 ffffc90000123e18
[ 3840.473573]  ffffffffa245ccb4 0000000000000000 0000000000000001 ffff88003e108000
[ 3840.475532] Call Trace:
[ 3840.476356]  [<ffffffffa203fe80>] ? irq_force_complete_move+0x40/0x40
[ 3840.478089]  [<ffffffffa203feaf>] nmi_raise_cpu_backtrace+0x2f/0x40
[ 3840.479835]  [<ffffffffa245ccb4>] nmi_trigger_all_cpu_backtrace+0x254/0x260
[ 3840.481747]  [<ffffffffa203ff3b>] arch_trigger_all_cpu_backtrace+0x1b/0x30
[ 3840.483610]  [<ffffffffa217cc00>] watchdog+0x320/0x3a0
[ 3840.485043]  [<ffffffffa217c8e0>] ? reset_hung_task_detector+0x20/0x20
[ 3840.486795]  [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3840.488212]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3840.489955]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3840.491428]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3840.493183] Code: c3 5f ff 80 e6 10 75 f2 44 89 e2 c1 e2 18 89 14 25 10 c3 5f ff 44 89 ea 09 da 80 cf 04 41 83 fd 02 0f 44 d3 89 14 25 00 c3 5f ff <57> 9d 0f 1f 44 00 00 48 83 c4 08 5b 41 5c 41 5d 5d 48 0f ba 2c 
[ 3840.498952] NMI backtrace for cpu 2
[ 3840.500535] CPU: 2 PID: 20705 Comm: z_wr_iss Tainted: P           O    4.3.3-hardened-r4 #5
[ 3840.503730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 3840.508015] task: ffff88003e2beb80 ti: ffff88003e2bf5f8 task.ti: ffff88003e2bf5f8
[ 3840.510738] RIP: 0010:[<ffffffffa211d432>]  [<ffffffffa211d432>] do_raw_spin_lock+0x52/0x130
[ 3840.514231] RSP: 0000:ffffc9000026b480  EFLAGS: 00000246
[ 3840.516521] RAX: 0000000000000000 RBX: ffff88003db23bc0 RCX: 0000000000000000
[ 3840.519215] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88003db23bc0
[ 3840.521796] RBP: ffffc9000026b4a0 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
[ 3840.524350] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 3840.527253] R13: ffffffffffffffff R14: 0000000000000000 R15: 0000000000000000
[ 3840.530079] FS:  0000000000000000(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[ 3840.533213] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3840.535320] CR2: 00000072ed100648 CR3: 0000000022838000 CR4: 00000000001606f0
[ 3840.537490] Stack:
[ 3840.538261]  ffff88003db23bc0 0000000000000000 ffffffffffffffff 0000000000000000
[ 3840.540295]  ffffc9000026b4b8 ffffffffa282303f ffff88003db23bc0 ffffc9000026b4e0
[ 3840.542279]  ffffffffa21f7bd6 0000000000000000 ffff88003cd815c0 0000000000000000
[ 3840.544255] Call Trace:
[ 3840.545094]  [<ffffffffa282303f>] _raw_spin_lock+0x1f/0x30
[ 3840.546636]  [<ffffffffa21f7bd6>] __list_lru_count_one.isra.2+0x26/0x80
[ 3840.548767]  [<ffffffffa21f7c63>] list_lru_count_one+0x33/0x40
[ 3840.550409]  [<ffffffffa2248db0>] super_cache_count+0x80/0xf0
[ 3840.552008]  [<ffffffffa21de21e>] shrink_slab.part.40+0x16e/0x610
[ 3840.554466]  [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3840.556233]  [<ffffffffa2162224>] ? css_next_descendant_pre+0x24/0x60
[ 3840.558006]  [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3840.559533]  [<ffffffffa21e118b>] do_try_to_free_pages+0x16b/0x490
[ 3840.561223]  [<ffffffffa21e1591>] try_to_free_pages+0xe1/0x1c0
[ 3840.562921]  [<ffffffffa21d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 3840.564922]  [<ffffffffa22267ed>] new_slab+0x31d/0x430
[ 3840.566540]  [<ffffffffa222856c>] __slab_alloc.constprop.70+0x34c/0x460
[ 3840.568335]  [<ffffffffc0003749>] ? spl_kmem_cache_alloc+0xa9/0xc80 [spl]
[ 3840.570151]  [<ffffffffa2229c97>] kmem_cache_alloc+0x1e7/0x240
[ 3840.571783]  [<ffffffffc0003749>] ? spl_kmem_cache_alloc+0xa9/0xc80 [spl]
[ 3840.573891]  [<ffffffffc0003749>] spl_kmem_cache_alloc+0xa9/0xc80 [spl]
[ 3840.575766]  [<ffffffffc042dd47>] ? vdev_queue_io_to_issue+0x157/0xcb0 [zfs]
[ 3840.577827]  [<ffffffffc0414a9e>] ? spa_config_held+0x8e/0xb0 [zfs]
[ 3840.579675]  [<ffffffffc047a76a>] zio_create+0x8a/0x660 [zfs]
[ 3840.581627]  [<ffffffffc047c1d2>] zio_vdev_child_io+0xf2/0x190 [zfs]
[ 3840.583499]  [<ffffffffc042ca30>] ? vdev_config_sync+0x1e0/0x1e0 [zfs]
[ 3840.585321]  [<ffffffffc042d5bf>] vdev_mirror_io_start+0xaf/0x1d0 [zfs]
[ 3840.587130]  [<ffffffffc042ca30>] ? vdev_config_sync+0x1e0/0x1e0 [zfs]
[ 3840.588953]  [<ffffffffc047f283>] zio_vdev_io_start+0xe3/0x4d0 [zfs]
[ 3840.590701]  [<ffffffffc047ca86>] zio_nowait+0xd6/0x2b0 [zfs]
[ 3840.592268]  [<ffffffffc042d5c7>] vdev_mirror_io_start+0xb7/0x1d0 [zfs]
[ 3840.594048]  [<ffffffffc042ca30>] ? vdev_config_sync+0x1e0/0x1e0 [zfs]
[ 3840.595839]  [<ffffffffc047f415>] zio_vdev_io_start+0x275/0x4d0 [zfs]
[ 3840.597852]  [<ffffffffc0478b6d>] zio_execute+0xed/0x2f0 [zfs]
[ 3840.599519]  [<ffffffffc0006bb2>] taskq_thread+0x2a2/0x5c0 [spl]
[ 3840.601197]  [<ffffffffa20ffb10>] ? wake_up_q+0x80/0x80
[ 3840.602697]  [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 3840.604589]  [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3840.606019]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3840.607790]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3840.609273]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3840.611069] Code: 00 00 48 39 43 10 0f 84 c0 00 00 00 65 8b 05 de bc ee 5d 39 43 08 0f 84 c4 00 00 00 8b 03 85 c0 75 0d ba 01 00 00 00 f0 0f b1 13 <85> c0 74 63 4c 69 35 8f 70 ae 00 e8 03 00 00 45 31 e4 41 bd 01 
[ 3840.617098] Kernel panic - not syncing: hung_task: blocked tasks
[ 3840.618010] CPU: 1 PID: 34 Comm: khungtaskd Tainted: P           O    4.3.3-hardened-r4 #5
[ 3840.618010] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 3840.618010]  ffff880000000002 23d7b990cd837547 0000000000000000 ffffc90000123e40
[ 3840.618010]  ffffc90000123da0 ffffffffa2457d93 ffffffffa2a728a9 ffffc90000123e30
[ 3840.618010]  ffffffffa21c7497 0000000000000008 ffffc90000123e40 ffffc90000123dd0
[ 3840.618010] Call Trace:
[ 3840.618010]  [<ffffffffa2457d93>] dump_stack+0x4e/0x8b
[ 3840.618010]  [<ffffffffa21c7497>] panic+0xd7/0x221
[ 3840.618010]  [<ffffffffa217cc0e>] watchdog+0x32e/0x3a0
[ 3840.618010]  [<ffffffffa217c8e0>] ? reset_hung_task_detector+0x20/0x20
[ 3840.618010]  [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3840.618010]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3840.618010]  [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3840.618010]  [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3840.618010] Kernel Offset: 0x21000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3840.618010] Rebooting in 60 seconds..

@perfinion
Copy link
Contributor Author

According to gdb, spl_kmem_cache_alloc+0xa9/0xc80 [spl] in the z_wr_* stacks points to https://github.com/zfsonlinux/spl/blob/b4ad50ac5f16de9220452f0da493e67c060d701b/module/spl/spl-kmem-cache.c#L1419

Looks like they are stuck trying to allocate memory. I assume the OOM killer should kick in to free up memory but it isnt. When I try to trigger the hang, about half the time the OOM killer will kill my eatram processes and things will immediately return to normal so it takes a few tries.

@tuxoko
Copy link
Contributor

tuxoko commented Feb 8, 2016

It seems kthreadd doing direct reclaim into zfs causing deadlock.

Edit:
Unfortunately there doesn't seem to be a way to pass in GFP flags to kthreadd.

@behlendorf
Copy link
Contributor

Yes indeed, now what if anything can we do about it.

@perfinion
Copy link
Contributor Author

I just tried with spl_taskq_thread_dynamic=0, and was still able to hit this deadlock unfortunately :(.
tuxoko's comment made me think that would be the case. It was definitely a lot harder to hit tho, most of the time the OOM killer freed stuff up. I could frequently get it to hang for 5-10 seconds and then it would manage to finish and come back to life. but finally got a full hang so the machine rebooted. so =0 is not a complete workaround, it just lowers the likelyhood.

I dumped the stacks with sysrq+w and t. Would it be useful to see the stacktraces for the taskq_dynamic=0 case as well? I can post them if they help, if not they are quite long so dont want to spam the bug.

@behlendorf
Copy link
Contributor

@perfinion yes, I'd definitely like to see the stacks please add them to the issue. It's going to have to be something related too but slightly different from the previous deadlock which depended on the dynamic taskq behavior to deadlock.

@perfinion
Copy link
Contributor Author

@behlendorf this is the entire output, first sysctl+w, then t then the hung tasks panic stuff

Connected to domain zfstest
Escape character is ^]
[ 2017.485790] sysrq: SysRq : Show Blocked State
[ 2017.486750]   task                        PC stack   pid father
[ 2017.486750] khugepaged      D ffffffffba107ef0 11472    37      2 0x00000000
[ 2017.486750]  ffffc9000013b5d8 0000000000000046 ffffffffba100e10 ffffc9000013b618
[ 2017.486750]  ffffc90007fe3428 ffff88003e1e2b00 ffff88003e2bb5c0 ffffc9000013b5b8
[ 2017.486750]  ffffffffc000ca27 74b30fc52c25acd9 ffff88003e2bc038 ffff880033264290
[ 2017.486750] Call Trace:
[ 2017.486750]  [<ffffffffba100e10>] ? get_parent_ip+0x10/0x50
[ 2017.486750]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.486750]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.486750]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.486750]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.486750]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.486750]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.486750]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.486750]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.486750]  [<ffffffffc043dfab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 2017.486750]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.486750]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.486750]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.486750]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.486750]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.486750]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.486750]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.486750]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.486750]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.486750]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.486750]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.486750]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.486750]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.486750]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.486750]  [<ffffffffba22eed5>] khugepaged+0x475/0x16a0
[ 2017.486750]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.486750]  [<ffffffffba22ea60>] ? maybe_pmd_mkwrite+0x40/0x40
[ 2017.486750]  [<ffffffffba22ea60>] ? maybe_pmd_mkwrite+0x40/0x40
[ 2017.486750]  [<ffffffffba0f4a0a>] kthread+0x10a/0x120
[ 2017.486750]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.486750]  [<ffffffffba8240ce>] ret_from_fork+0x3e/0x70
[ 2017.486750]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.486750] kswapd0         D ffff88003da61070 12000   130      2 0x00000000
[ 2017.486750]  ffffc90000423888 0000000000000046 ffffffffba100e10 ffffc900004238c8
[ 2017.486750]  ffffc9000045b428 ffff88003c5c60c0 ffff88003da60ac0 ffffc90000423868
[ 2017.486750]  ffffffffc000ca27 30e2b1b6fb94ed93 ffff88003da61538 ffff880033264290
[ 2017.486750] Call Trace:
[ 2017.486750]  [<ffffffffba100e10>] ? get_parent_ip+0x10/0x50
[ 2017.486750]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.486750]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.486750]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.486750]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.486750]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.486750]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.486750]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.486750]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.486750]  [<ffffffffc043dfab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 2017.486750]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.486750]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.486750]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.486750]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.486750]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.486750]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.486750]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.486750]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.486750]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.486750]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.486750]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.486750]  [<ffffffffba1e1f5f>] kswapd+0x53f/0x9a0
[ 2017.486750]  [<ffffffffba1e1a20>] ? mem_cgroup_shrink_node_zone+0x200/0x200
[ 2017.486750]  [<ffffffffba0f4a0a>] kthread+0x10a/0x120
[ 2017.486750]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.486750]  [<ffffffffba8240ce>] ret_from_fork+0x3e/0x70
[ 2017.486750]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.486750] bash            D ffff88003fd12140     8  1200   1167 0x00000000
[ 2017.486750]  ffffc90003f83528 0000000000000086 ffff880033264298 0000000000000000
[ 2017.486750]  0000000000000002 ffff880000858000 ffff88003da65600 0000000000000001
[ 2017.486750]  ffffffffc000ca27 b65e54bbef5d0d55 ffff88003da66078 ffff880033264290
[ 2017.486750] Call Trace:
[ 2017.486750]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.486750]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.486750]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.486750]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.486750]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.486750]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.486750]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.486750]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.486750]  [<ffffffffc043dfab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 2017.486750]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.486750]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.486750]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.486750]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.486750]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.486750]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.486750]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.486750]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.486750]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.486750]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.486750]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.486750]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.486750]  [<ffffffffba1dc60c>] ? pfmemalloc_watermark_ok+0xbc/0xf0
[ 2017.486750]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba1d2db4>] alloc_kmem_pages_node+0x44/0xf0
[ 2017.700101]  [<ffffffffba0ca131>] copy_process+0x261/0x2610
[ 2017.700101]  [<ffffffffba0cc688>] _do_fork+0xa8/0x3e0
[ 2017.700101]  [<ffffffffba0ccaa7>] SyS_clone+0x37/0x50
[ 2017.700101]  [<ffffffffba823d30>] entry_SYSCALL_64_fastpath+0x12/0x8a
[ 2017.700101]  [<ffffffffba823d67>] ? entry_SYSCALL_64_fastpath+0x49/0x8a
[ 2017.700101] arc_prune       D ffff88003fd12140 12160  1310      2 0x00000000
[ 2017.700101]  ffffc900004639a8 0000000000000046 ffff880033264298 0000000000000000
[ 2017.700101]  0000000000000002 ffff88003374d600 ffff880036282040 0000000000000001
[ 2017.700101]  ffffffffc000ca27 3d9d3ad8c3384a8d ffff880036282ab8 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc043dfab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffc0434d12>] zfs_sb_prune+0x1a2/0x1b0 [zfs]
[ 2017.700101]  [<ffffffffc0461ddb>] zpl_prune_sb+0x3b/0x60 [zfs]
[ 2017.700101]  [<ffffffffc0461da0>] ? zpl_sync_fs+0xf0/0xf0 [zfs]
[ 2017.700101]  [<ffffffffc0365320>] arc_prune_task+0x30/0xb0 [zfs]
[ 2017.700101]  [<ffffffffc0006bb2>] taskq_thread+0x2a2/0x5c0 [spl]
[ 2017.700101]  [<ffffffffba0ffb10>] ? wake_up_q+0x80/0x80
[ 2017.700101]  [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 2017.700101]  [<ffffffffba0f4a0a>] kthread+0x10a/0x120
[ 2017.700101]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.700101]  [<ffffffffba8240ce>] ret_from_fork+0x3e/0x70
[ 2017.700101]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.700101] arc_reclaim     D ffffffffba107ef0 12912  1312      2 0x00000000
[ 2017.700101]  ffffc9000046bcc8 0000000000000046 0000000000000001 ffffffffbaa72569
[ 2017.700101]  0000000000000001 ffff88003e1e2040 ffff880036282b00 0000000000000046
[ 2017.700101]  ffffffffba81e0e8 8616e62add0b0fac ffff880036283578 00000000ffffffff
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffba81e0e8>] ? schedule_preempt_disabled+0x18/0x30
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffba81e0e8>] schedule_preempt_disabled+0x18/0x30
[ 2017.700101]  [<ffffffffba820fe8>] __mutex_lock_slowpath+0x138/0x3a0
[ 2017.700101]  [<ffffffffba821278>] mutex_lock+0x28/0x40
[ 2017.700101]  [<ffffffffc000c662>] __cv_timedwait_common+0x162/0x310 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000c87a>] __cv_timedwait_sig+0x2a/0x40 [spl]
[ 2017.700101]  [<ffffffffc036f714>] arc_reclaim_thread+0x174/0x410 [zfs]
[ 2017.700101]  [<ffffffffc036f5a0>] ? arc_shrink+0x130/0x130 [zfs]
[ 2017.700101]  [<ffffffffc00051d5>] thread_generic_wrapper+0x95/0xe0 [spl]
[ 2017.700101]  [<ffffffffc0005140>] ? __thread_exit+0x20/0x20 [spl]
[ 2017.700101]  [<ffffffffba0f4a0a>] kthread+0x10a/0x120
[ 2017.700101]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.700101]  [<ffffffffba8240ce>] ret_from_fork+0x3e/0x70
[ 2017.700101]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.700101] z_iput          D ffff88003fc92140 12192  2299      2 0x00000000
[ 2017.700101]  ffffc90003913a40 0000000000000046 ffff880033264298 0000000000000000
[ 2017.700101]  0000000000000001 ffff88003b4cb5c0 ffff88003c5c5600 0000000000000001
[ 2017.700101]  ffffffffc000ca27 bb8495ef514c5c64 ffff88003c5c6078 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc0387b93>] dmu_free_long_range+0x1e3/0x350 [zfs]
[ 2017.700101]  [<ffffffffc04218d0>] zfs_rmnode+0x2f0/0x440 [zfs]
[ 2017.700101]  [<ffffffffc044376b>] ? zfs_znode_hold_exit+0x15b/0x200 [zfs]
[ 2017.700101]  [<ffffffffc04477b2>] zfs_zinactive+0xf2/0x150 [zfs]
[ 2017.700101]  [<ffffffffc043de8b>] zfs_inactive+0x7b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26a03c>] iput+0x1cc/0x230
[ 2017.700101]  [<ffffffffc0006bb2>] taskq_thread+0x2a2/0x5c0 [spl]
[ 2017.700101]  [<ffffffffba0ffb10>] ? wake_up_q+0x80/0x80
[ 2017.700101]  [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 2017.700101]  [<ffffffffba0f4a0a>] kthread+0x10a/0x120
[ 2017.700101]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.700101]  [<ffffffffba8240ce>] ret_from_fork+0x3e/0x70
[ 2017.700101]  [<ffffffffba0f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2017.700101] python3.4       D ffff88003606f130     8  2382   1327 0x00000000
[ 2017.700101]  ffffc90007fe33e8 0000000000000086 ffffffffba100e10 ffffc90007fe3428
[ 2017.700101]  ffffc900004639e8 ffff880036069580 ffff88003606eb80 ffffc90007fe33c8
[ 2017.700101]  ffffffffc000ca27 efe104c02852668e ffff88003606f5f8 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffba100e10>] ? get_parent_ip+0x10/0x50
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc043dfab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.700101]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.700101]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.700101]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.700101]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.700101]  [<ffffffffba1dc60c>] ? pfmemalloc_watermark_ok+0xbc/0xf0
[ 2017.700101]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba180ff0>] ? __delayacct_blkio_end+0x30/0x50
[ 2017.700101]  [<ffffffffba1d7940>] __do_page_cache_readahead+0x130/0x290
[ 2017.700101]  [<ffffffffba1c91dc>] ? find_get_entry+0x6c/0xa0
[ 2017.700101]  [<ffffffffba1c93ad>] ? pagecache_get_page+0x2d/0x1c0
[ 2017.700101]  [<ffffffffba1cbc53>] filemap_fault+0x3a3/0x460
[ 2017.700101]  [<ffffffffba1fb41f>] __do_fault+0x7f/0x130
[ 2017.700101]  [<ffffffffba1fff05>] handle_mm_fault+0xb75/0x1750
[ 2017.700101]  [<ffffffffba04d930>] __do_page_fault+0x220/0x6d0
[ 2017.700101]  [<ffffffffba04de99>] trace_do_page_fault+0x49/0x150
[ 2017.700101]  [<ffffffffba047dac>] do_async_page_fault+0x2c/0xa0
[ 2017.700101]  [<ffffffffba825c68>] async_page_fault+0x28/0x30
[ 2017.700101] rsync           D ffff88003fd12140     8  2507   2506 0x00000000
[ 2017.700101]  ffffc9000041b310 0000000000000086 ffff880033264298 0000000000000000
[ 2017.700101]  0000000000000002 ffff88003b4cab00 ffff880000858000 0000000000000001
[ 2017.700101]  ffffffffc000ca27 9f5ecd7531d692c0 ffff880000858a78 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc04214e7>] zfs_purgedir+0x177/0x270 [zfs]
[ 2017.700101]  [<ffffffffba8210ca>] ? __mutex_lock_slowpath+0x21a/0x3a0
[ 2017.700101]  [<ffffffffba8230cf>] ? _raw_spin_unlock+0x1f/0x40
[ 2017.700101]  [<ffffffffc04433b0>] ? zfs_znode_held+0x100/0x170 [zfs]
[ 2017.700101]  [<ffffffffba821278>] ? mutex_lock+0x28/0x40
[ 2017.700101]  [<ffffffffba100e10>] ? get_parent_ip+0x10/0x50
[ 2017.700101]  [<ffffffffba100ea4>] ? preempt_count_add+0x54/0xa0
[ 2017.700101]  [<ffffffffba8210ca>] ? __mutex_lock_slowpath+0x21a/0x3a0
[ 2017.700101]  [<ffffffffba47f6d0>] ? __list_del_entry+0x10/0x30
[ 2017.700101]  [<ffffffffba47f700>] ? list_del+0x10/0x30
[ 2017.700101]  [<ffffffffc03d2f25>] ? refcount_destroy_many+0xd5/0x220 [zfs]
[ 2017.700101]  [<ffffffffba22780b>] ? kmem_cache_free+0x3b/0x250
[ 2017.700101]  [<ffffffffc04218a8>] zfs_rmnode+0x2c8/0x440 [zfs]
[ 2017.700101]  [<ffffffffc044376b>] ? zfs_znode_hold_exit+0x15b/0x200 [zfs]
[ 2017.700101]  [<ffffffffc04477b2>] zfs_zinactive+0xf2/0x150 [zfs]
[ 2017.700101]  [<ffffffffc043de8b>] zfs_inactive+0x7b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.700101]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.700101]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.700101]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.700101]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.700101]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba200392>] handle_mm_fault+0x1002/0x1750
[ 2017.700101]  [<ffffffffba04d930>] __do_page_fault+0x220/0x6d0
[ 2017.700101]  [<ffffffffba04de99>] trace_do_page_fault+0x49/0x150
[ 2017.700101]  [<ffffffffba047dac>] do_async_page_fault+0x2c/0xa0
[ 2017.700101]  [<ffffffffba825c68>] async_page_fault+0x28/0x30
[ 2017.700101] htop            D ffff88003374bb70     8  2587   1320 0x00000000
[ 2017.700101]  ffffc900000e3428 0000000000000086 ffff880033264298 0000000000000246
[ 2017.700101]  ffffc90007fe3410 ffff880036282b00 ffff88003374b5c0 ffffffffba0ffa73
[ 2017.700101]  ffffffffc000ca27 29677d45d0ffc2b8 ffff88003374c038 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffba0ffa73>] ? wake_up_process+0x23/0x40
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc043dfab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.700101]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.700101]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.700101]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.700101]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.700101]  [<ffffffffba1dc60c>] ? pfmemalloc_watermark_ok+0xbc/0xf0
[ 2017.700101]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba1d7940>] __do_page_cache_readahead+0x130/0x290
[ 2017.700101]  [<ffffffffba1c91dc>] ? find_get_entry+0x6c/0xa0
[ 2017.700101]  [<ffffffffba1c93ad>] ? pagecache_get_page+0x2d/0x1c0
[ 2017.700101]  [<ffffffffba1cbc53>] filemap_fault+0x3a3/0x460
[ 2017.700101]  [<ffffffffba1fb41f>] __do_fault+0x7f/0x130
[ 2017.700101]  [<ffffffffba1fff05>] handle_mm_fault+0xb75/0x1750
[ 2017.700101]  [<ffffffffba04d930>] __do_page_fault+0x220/0x6d0
[ 2017.700101]  [<ffffffffba04de99>] trace_do_page_fault+0x49/0x150
[ 2017.700101]  [<ffffffffba047dac>] do_async_page_fault+0x2c/0xa0
[ 2017.700101]  [<ffffffffba825c68>] async_page_fault+0x28/0x30
[ 2017.700101] rm              D ffff88003b4cb0b0     8  2619   1200 0x00000000
[ 2017.700101]  ffffc90000023b10 0000000000000086 ffff880033264298 0000000000000246
[ 2017.700101]  ffffc90007fe3410 ffff880036282040 ffff88003b4cab00 ffffffffba0ffa73
[ 2017.700101]  ffffffffc000ca27 1545fc70a7c98c78 ffff88003b4cb578 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffba0ffa73>] ? wake_up_process+0x23/0x40
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc0440699>] zfs_remove+0x189/0xa00 [zfs]
[ 2017.700101]  [<ffffffffba100e0a>] ? get_parent_ip+0xa/0x50
[ 2017.700101]  [<ffffffffc0460118>] zpl_unlink+0x68/0x170 [zfs]
[ 2017.700101]  [<ffffffffba257a71>] vfs_unlink+0x111/0x1a0
[ 2017.700101]  [<ffffffffba25aab1>] do_unlinkat+0x281/0x350
[ 2017.700101]  [<ffffffffba25b78f>] SyS_unlinkat+0x2f/0x50
[ 2017.700101]  [<ffffffffba823d30>] entry_SYSCALL_64_fastpath+0x12/0x8a
[ 2017.700101] rsync           D ffff88003fc92140     8  2620   1200 0x00000000
[ 2017.700101]  ffffc900000333c8 0000000000000086 ffff880033264298 0000000000000000
[ 2017.700101]  0000000000000001 ffff88003374eb80 ffff88003b4cb5c0 0000000000000001
[ 2017.700101]  ffffffffc000ca27 7363054d6b7c2ef5 ffff88003b4cc038 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc0421716>] zfs_rmnode+0x136/0x440 [zfs]
[ 2017.700101]  [<ffffffffc044376b>] ? zfs_znode_hold_exit+0x15b/0x200 [zfs]
[ 2017.700101]  [<ffffffffc04477b2>] zfs_zinactive+0xf2/0x150 [zfs]
[ 2017.700101]  [<ffffffffc043de8b>] zfs_inactive+0x7b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.700101]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.700101]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.700101]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.700101]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.700101]  [<ffffffffba1dc60c>] ? pfmemalloc_watermark_ok+0xbc/0xf0
[ 2017.700101]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba1d7940>] __do_page_cache_readahead+0x130/0x290
[ 2017.700101]  [<ffffffffba1c91dc>] ? find_get_entry+0x6c/0xa0
[ 2017.700101]  [<ffffffffba1c93ad>] ? pagecache_get_page+0x2d/0x1c0
[ 2017.700101]  [<ffffffffba1cbc53>] filemap_fault+0x3a3/0x460
[ 2017.700101]  [<ffffffffba1fb41f>] __do_fault+0x7f/0x130
[ 2017.700101]  [<ffffffffba1fff05>] handle_mm_fault+0xb75/0x1750
[ 2017.700101]  [<ffffffffba04d930>] __do_page_fault+0x220/0x6d0
[ 2017.700101]  [<ffffffffba823d67>] ? entry_SYSCALL_64_fastpath+0x49/0x8a
[ 2017.700101]  [<ffffffffba04de99>] trace_do_page_fault+0x49/0x150
[ 2017.700101]  [<ffffffffba047dac>] do_async_page_fault+0x2c/0xa0
[ 2017.700101]  [<ffffffffba825c68>] async_page_fault+0x28/0x30
[ 2017.700101] rsync           D ffff88003374b0b0     8  2621   2620 0x00000000
[ 2017.700101]  ffffc9000018b568 0000000000000086 ffff880033264298 0000000000000246
[ 2017.700101]  ffffc90007fe3410 ffff88003e10e0c0 ffff88003374ab00 ffffffffba0ffa73
[ 2017.700101]  ffffffffc000ca27 b24d44a71592c336 ffff88003374b578 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffba0ffa73>] ? wake_up_process+0x23/0x40
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc043dfab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.700101]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.700101]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.700101]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.700101]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.700101]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba200392>] handle_mm_fault+0x1002/0x1750
[ 2017.700101]  [<ffffffffba04d930>] __do_page_fault+0x220/0x6d0
[ 2017.700101]  [<ffffffffba04de99>] trace_do_page_fault+0x49/0x150
[ 2017.700101]  [<ffffffffba047dac>] do_async_page_fault+0x2c/0xa0
[ 2017.700101]  [<ffffffffba825c68>] async_page_fault+0x28/0x30
[ 2017.700101] eatram          D ffffffffba107ef0     8  2633   1200 0x00000000
[ 2017.700101]  ffffc9000045b3e8 0000000000000086 ffffffffba100e10 ffffc9000045b428
[ 2017.700101]  ffffc9000013b618 ffff88003e1e2b00 ffff88003374d600 ffffc9000045b3c8
[ 2017.700101]  ffffffffc000ca27 4512334b5eadc707 ffff88003374e078 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffba100e10>] ? get_parent_ip+0x10/0x50
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc0421716>] zfs_rmnode+0x136/0x440 [zfs]
[ 2017.700101]  [<ffffffffc044376b>] ? zfs_znode_hold_exit+0x15b/0x200 [zfs]
[ 2017.700101]  [<ffffffffc04477b2>] zfs_zinactive+0xf2/0x150 [zfs]
[ 2017.700101]  [<ffffffffc043de8b>] zfs_inactive+0x7b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.700101]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.700101]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.700101]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.700101]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.700101]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba231a48>] do_huge_pmd_anonymous_page+0x128/0x500
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffba2008b1>] handle_mm_fault+0x1521/0x1750
[ 2017.700101]  [<ffffffffba8230cf>] ? _raw_spin_unlock+0x1f/0x40
[ 2017.700101]  [<ffffffffba1f9257>] ? follow_page_pte+0x217/0x390
[ 2017.700101]  [<ffffffffba1f94ad>] ? follow_page_mask+0xdd/0x340
[ 2017.700101]  [<ffffffffba1f9869>] __get_user_pages+0x159/0x5f0
[ 2017.700101]  [<ffffffffba2060fa>] ? vma_merge+0xba/0x340
[ 2017.700101]  [<ffffffffba1fa8c4>] populate_vma_page_range+0x74/0x90
[ 2017.700101]  [<ffffffffba1fa995>] __mm_populate+0xb5/0x150
[ 2017.700101]  [<ffffffffba2041fa>] SyS_mlockall+0x16a/0x1d0
[ 2017.700101]  [<ffffffffba823d30>] entry_SYSCALL_64_fastpath+0x12/0x8a
[ 2017.700101] eatram          D ffff88003fc92140     8  2634   1200 0x00000000
[ 2017.700101]  ffffc9000050b488 0000000000000086 ffff880033264298 0000000000000000
[ 2017.700101]  0000000000000001 ffff88003374b5c0 ffff88003374eb80 0000000000000001
[ 2017.700101]  ffffffffc000ca27 e811d06fba2a52ca ffff88003374f5f8 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc043dfab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.700101]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.700101]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.700101]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.700101]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.700101]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba231a48>] do_huge_pmd_anonymous_page+0x128/0x500
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffba2008b1>] handle_mm_fault+0x1521/0x1750
[ 2017.700101]  [<ffffffffba8230cf>] ? _raw_spin_unlock+0x1f/0x40
[ 2017.700101]  [<ffffffffba1f9257>] ? follow_page_pte+0x217/0x390
[ 2017.700101]  [<ffffffffba1f94ad>] ? follow_page_mask+0xdd/0x340
[ 2017.700101]  [<ffffffffba1f9869>] __get_user_pages+0x159/0x5f0
[ 2017.700101]  [<ffffffffba2060fa>] ? vma_merge+0xba/0x340
[ 2017.700101]  [<ffffffffba1fa8c4>] populate_vma_page_range+0x74/0x90
[ 2017.700101]  [<ffffffffba1fa995>] __mm_populate+0xb5/0x150
[ 2017.700101]  [<ffffffffba2041fa>] SyS_mlockall+0x16a/0x1d0
[ 2017.700101]  [<ffffffffba823d30>] entry_SYSCALL_64_fastpath+0x12/0x8a
[ 2017.700101] eatram          D ffff88003fd12140     8  2635   1200 0x00000000
[ 2017.700101]  ffffc90000513458 0000000000000086 ffff880033264298 0000000000000000
[ 2017.700101]  0000000000000002 ffff88003da65600 ffff880033749580 0000000000000001
[ 2017.700101]  ffffffffc000ca27 07fd0528d0df1c0e ffff880033749ff8 ffff880033264290
[ 2017.700101] Call Trace:
[ 2017.700101]  [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba81dd4d>] schedule+0x3d/0x90
[ 2017.700101]  [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 2017.700101]  [<ffffffffba117e10>] ? wait_woken+0xa0/0xa0
[ 2017.700101]  [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 2017.700101]  [<ffffffffc03f61d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 2017.700101]  [<ffffffffc039bfe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 2017.700101]  [<ffffffffc039c0be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 2017.700101]  [<ffffffffc0421716>] zfs_rmnode+0x136/0x440 [zfs]
[ 2017.700101]  [<ffffffffc044376b>] ? zfs_znode_hold_exit+0x15b/0x200 [zfs]
[ 2017.700101]  [<ffffffffc04477b2>] zfs_zinactive+0xf2/0x150 [zfs]
[ 2017.700101]  [<ffffffffc043de8b>] zfs_inactive+0x7b/0x2b0 [zfs]
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffc0461c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 2017.700101]  [<ffffffffc047d580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 2017.700101]  [<ffffffffba26938f>] evict+0xbf/0x1a0
[ 2017.700101]  [<ffffffffba2694a8>] dispose_list+0x38/0x70
[ 2017.700101]  [<ffffffffba26a961>] prune_icache_sb+0x61/0x90
[ 2017.700101]  [<ffffffffba24a398>] super_cache_scan+0x278/0x450
[ 2017.700101]  [<ffffffffba1f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 2017.700101]  [<ffffffffba1de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 2017.700101]  [<ffffffffba1e0c65>] ? shrink_lruvec+0x625/0x720
[ 2017.700101]  [<ffffffffba1e0fee>] shrink_zone+0x28e/0x2c0
[ 2017.700101]  [<ffffffffba1e118b>] do_try_to_free_pages+0x16b/0x490
[ 2017.700101]  [<ffffffffba1e1591>] try_to_free_pages+0xe1/0x1c0
[ 2017.700101]  [<ffffffffba1d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 2017.700101]  [<ffffffffba231a48>] do_huge_pmd_anonymous_page+0x128/0x500
[ 2017.700101]  [<ffffffffba823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 2017.700101]  [<ffffffffba2008b1>] handle_mm_fault+0x1521/0x1750
[ 2017.700101]  [<ffffffffba8230cf>] ? _raw_spin_unlock+0x1f/0x40
[ 2017.700101]  [<ffffffffba1f9257>] ? follow_page_pte+0x217/0x390
[ 2017.700101]  [<ffffffffba1f94ad>] ? follow_page_mask+0xdd/0x340
[ 2017.700101]  [<ffffffffba1f9869>] __get_user_pages+0x159/0x5f0
[ 2017.700101]  [<ffffffffba2060fa>] ? vma_merge+0xba/0x340
[ 2017.700101]  [<ffffffffba1fa8c4>] populate_vma_page_range+0x74/0x90
[ 2017.700101]  [<ffffffffba1fa995>] __mm_populate+0xb5/0x150
[ 2017.700101]  [<ffffffffba2041fa>] SyS_mlockall+0x16a/0x1d0
[ 2017.700101]  [<ffffffffba823d30>] entry_SYSCALL_64_fastpath+0x12/0x8a

@perfinion
Copy link
Contributor Author

hang-master-4.txt

@behlendorf This is all the output, I didnt realize i could attach files instead.

@behlendorf
Copy link
Contributor

It looks like disabling the dynamic taskqs solved the initial problem as hoped. However, you quickly uncovered another way to get the txg_sync thread blocked and hang the system. In this case the txg_sync() thread is desperately trying to reclaim memory from the slab it but seems to be unable to do so, perhaps because there's nothing freeable. Possibly all in inodes which can't be released because they need to be part of a txg. It is a bit strange that the OOM killer hasn't run through.

@tuxoko
Copy link
Contributor

tuxoko commented Feb 10, 2016

@behlendorf
http://lxr.free-electrons.com/source/mm/page_alloc.c#L2727
The problem with OOM not firing is that, for !__GFP_FS, it will retry forever without calling OOM. So when every direct reclaim goes into ZFS, and txg_sync cannot reclaim memory from other places, we stuck forever.

@tuxoko
Copy link
Contributor

tuxoko commented Feb 10, 2016

@behlendorf
An interesting thing is that a lot of thread stuck at zfs_inactive when it try to get an tx for atime. It seems this is the only place we try to commit atime. I wonder why is that.

@tuxoko
Copy link
Contributor

tuxoko commented Feb 10, 2016

@perfinion
Could you try #3441 and see how things goes. It doesn't fix the deadlocks, but it might make them less frequent.

@behlendorf
Copy link
Contributor

@tuxoko you raise a good point about the atime update, this is something we inherited from Illumos and may not need at all on Linux. For Linux inode updates are supposed to be written through the sops->dirty_inode and sops->write_inode callbacks which have been added to ZoL in 8780c53. To my knowledge there is no counterpart for this on Illumos so they perform atime updates in places like zfs_inactive.

There's an excellent chance that this atime update is redundant and can be disabled. That would be nice since it would not only help performance by remove one of the most likely deadlock locations.

@tuxoko
Copy link
Contributor

tuxoko commented Feb 12, 2016

atime updating is really a mess. We have 3 places for atime: inode->i_atime, znode->z_atime and SA.
Also relatime doesn't work.

@perfinion
Copy link
Contributor Author

@behlendorf @tuxoko would mounting with noatime or lazyatime completely remove all the atime paths? if it does, i'll see if i can trigger it again then we know thats a good way to proceed and could work on removing the atime updates from zinactive.

I'll also test out the ABD patches soon when I get time and see how things go.

@behlendorf
Copy link
Contributor

@tuxoko for a long time now I've wanted to remove the atime from the znode along with all those other redundant fields. The values in the inode are always the authoritative values as far as the VFS is concerned. Take a look at the comment above zfs_inode_update().

Relatime should work if you set the property on the dataset, it may not honor the mount option though. @perfinion setting relatime won't remove all the possible deadlock paths but it may help. It would be an interesting thing to test.

zfs set relatime=on pool/fs

@tuxoko
Copy link
Contributor

tuxoko commented Feb 13, 2016

@behlendorf
A huge part of mess regarding atime comes from zfs_tstamp_update_setup, zfs_inode_update, and zfs_getattr, which behave inconsistently with those three values.

zfs_tstamp_update_setup clears z_atime_dirty unconditionally as long as you don't pass ATTR_ATIME. Which means every write(2) operation which only updates ctime and mtime will cause atime to be discard from earlier read(2) operation.

Also zfs_inode_update from write(2) will replace inode->i_atime with what's inside SA. But doesn't touch z_atime. So after read(2) and write(2). You'll have i_atime(stale), z_atime(new), SA(stale) and z_atime_dirty=0.

Now, if you do stat(2), zfs_getattr will actually replace i_atime with what's inside, z_atime. So you will have now you'll have i_atime(new), z_atime(new), SA(stale) and z_atime_dirty=0. These will all gone after umount. And you'll leave with a stale atime.

@perfinion
Copy link
Contributor Author

turns out the original tests were with relatime on. I did atime=off relatime=off now and got a hang too.

@tuxoko @behlendorf I tried ABD and also got the deadlock. Both these tests were with dynamic taskq's disabled. Interestingly, without abd when i first run a few eatram processes, the OOM killer gets them and it is a little more difficult to hit the deadlock. With ABD I hit the deadlock pretty fast. Either it was a fluke and I am just practiced at hitting it or allocating in a different way means its easier to hit?

I am not so sure anymore if atimes are as important, initially I could trigger with just reads if I was lucky but with taskq_dynamic=0, having a lot of writes makes hitting it a lot easier.

I am attaching both sets of stacktraces.
abd2-test1.txt
master-noatime.txt

@tuxoko
Copy link
Contributor

tuxoko commented Feb 17, 2016

One thing that might help is to set __GFP_NOFAIL when we are !__GFP_FS. That should force OOM killer to fire. But we also need txg_wait_open to be able to back out when it's been killed.

@perfinion
Copy link
Contributor Author

I tested this again a few weeks ago at commit 5c27b29 and it was very easy to trigger still, then just tested at commit dfbc863 and could no longer trigger this. It would slow down a lot when i ran a lot of things but it would always recover after a few seconds which is great! I have stacktraces from these tests which I can upload if they are helpful for verifying.

I was also looking through before and if I understand it correctly, it seems like dmu_tx_assign is designed to fail and deals with it correctly but txg_wait_open can never exit and dmu_tx_wait returns void too.
Even if txg_wait_open is fixed, next round will have TXG_WAITED which will still get stuck in the while() in dmu_tx_assign, is that potentially another problem? does that while loop need to back out as well?

@dweeezil
Copy link
Contributor

@perfinion The most interesting commits in the range you mention are ddae16a and 31b6111 assuming you're using a lot of xaatrs. It would be interesting to see the arcstats after running your memory-hogging program. Speaking of which, given the test system only has 1GiB of RAM, how much memory is the "a lot of ram" it's mlockall()'ing?

@perfinion
Copy link
Contributor Author

@dweeezil yeah I suspect those are the commits that helped the most but have not tested specifically around them. do you want me to?
As for eating up the ram, I would run the program many times malloc'ing 200MB each until it fails, then i'd eat 50MB at a time. and stop when the OOMkiller starts whacking things. I could hit the problem on my server(4GB ram) before without eating up the ram but this makes it much faster to hit. I just tried again in the VM and the processes are eating ~700MB/1G

localhost eatram # jobs
[1]   Running                 ./eatram $((1024 * 1024 * 200)) &
[2]   Running                 ./eatram $((1024 * 1024 * 200)) &
[4]   Running                 ./eatram $((1024 * 1024 * 200)) &
[6]   Running                 find /tank/ -exec cat {} + > /dev/null 2>&1 &
[7]   Running                 restorecon -rF /tank/test1/ 2> /dev/null &
[8]   Running                 ./eatram $((1024 * 1024 * 50)) &
[9]-  Running                 ./eatram $((1024 * 1024 * 50)) &
[10]+  Running                 find /tank/ > /dev/null 2>&1 &
localhost eatram # free -m
              total        used        free      shared  buff/cache   available
Mem:            988         770          30           0         187          32
Swap:             0           0           0
localhost eatram # zpool status; zpool list
  pool: tank
 state: ONLINE
  scan: scrub in progress since Mon Jul 18 07:23:09 2016
    109M scanned out of 598M at 1.63M/s, 0h5m to go
    0 repaired, 18.25% done
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        vda     ONLINE       0     0     0
        vdb     ONLINE       0     0     0

errors: No known data errors
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  1008M   594M   414M         -    90%    58%  1.00x  ONLINE  -
localhost eatram # arcstat.py 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
07:24:23   108    51     47    47   55     4   16    40   41    86M   32M  

After a long time, the programs finish and the system hasnt locked up. restorecon re-sets SELinux labels to it touches all the xattrs. here is arcstats:

localhost eatram # free -m
              total        used        free      shared  buff/cache   available
Mem:            988         801          33           0         153          35
Swap:             0           0           0
localhost eatram # arcstat.py 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
07:42:59     0     0      0     0    0     0    0     0    0    89M   89M  
localhost eatram # cat /proc/spl/kstat/zfs/arcstats 
7 1 0x01 94 4512 4543498836 2415644849730
name                            type data
hits                            4    1407265
misses                          4    2296094
demand_data_hits                4    601
demand_data_misses              4    942464
demand_metadata_hits            4    984754
demand_metadata_misses          4    1160504
prefetch_data_hits              4    0
prefetch_data_misses            4    438
prefetch_metadata_hits          4    421910
prefetch_metadata_misses        4    192688
mru_hits                        4    453425
mru_ghost_hits                  4    122672
mfu_hits                        4    540852
mfu_ghost_hits                  4    177683
deleted                         4    1214296
mutex_miss                      4    11265
evict_skip                      4    111199270
evict_not_enough                4    1695494
evict_l2_cached                 4    0
evict_l2_eligible               4    4191368704
evict_l2_ineligible             4    2189959168
evict_l2_skip                   4    0
hash_elements                   4    11994
hash_elements_max               4    42731
hash_collisions                 4    111375
hash_chains                     4    476
hash_chain_max                  4    5
p                               4    66091282
c                               4    94335064
c_min                           4    33554432
c_max                           4    518037504
size                            4    94169528
hdr_size                        4    7528360
data_size                       4    0
metadata_size                   4    74526720
other_size                      4    12114448
anon_size                       4    18432
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    6723584
mru_evictable_data              4    0
mru_evictable_metadata          4    3051520
mru_ghost_size                  4    66467840
mru_ghost_evictable_data        4    2246656
mru_ghost_evictable_metadata    4    64221184
mfu_size                        4    67784704
mfu_evictable_data              4    0
mfu_evictable_metadata          4    66830336
mfu_ghost_size                  4    18563072
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    18563072
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_writes_skip_toobig           4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    25789
memory_direct_count             4    4090
memory_indirect_count           4    17928
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    94169528
arc_meta_limit                  4    388528128
arc_meta_max                    4    134267952
arc_meta_min                    4    16777216
sync_wait_for_async             4    161
demand_hit_predictive_prefetch  4    2973
arc_need_free                   4    0
arc_sys_free                    4    16187392

here is eatram.c, I needed to mlockall and actually write to all the ram otherwise the overcommit kept giving more ram.

int main(int argc, char **argv)
{
    int ret;
    int size = atoi(argv[1]);
    int pid = getpid();
    char *p = malloc(size);
    char *pend = p + size;
    char *i = p;

    printf("eating ram pid=%d size=%d p=%p\n", pid, size, p);
    if (p == NULL)
        return 1;

    ret = mlockall(MCL_CURRENT);
    printf("mlock = %d\n", ret);

    while (i < pend)
        *i++ = pid++;

    printf("sleeping p = %p; pend = %p\n", p, pend);
    while(1)
        sleep(1000);
    return 0;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Memory Management kernel memory management Status: Inactive Not being actively updated Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

5 participants
@tuxoko @behlendorf @perfinion @dweeezil and others