iput_async() can deadlock in direct reclaim #3055

Closed
ryao opened this issue Jan 29, 2015 · 2 comments
Labels: Component: Memory Management (kernel memory management)
Milestone: 0.6.4

Comments

ryao (Contributor) commented Jan 29, 2015

I spotted the following in the build bot output:

INFO: task zfs_iput_taskq/:619 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-504.3.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zfs_iput_task D 0000000000000003     0   619      2 0x00000080
 ffff8800343caca0 0000000000000046 ffff8800343cac20 0000000000000140
 ffff8800343cac40 ffffffff81041e98 ffff8800343cac20 ffffffff8152c126
 ffff8800343cac30 ffffffff8100bb8e ffff88007bd5e5f8 ffff8800343cbfd8
Call Trace:
 [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8152c126>] ? down_read+0x16/0x30
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8152b306>] __mutex_lock_slowpath+0x96/0x210
 [<ffffffffa03333b8>] ? refcount_remove_many+0x158/0x230 [zfs]
 [<ffffffff8152ae2b>] mutex_lock+0x2b/0x50
 [<ffffffffa02d8655>] dbuf_find+0x85/0x220 [zfs]
 [<ffffffffa02df30f>] __dbuf_hold_impl+0x11f/0x9e0 [zfs]
 [<ffffffff811750ad>] ? __kmalloc_node+0x4d/0x60
 [<ffffffffa02dfc4d>] dbuf_hold_impl+0x7d/0xb0 [zfs]
 [<ffffffffa02e2c10>] dbuf_hold+0x20/0x30 [zfs]
 [<ffffffffa02ec1c7>] dmu_buf_hold_noread+0x87/0x220 [zfs]
 [<ffffffffa02ec39b>] dmu_buf_hold+0x3b/0x90 [zfs]
 [<ffffffffa036c341>] zap_get_leaf_byblk+0x81/0x740 [zfs]
 [<ffffffffa02ebfa5>] ? dmu_object_info_from_dnode+0x145/0x220 [zfs]
 [<ffffffffa036cad6>] zap_deref_leaf+0xd6/0x170 [zfs]
 [<ffffffffa036d0c7>] fzap_remove+0x37/0xb0 [zfs]
 [<ffffffffa0371ab4>] ? zap_name_alloc+0x84/0xe0 [zfs]
 [<ffffffffa037520b>] zap_remove_norm+0x1db/0x2c0 [zfs]
 [<ffffffff81296934>] ? snprintf+0x34/0x40
 [<ffffffffa0375303>] zap_remove+0x13/0x20 [zfs]
 [<ffffffffa036a8a1>] zap_remove_int+0x61/0x90 [zfs]
 [<ffffffffa037fe47>] zfs_rmnode+0x227/0x430 [zfs]
 [<ffffffffa03a7166>] zfs_zinactive+0x116/0x260 [zfs]
 [<ffffffffa03a1a8f>] zfs_inactive+0x7f/0x380 [zfs]
 [<ffffffffa03a6c0d>] ? zfs_inode_destroy+0x13d/0x1d0 [zfs]
 [<ffffffffa03c134e>] zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811ab96c>] clear_inode+0xac/0x140
 [<ffffffff811aba40>] dispose_list+0x40/0x120
 [<ffffffff811abd94>] shrink_icache_memory+0x274/0x2e0
 [<ffffffff8113d4ba>] shrink_slab+0x12a/0x1a0
 [<ffffffff8113f8c7>] do_try_to_free_pages+0x3f7/0x610
 [<ffffffff8113fcb2>] try_to_free_pages+0x92/0x120
 [<ffffffff811340be>] __alloc_pages_nodemask+0x47e/0x8d0
 [<ffffffff81173332>] kmem_getpages+0x62/0x170
 [<ffffffff81173f4a>] fallback_alloc+0x1ba/0x270
 [<ffffffff8117399f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81173cc9>] ____cache_alloc_node+0x99/0x160
 [<ffffffff81174c4b>] kmem_cache_alloc+0x11b/0x190
 [<ffffffffa023894b>] spl_kmem_cache_alloc+0xab/0xee0 [spl]
 [<ffffffffa02c96f8>] ? buf_cons+0x48/0x60 [zfs]
 [<ffffffffa0238bab>] ? spl_kmem_cache_alloc+0x30b/0xee0 [spl]
 [<ffffffffa03330c8>] ? refcount_add_many+0x98/0x150 [zfs]
 [<ffffffff8152ae1e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8152ae1e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa03b1c1b>] zio_buf_alloc+0x5b/0x70 [zfs]
 [<ffffffffa02cf660>] arc_get_data_buf+0x4f0/0x690 [zfs]
 [<ffffffffa02cfa29>] arc_buf_alloc+0x129/0x210 [zfs]
 [<ffffffffa02df881>] __dbuf_hold_impl+0x691/0x9e0 [zfs]
 [<ffffffff811750ad>] ? __kmalloc_node+0x4d/0x60
 [<ffffffffa02dfc4d>] dbuf_hold_impl+0x7d/0xb0 [zfs]
 [<ffffffffa02e2c10>] dbuf_hold+0x20/0x30 [zfs]
 [<ffffffffa02ec1c7>] dmu_buf_hold_noread+0x87/0x220 [zfs]
 [<ffffffffa02ec39b>] dmu_buf_hold+0x3b/0x90 [zfs]
 [<ffffffffa036c341>] zap_get_leaf_byblk+0x81/0x740 [zfs]
 [<ffffffffa02ebfa5>] ? dmu_object_info_from_dnode+0x145/0x220 [zfs]
 [<ffffffffa036cad6>] zap_deref_leaf+0xd6/0x170 [zfs]
 [<ffffffffa036d0c7>] fzap_remove+0x37/0xb0 [zfs]
 [<ffffffffa0371ab4>] ? zap_name_alloc+0x84/0xe0 [zfs]
 [<ffffffffa037520b>] zap_remove_norm+0x1db/0x2c0 [zfs]
 [<ffffffff81296934>] ? snprintf+0x34/0x40
 [<ffffffffa0375303>] zap_remove+0x13/0x20 [zfs]
 [<ffffffffa036a8a1>] zap_remove_int+0x61/0x90 [zfs]
 [<ffffffffa037fe47>] zfs_rmnode+0x227/0x430 [zfs]
 [<ffffffffa03a7166>] zfs_zinactive+0x116/0x260 [zfs]
 [<ffffffffa03a1a8f>] zfs_inactive+0x7f/0x380 [zfs]
 [<ffffffffa03c14c0>] ? zpl_inode_delete+0x0/0x30 [zfs]
 [<ffffffffa03c134e>] zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811ab96c>] clear_inode+0xac/0x140
 [<ffffffffa03c14e0>] zpl_inode_delete+0x20/0x30 [zfs]
 [<ffffffff811ac06e>] generic_delete_inode+0xde/0x1d0
 [<ffffffff811ac1c5>] generic_drop_inode+0x65/0x80
 [<ffffffff811ab012>] iput+0x62/0x70
 [<ffffffffa023c872>] taskq_thread+0x202/0x480 [spl]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa023c670>] ? taskq_thread+0x0/0x480 [spl]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

This deadlock is of a similar form to the one in #3050. Specifically, we are holding db->db_mtx in iput(), perform a memory allocation that triggers direct reclaim, and then try to evict the very inode we are already evicting. Of course, that eviction wants db->db_mtx as well, so we deadlock. spl_fstrans_mark()/spl_fstrans_unmark() could prevent this.

That said, it seems like we need to do some tracing to find all locks taken in direct reclaim paths and make sure that every allocation performed while holding those locks is protected by spl_fstrans_mark().
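
For reference, a minimal sketch of that pattern using the existing SPL spl_fstrans_mark()/spl_fstrans_unmark() API; the function and lock in the example are illustrative, not actual ZFS code. The intent is that while PF_FSTRANS is set, SPL allocations drop __GFP_FS, so direct reclaim cannot re-enter the filesystem and try to take a lock the allocating thread already holds:

/*
 * Illustrative only: an allocation made while holding a ZFS-internal lock.
 * Marking the task with PF_FSTRANS keeps direct reclaim from recursing back
 * into inode eviction (and hence back into this lock) from the allocation.
 */
static void
example_alloc_under_lock(kmutex_t *mp, size_t size)
{
        fstrans_cookie_t cookie;
        void *buf;

        mutex_enter(mp);

        cookie = spl_fstrans_mark();            /* set PF_FSTRANS             */
        buf = kmem_alloc(size, KM_SLEEP);       /* reclaim stays out of the FS */
        spl_fstrans_unmark(cookie);             /* restore the previous state  */

        /* ... use buf while the lock is still held ... */

        kmem_free(buf, size);
        mutex_exit(mp);
}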

ryao (Contributor, Author) commented Jan 29, 2015

My original theory was not correct. There is code in the Linux VFS to prevent concurrent threads from trying to evict the same inode. kswapd0 is also blocked in the back traces:

INFO: task kswapd0:59 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-504.3.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D 0000000000000001     0    59      2 0x00000000
 ffff88007a705650 0000000000000046 ffff88007a705618 ffff88007a705614
 ffff88007a7055f0 ffff88007f823280 000003a5b7dfda56 ffff8800022158c0
 000000000000003f 0000000100389d07 ffff88007a6e1058 ffff88007a705fd8
Call Trace:
 [<ffffffff8152b306>] __mutex_lock_slowpath+0x96/0x210
 [<ffffffffa03333b8>] ? refcount_remove_many+0x158/0x230 [zfs]
 [<ffffffff8152ae2b>] mutex_lock+0x2b/0x50
 [<ffffffffa02d86d0>] dbuf_find+0x100/0x220 [zfs]
 [<ffffffffa02df30f>] __dbuf_hold_impl+0x11f/0x9e0 [zfs]
 [<ffffffff811750ad>] ? __kmalloc_node+0x4d/0x60
 [<ffffffffa02dfc4d>] dbuf_hold_impl+0x7d/0xb0 [zfs]
 [<ffffffffa02e2c10>] dbuf_hold+0x20/0x30 [zfs]
 [<ffffffffa02ec1c7>] dmu_buf_hold_noread+0x87/0x220 [zfs]
 [<ffffffffa02ec39b>] dmu_buf_hold+0x3b/0x90 [zfs]
 [<ffffffffa036c341>] zap_get_leaf_byblk+0x81/0x740 [zfs]
 [<ffffffffa02ebfa5>] ? dmu_object_info_from_dnode+0x145/0x220 [zfs]
 [<ffffffffa036cad6>] zap_deref_leaf+0xd6/0x170 [zfs]
 [<ffffffffa036d0c7>] fzap_remove+0x37/0xb0 [zfs]
 [<ffffffffa0371ab4>] ? zap_name_alloc+0x84/0xe0 [zfs]
 [<ffffffffa037520b>] zap_remove_norm+0x1db/0x2c0 [zfs]
 [<ffffffff81296934>] ? snprintf+0x34/0x40
 [<ffffffffa0375303>] zap_remove+0x13/0x20 [zfs]
 [<ffffffffa036a8a1>] zap_remove_int+0x61/0x90 [zfs]
 [<ffffffffa037fe47>] zfs_rmnode+0x227/0x430 [zfs]
 [<ffffffffa03a7166>] zfs_zinactive+0x116/0x260 [zfs]
 [<ffffffffa03a1a8f>] zfs_inactive+0x7f/0x380 [zfs]
 [<ffffffffa03a6c0d>] ? zfs_inode_destroy+0x13d/0x1d0 [zfs]
 [<ffffffffa03c134e>] zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811ab96c>] clear_inode+0xac/0x140
 [<ffffffff811aba40>] dispose_list+0x40/0x120
 [<ffffffff811abd94>] shrink_icache_memory+0x274/0x2e0
 [<ffffffff8113d4ba>] shrink_slab+0x12a/0x1a0
 [<ffffffff8114082a>] balance_pgdat+0x57a/0x800
 [<ffffffff811468b6>] ? set_pgdat_percpu_threshold+0xa6/0xd0
 [<ffffffff81140be4>] kswapd+0x134/0x3b0
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81140ab0>] ? kswapd+0x0/0x3b0
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

http://buildbot.zfsonlinux.org/builders/centos-6.5-x86_64-builder/builds/2537/steps/shell_16/logs/stdio

It is not 100% clear what happened, but it looks like one thread blocked on DBUF_HASH_MUTEX(h, idx) in dbuf_find() while the other thread already held that hash mutex and was itself blocked on a db->db_mtx, despite already holding the db->db_mtx of another inode's dbuf.

What is clear is that allowing direct reclaim inside inode eviction permits arbitrarily deep stack recursion, so my original idea of making sure all allocations in eviction paths run under spl_fstrans_mark() is worth implementing. It might even prevent the deadlock that occurred on the buildbot, but it is not clear to me how that deadlock happened. It would not make sense for two znodes to share the same dbuf, and the fact that spinlocks act as memory barriers on amd64 should have made I_FREEING visible.

behlendorf added this to the 0.6.4 milestone Jan 29, 2015
behlendorf (Contributor) commented

@ryao in the zfs_iput_taskq back trace dbuf_find() is blocked on the DBUF_HASH_MUTEX and not a db->db_mtx mutex. The kswapd process, on the other hand, is blocked on a db->db_mtx in dbuf_find() and could very easily be holding the DBUF_HASH_MUTEX needed by zfs_iput_taskq. And to come full circle, I suspect zfs_iput_taskq is holding the needed db->db_mtx farther up the stack in the earlier dbuf_hold(). (This is convoluted, but it explains the deadlock.)
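
Laid out as a timeline, the suspected cycle looks like this; which thread holds which lock is a reconstruction from the two back traces and the suspicion above, not something confirmed by additional instrumentation:

 zfs_iput_taskq/619                            kswapd0/59
 ------------------                            ----------
 iput() -> zfs_rmnode() -> dbuf_hold()
   takes a db->db_mtx                  (A)
   arc_buf_alloc() -> direct reclaim           shrink_icache_memory()
   -> shrink_icache_memory()                   -> zfs_rmnode() -> dbuf_find()
   -> zfs_rmnode() -> dbuf_find()                 takes DBUF_HASH_MUTEX        (B)
      wants DBUF_HASH_MUTEX (B): blocked          wants that db->db_mtx (A): blocked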

behlendorf added the Component: Memory Management (kernel memory management) label Jan 29, 2015
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 26, 2015
There are regions in the ZFS code where it is desirable to be able
to set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have a future change introduce a
   new one.

Both of these concerns can be addressed by adding a new mutex type
which is responsible for managing PF_FSTRANS.  This lets us make
a small change to the ZFS source where the mutex is initialized
and then be certain that all future use of that mutex will be safe.

NOTES: The ht_locks are no longer aligned on 64-byte boundaries.
We've never studied if this is actually critical for performance
when there are a large number of hash buckets.  The dbuf hash has
never made this optimization.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3050
Issue openzfs#3055
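
A conceptual sketch of what a PF_FSTRANS-managing mutex type like the one described in this commit message could look like; the type name, fields, and wrappers here are illustrative assumptions, not the actual SPL implementation:

/*
 * Conceptual sketch only (illustrative names).  A lock built this way sets
 * PF_FSTRANS for as long as it is held, so every allocation made while
 * holding it implicitly behaves like GFP_NOFS and direct reclaim cannot
 * recurse back into the filesystem.
 */
typedef struct fstrans_mutex {
        kmutex_t                fm_mutex;       /* ordinary SPL mutex       */
        fstrans_cookie_t        fm_cookie;      /* saved PF_FSTRANS state   */
} fstrans_mutex_t;

static inline void
fstrans_mutex_enter(fstrans_mutex_t *fm)
{
        mutex_enter(&fm->fm_mutex);
        fm->fm_cookie = spl_fstrans_mark();     /* set PF_FSTRANS           */
}

static inline void
fstrans_mutex_exit(fstrans_mutex_t *fm)
{
        spl_fstrans_unmark(fm->fm_cookie);      /* restore previous state   */
        mutex_exit(&fm->fm_mutex);
}

With something along these lines, ZFS itself would only need a change where each affected lock is declared and initialized, and every allocation performed under that lock would be covered automatically, which is the maintainability argument made in the commit message above.
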
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 27, 2015
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Mar 2, 2015
snajpa pushed a commit to vpsfreecz/zfs that referenced this issue Mar 3, 2015