iput_async() can deadlock in direct reclaim #3055

Closed
ryao opened this issue Jan 29, 2015 · 2 comments
Labels: Component: Memory Management (kernel memory management)
Milestone: 0.6.4

Comments

ryao (Contributor) commented Jan 29, 2015

I spotted the following in the build bot output:

INFO: task zfs_iput_taskq/:619 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-504.3.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zfs_iput_task D 0000000000000003     0   619      2 0x00000080
 ffff8800343caca0 0000000000000046 ffff8800343cac20 0000000000000140
 ffff8800343cac40 ffffffff81041e98 ffff8800343cac20 ffffffff8152c126
 ffff8800343cac30 ffffffff8100bb8e ffff88007bd5e5f8 ffff8800343cbfd8
Call Trace:
 [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8152c126>] ? down_read+0x16/0x30
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8152b306>] __mutex_lock_slowpath+0x96/0x210
 [<ffffffffa03333b8>] ? refcount_remove_many+0x158/0x230 [zfs]
 [<ffffffff8152ae2b>] mutex_lock+0x2b/0x50
 [<ffffffffa02d8655>] dbuf_find+0x85/0x220 [zfs]
 [<ffffffffa02df30f>] __dbuf_hold_impl+0x11f/0x9e0 [zfs]
 [<ffffffff811750ad>] ? __kmalloc_node+0x4d/0x60
 [<ffffffffa02dfc4d>] dbuf_hold_impl+0x7d/0xb0 [zfs]
 [<ffffffffa02e2c10>] dbuf_hold+0x20/0x30 [zfs]
 [<ffffffffa02ec1c7>] dmu_buf_hold_noread+0x87/0x220 [zfs]
 [<ffffffffa02ec39b>] dmu_buf_hold+0x3b/0x90 [zfs]
 [<ffffffffa036c341>] zap_get_leaf_byblk+0x81/0x740 [zfs]
 [<ffffffffa02ebfa5>] ? dmu_object_info_from_dnode+0x145/0x220 [zfs]
 [<ffffffffa036cad6>] zap_deref_leaf+0xd6/0x170 [zfs]
 [<ffffffffa036d0c7>] fzap_remove+0x37/0xb0 [zfs]
 [<ffffffffa0371ab4>] ? zap_name_alloc+0x84/0xe0 [zfs]
 [<ffffffffa037520b>] zap_remove_norm+0x1db/0x2c0 [zfs]
 [<ffffffff81296934>] ? snprintf+0x34/0x40
 [<ffffffffa0375303>] zap_remove+0x13/0x20 [zfs]
 [<ffffffffa036a8a1>] zap_remove_int+0x61/0x90 [zfs]
 [<ffffffffa037fe47>] zfs_rmnode+0x227/0x430 [zfs]
 [<ffffffffa03a7166>] zfs_zinactive+0x116/0x260 [zfs]
 [<ffffffffa03a1a8f>] zfs_inactive+0x7f/0x380 [zfs]
 [<ffffffffa03a6c0d>] ? zfs_inode_destroy+0x13d/0x1d0 [zfs]
 [<ffffffffa03c134e>] zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811ab96c>] clear_inode+0xac/0x140
 [<ffffffff811aba40>] dispose_list+0x40/0x120
 [<ffffffff811abd94>] shrink_icache_memory+0x274/0x2e0
 [<ffffffff8113d4ba>] shrink_slab+0x12a/0x1a0
 [<ffffffff8113f8c7>] do_try_to_free_pages+0x3f7/0x610
 [<ffffffff8113fcb2>] try_to_free_pages+0x92/0x120
 [<ffffffff811340be>] __alloc_pages_nodemask+0x47e/0x8d0
 [<ffffffff81173332>] kmem_getpages+0x62/0x170
 [<ffffffff81173f4a>] fallback_alloc+0x1ba/0x270
 [<ffffffff8117399f>] ? cache_grow+0x2cf/0x320
 [<ffffffff81173cc9>] ____cache_alloc_node+0x99/0x160
 [<ffffffff81174c4b>] kmem_cache_alloc+0x11b/0x190
 [<ffffffffa023894b>] spl_kmem_cache_alloc+0xab/0xee0 [spl]
 [<ffffffffa02c96f8>] ? buf_cons+0x48/0x60 [zfs]
 [<ffffffffa0238bab>] ? spl_kmem_cache_alloc+0x30b/0xee0 [spl]
 [<ffffffffa03330c8>] ? refcount_add_many+0x98/0x150 [zfs]
 [<ffffffff8152ae1e>] ? mutex_lock+0x1e/0x50
 [<ffffffff8152ae1e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa03b1c1b>] zio_buf_alloc+0x5b/0x70 [zfs]
 [<ffffffffa02cf660>] arc_get_data_buf+0x4f0/0x690 [zfs]
 [<ffffffffa02cfa29>] arc_buf_alloc+0x129/0x210 [zfs]
 [<ffffffffa02df881>] __dbuf_hold_impl+0x691/0x9e0 [zfs]
 [<ffffffff811750ad>] ? __kmalloc_node+0x4d/0x60
 [<ffffffffa02dfc4d>] dbuf_hold_impl+0x7d/0xb0 [zfs]
 [<ffffffffa02e2c10>] dbuf_hold+0x20/0x30 [zfs]
 [<ffffffffa02ec1c7>] dmu_buf_hold_noread+0x87/0x220 [zfs]
 [<ffffffffa02ec39b>] dmu_buf_hold+0x3b/0x90 [zfs]
 [<ffffffffa036c341>] zap_get_leaf_byblk+0x81/0x740 [zfs]
 [<ffffffffa02ebfa5>] ? dmu_object_info_from_dnode+0x145/0x220 [zfs]
 [<ffffffffa036cad6>] zap_deref_leaf+0xd6/0x170 [zfs]
 [<ffffffffa036d0c7>] fzap_remove+0x37/0xb0 [zfs]
 [<ffffffffa0371ab4>] ? zap_name_alloc+0x84/0xe0 [zfs]
 [<ffffffffa037520b>] zap_remove_norm+0x1db/0x2c0 [zfs]
 [<ffffffff81296934>] ? snprintf+0x34/0x40
 [<ffffffffa0375303>] zap_remove+0x13/0x20 [zfs]
 [<ffffffffa036a8a1>] zap_remove_int+0x61/0x90 [zfs]
 [<ffffffffa037fe47>] zfs_rmnode+0x227/0x430 [zfs]
 [<ffffffffa03a7166>] zfs_zinactive+0x116/0x260 [zfs]
 [<ffffffffa03a1a8f>] zfs_inactive+0x7f/0x380 [zfs]
 [<ffffffffa03c14c0>] ? zpl_inode_delete+0x0/0x30 [zfs]
 [<ffffffffa03c134e>] zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811ab96c>] clear_inode+0xac/0x140
 [<ffffffffa03c14e0>] zpl_inode_delete+0x20/0x30 [zfs]
 [<ffffffff811ac06e>] generic_delete_inode+0xde/0x1d0
 [<ffffffff811ac1c5>] generic_drop_inode+0x65/0x80
 [<ffffffff811ab012>] iput+0x62/0x70
 [<ffffffffa023c872>] taskq_thread+0x202/0x480 [spl]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa023c670>] ? taskq_thread+0x0/0x480 [spl]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

This deadlock is of a similar form to the one in #3050. Specifically, we are holding db->db_mtx in iput(), perform a memory allocation that triggers direct reclaim, and then try to evict the very inode we are already evicting. Of course, that eviction wants db->db_mtx as well, so we deadlock. spl_fstrans_mark()/spl_fstrans_unmark() could prevent this.

That said, it seems like we need to do some tracing to find all locks taken in direct reclaim paths and make sure that every allocation performed while holding those locks is protected by spl_fstrans_mark().
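
For reference, a minimal sketch of that pattern using the existing SPL spl_fstrans_mark()/spl_fstrans_unmark() API; the function and lock in the example are illustrative, not actual ZFS code. The intent is that while PF_FSTRANS is set, SPL allocations drop __GFP_FS, so direct reclaim cannot re-enter the filesystem and try to take a lock the allocating thread already holds:

/*
 * Illustrative only: an allocation made while holding a ZFS-internal lock.
 * Marking the task with PF_FSTRANS keeps direct reclaim from recursing back
 * into inode eviction (and hence back into this lock) from the allocation.
 */
static void
example_alloc_under_lock(kmutex_t *mp, size_t size)
{
        fstrans_cookie_t cookie;
        void *buf;

        mutex_enter(mp);

        cookie = spl_fstrans_mark();            /* set PF_FSTRANS             */
        buf = kmem_alloc(size, KM_SLEEP);       /* reclaim stays out of the FS */
        spl_fstrans_unmark(cookie);             /* restore the previous state  */

        /* ... use buf while the lock is still held ... */

        kmem_free(buf, size);
        mutex_exit(mp);
}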

ryao (Contributor, Author) commented Jan 29, 2015

My original theory was not correct. There is code in the Linux VFS to prevent concurrent threads from trying to evict the same inode. kswapd0 is also blocked in the back traces:

INFO: task kswapd0:59 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-504.3.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D 0000000000000001     0    59      2 0x00000000
 ffff88007a705650 0000000000000046 ffff88007a705618 ffff88007a705614
 ffff88007a7055f0 ffff88007f823280 000003a5b7dfda56 ffff8800022158c0
 000000000000003f 0000000100389d07 ffff88007a6e1058 ffff88007a705fd8
Call Trace:
 [<ffffffff8152b306>] __mutex_lock_slowpath+0x96/0x210
 [<ffffffffa03333b8>] ? refcount_remove_many+0x158/0x230 [zfs]
 [<ffffffff8152ae2b>] mutex_lock+0x2b/0x50
 [<ffffffffa02d86d0>] dbuf_find+0x100/0x220 [zfs]
 [<ffffffffa02df30f>] __dbuf_hold_impl+0x11f/0x9e0 [zfs]
 [<ffffffff811750ad>] ? __kmalloc_node+0x4d/0x60
 [<ffffffffa02dfc4d>] dbuf_hold_impl+0x7d/0xb0 [zfs]
 [<ffffffffa02e2c10>] dbuf_hold+0x20/0x30 [zfs]
 [<ffffffffa02ec1c7>] dmu_buf_hold_noread+0x87/0x220 [zfs]
 [<ffffffffa02ec39b>] dmu_buf_hold+0x3b/0x90 [zfs]
 [<ffffffffa036c341>] zap_get_leaf_byblk+0x81/0x740 [zfs]
 [<ffffffffa02ebfa5>] ? dmu_object_info_from_dnode+0x145/0x220 [zfs]
 [<ffffffffa036cad6>] zap_deref_leaf+0xd6/0x170 [zfs]
 [<ffffffffa036d0c7>] fzap_remove+0x37/0xb0 [zfs]
 [<ffffffffa0371ab4>] ? zap_name_alloc+0x84/0xe0 [zfs]
 [<ffffffffa037520b>] zap_remove_norm+0x1db/0x2c0 [zfs]
 [<ffffffff81296934>] ? snprintf+0x34/0x40
 [<ffffffffa0375303>] zap_remove+0x13/0x20 [zfs]
 [<ffffffffa036a8a1>] zap_remove_int+0x61/0x90 [zfs]
 [<ffffffffa037fe47>] zfs_rmnode+0x227/0x430 [zfs]
 [<ffffffffa03a7166>] zfs_zinactive+0x116/0x260 [zfs]
 [<ffffffffa03a1a8f>] zfs_inactive+0x7f/0x380 [zfs]
 [<ffffffffa03a6c0d>] ? zfs_inode_destroy+0x13d/0x1d0 [zfs]
 [<ffffffffa03c134e>] zpl_clear_inode+0xe/0x10 [zfs]
 [<ffffffff811ab96c>] clear_inode+0xac/0x140
 [<ffffffff811aba40>] dispose_list+0x40/0x120
 [<ffffffff811abd94>] shrink_icache_memory+0x274/0x2e0
 [<ffffffff8113d4ba>] shrink_slab+0x12a/0x1a0
 [<ffffffff8114082a>] balance_pgdat+0x57a/0x800
 [<ffffffff811468b6>] ? set_pgdat_percpu_threshold+0xa6/0xd0
 [<ffffffff81140be4>] kswapd+0x134/0x3b0
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81140ab0>] ? kswapd+0x0/0x3b0
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

http://buildbot.zfsonlinux.org/builders/centos-6.5-x86_64-builder/builds/2537/steps/shell_16/logs/stdio

It is not 100% clear what happened, but it looks like one thread blocked on DBUF_HASH_MUTEX(h, idx) in dbuf_find() while the other thread already held that hash mutex and was itself blocked on a db->db_mtx, despite already holding the db->db_mtx of another inode's dbuf.

What is clear is that allowing direct reclaim inside inode eviction permits arbitrarily deep stack recursion, so my original idea of making sure all allocations in eviction paths run under spl_fstrans_mark() is worth implementing. It might even prevent the deadlock that occurred on the buildbot, but it is not clear to me how that deadlock happened. It would not make sense for two znodes to share the same dbuf, and the fact that spinlocks act as memory barriers on amd64 should have made I_FREEING visible.

behlendorf added this to the 0.6.4 milestone Jan 29, 2015
behlendorf (Contributor) commented

@ryao in the zfs_iput_taskq back trace dbuf_find() is blocked on the DBUF_HASH_MUTEX and not a db->db_mtx mutex. The kswapd process, on the other hand, is blocked on a db->db_mtx in dbuf_find() and could very easily be holding the DBUF_HASH_MUTEX needed by zfs_iput_taskq. And to come full circle, I suspect zfs_iput_taskq is holding the needed db->db_mtx farther up the stack in the earlier dbuf_hold(). (This is convoluted, but it explains the deadlock.)
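
Laid out as a timeline, the suspected cycle looks like this; which thread holds which lock is a reconstruction from the two back traces and the suspicion above, not something confirmed by additional instrumentation:

 zfs_iput_taskq/619                            kswapd0/59
 ------------------                            ----------
 iput() -> zfs_rmnode() -> dbuf_hold()
   takes a db->db_mtx                  (A)
   arc_buf_alloc() -> direct reclaim           shrink_icache_memory()
   -> shrink_icache_memory()                   -> zfs_rmnode() -> dbuf_find()
   -> zfs_rmnode() -> dbuf_find()                 takes DBUF_HASH_MUTEX        (B)
      wants DBUF_HASH_MUTEX (B): blocked          wants that db->db_mtx (A): blocked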

behlendorf added the Component: Memory Management (kernel memory management) label Jan 29, 2015
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 26, 2015
There are regions in the ZFS code where it is desirable to be able
to set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have a future change introduce a
   new one.

Both of these concerns can be addressed by adding a new mutex type
which is responsible for managing PF_FSTRANS.  This lets us make
a small change to the ZFS source where the mutex is initialized
and then be certain that all future use of that mutex will be safe.

NOTES: The ht_locks are no longer aligned on 64-byte boundaries.
We've never studied if this is actually critical for performance
when there are a large number of hash buckets.  The dbuf hash has
never made this optimization.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#3050
Issue openzfs#3055
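
A conceptual sketch of what a PF_FSTRANS-managing mutex type like the one described in this commit message could look like; the type name, fields, and wrappers here are illustrative assumptions, not the actual SPL implementation:

/*
 * Conceptual sketch only (illustrative names).  A lock built this way sets
 * PF_FSTRANS for as long as it is held, so every allocation made while
 * holding it implicitly behaves like GFP_NOFS and direct reclaim cannot
 * recurse back into the filesystem.
 */
typedef struct fstrans_mutex {
        kmutex_t                fm_mutex;       /* ordinary SPL mutex       */
        fstrans_cookie_t        fm_cookie;      /* saved PF_FSTRANS state   */
} fstrans_mutex_t;

static inline void
fstrans_mutex_enter(fstrans_mutex_t *fm)
{
        mutex_enter(&fm->fm_mutex);
        fm->fm_cookie = spl_fstrans_mark();     /* set PF_FSTRANS           */
}

static inline void
fstrans_mutex_exit(fstrans_mutex_t *fm)
{
        spl_fstrans_unmark(fm->fm_cookie);      /* restore previous state   */
        mutex_exit(&fm->fm_mutex);
}

With something along these lines, ZFS itself would only need a change where each affected lock is declared and initialized, and every allocation performed under that lock would be covered automatically, which is the maintainability argument made in the commit message above.
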
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 27, 2015
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Mar 2, 2015
snajpa pushed a commit to vpsfreecz/zfs that referenced this issue Mar 3, 2015