Continued deadlocks following kmem-rework merge #3183
This patch helps the most common cases. I can still get other deadlocks, however.
For example, the RHEL6 kernel is built without CONFIG_PARAVIRT_SPINLOCKS.
The patches in dweeezil/zfs@f04f972 make my test system much more resistant to deadlocks, but there are still some remaining. I'd also like to note that blaming this on the "kmem-rework merge" is not totally accurate. This class of deadlocks is more correctly related to deprecating KM_PUSHPAGE and relying instead on the correct use of PF_FSTRANS. The merge simply gave us the ability to use the PF_FSTRANS flag.
I'm withdrawing my last comment. More information as it becomes available.
In a previous posting, I suggested these deadlocks would also occur prior to the kmem-rework merge. That actually is not quite the case; however, there is some definite Bad Behavior when running these stress tests pre-kmem-rework merge (SPL at openzfs/spl@47af4b7 and ZFS at d958324) on a system with memory limited to 4GiB. Sorry for the confusion (I wasn't running the versions I thought I was due to a "+" sneaking into the localversion of the kernel under which I wanted to run the tests).
@dweeezil
Not exactly sure under which issue or pull request to put this, but http://marc.info/?l=linux-kernel&m=142601686103000&w=2 seems related and could be helpful in countering some lockups; if not directly, then to get better knowledge of kernel internals...
Without addressing either @tuxoko's last question or the posting referred to by @kernelOfTruth (both of which I'd like to address ASAP), here's a quick update on this issue: although the patch(es) in https://github.com/dweeezil/zfs/tree/post-kmem-rework-deadlocks have made my test system survive the stress testing much better, there are still some more deadlocks I can reproduce and need to track down (not counting @tuxoko's L2ARC fix). One remaining issue appears to involve ...

I'm more than a little concerned about the manner in which these deadlocks need to be addressed. Previously, as @behlendorf has mentioned, a large number of allocations were under the (old semantics of) ...

I don't want to suggest we revert to the old way. In fact, running my stress tests on pre-kmem-rework code results in other Very Bad Things happening (lots of write issue threads spinning, etc.), so we're definitely heading in the right direction.

Finally, I'd like to address something tangentially related to @tuxoko's query as to the purpose of the ...
First off thanks @dweeezil and @tuxoko for taking the time to look at this issue. Let me try and shed some light on this.
Good question, and I wish I had an authoritative answer. This is something which was preserved from the upstream OpenSolaris code, but it's never been 100% clear to me that we actually need it. It's designed to serialize access to a given object id from the VFS, and the comments suggest this is important for ...
This is definitely a step in the right direction since it leverages the proper Linux infrastructure. However, it might take us a little while to run down all of these new potential deadlocks which were hidden before. We could consider applying PF_FSTRANS more broadly as a stopgap measure. This would be roughly equivalent to the previous KM_PUSHPAGE approach, which was broadly applied. One very easy way to do this would be to change the zsb->z_hold_mtx type to MUTEX_FSTRANS.
This is actually a fix which predates the wide use of KM_PUSHPAGE. What could happen is that a KM_SLEEP allocation could cause us to enter ...
@behlendorf Consider mutexes A and B, both of type MUTEX_FSTRANS: ...
I don't know if ZFS contains this kind of ordering, but I don't think it would be a good idea to have this kind of restriction on the mutex. Also, I don't think this could be solved transparently unless we have some sort of per-thread counter...
@behlendorf
My most recent deadlock is as follows:

Kernel threads:

```
kswapd0         D 0000000000000001     0   388      2 0x00000000
 ffff88041bc7b9f0 0000000000000046 ffff88001d6805f8 ffff88041bc7a010
 ffff88041d9e7920 0000000000010900 ffff88041bc7bfd8 0000000000004000
 ffff88041bc7bfd8 0000000000010900 ffff88040da73960 ffff88041d9e7920
Call Trace:
 [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs]
 [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs]
 [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs]
 [<ffffffff81311c28>] schedule+0x5f/0x61
 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb
 [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123
 [<ffffffff813109fa>] mutex_lock+0x1e/0x32
 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs]
 [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs]
 [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs]
 [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59
 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs]
 [<ffffffff810cd1b4>] evict+0xa2/0x155
 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a
 [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8
 [<ffffffff810bb67f>] prune_super+0xe3/0x158
 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c
 [<ffffffff8108bf56>] balance_pgdat+0x3a8/0x7ea
 [<ffffffff8108c64c>] kswapd+0x2b4/0x2e8
 [<ffffffff8103f2c0>] ? wake_up_bit+0x25/0x25
 [<ffffffff8108c398>] ? balance_pgdat+0x7ea/0x7ea
 [<ffffffff8103eecc>] kthread+0x84/0x8c
 [<ffffffff81314014>] kernel_thread_helper+0x4/0x10
 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d
 [<ffffffff81314010>] ? gs_change+0xb/0xb

khugepaged      D 0000000000000000     0   493      2 0x00000000
 ffff88041bd8b850 0000000000000046 ffff88041204da78 ffff88041bd8a010
 ffff88041da8b300 0000000000010900 ffff88041bd8bfd8 0000000000004000
 ffff88041bd8bfd8 0000000000010900 ffffffff81613020 ffff88041da8b300
Call Trace:
 [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs]
 [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs]
 [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs]
 [<ffffffff81311c28>] schedule+0x5f/0x61
 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb
 [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123
 [<ffffffff813109fa>] mutex_lock+0x1e/0x32
 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs]
 [<ffffffff810adce9>] ? unfreeze_partials+0x1eb/0x1eb
 [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs]
 [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs]
 [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59
 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs]
 [<ffffffff810cd1b4>] evict+0xa2/0x155
 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a
 [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8
 [<ffffffff810bb67f>] prune_super+0xe3/0x158
 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c
 [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4
 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a
 [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e
 [<ffffffff810b351c>] khugepaged+0x91/0x1051
 [<ffffffff8103f2c0>] ? wake_up_bit+0x25/0x25
 [<ffffffff810b348b>] ? collect_mm_slot+0xa0/0xa0
 [<ffffffff8103eecc>] kthread+0x84/0x8c
 [<ffffffff81314014>] kernel_thread_helper+0x4/0x10
 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d
 [<ffffffff81314010>] ? gs_change+0xb/0xb

arc_adapt       D 0000000000000000     0  1593      2 0x00000000
 ffff880413701b20 0000000000000046 ffff880413701b30 ffff880413700010
 ffff88041da09980 0000000000010900 ffff880413701fd8 0000000000004000
 ffff880413701fd8 0000000000010900 ffff88018ced1320 ffff88041da09980
Call Trace:
 [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs]
 [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs]
 [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs]
 [<ffffffff81311c28>] schedule+0x5f/0x61
 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb
 [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123
 [<ffffffff813109fa>] mutex_lock+0x1e/0x32
 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs]
 [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs]
 [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs]
 [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59
 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs]
 [<ffffffff810cd1b4>] evict+0xa2/0x155
 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a
 [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8
 [<ffffffff810bb67f>] prune_super+0xe3/0x158
 [<ffffffffa016fc93>] zfs_sb_prune+0x76/0x9b [zfs]
 [<ffffffffa018d119>] zpl_prune_sb+0x21/0x24 [zfs]
 [<ffffffffa00f0a37>] arc_adapt_thread+0x28b/0x497 [zfs]
 [<ffffffffa018d0f8>] ? zpl_inode_alloc+0x6b/0x6b [zfs]
 [<ffffffffa00f07ac>] ? arc_adjust+0x105/0x105 [zfs]
 [<ffffffffa00a4cd8>] ? __thread_create+0x122/0x122 [spl]
 [<ffffffffa00a4d44>] thread_generic_wrapper+0x6c/0x79 [spl]
 [<ffffffffa00a4cd8>] ? __thread_create+0x122/0x122 [spl]
 [<ffffffff8103eecc>] kthread+0x84/0x8c
 [<ffffffff81314014>] kernel_thread_helper+0x4/0x10
 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d
 [<ffffffff81314010>] ? gs_change+0xb/0xb
```

User proc(s):

```
usage.pl        D 0000000000000001     0  8726   8169 0x00000000
 ffff880206af11c8 0000000000000082 ffff880206af1128 ffff880206af0010
 ffff880365a31980 0000000000010900 ffff880206af1fd8 0000000000004000
 ffff880206af1fd8 0000000000010900 ffff88040d400000 ffff880365a31980
Call Trace:
 [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs]
 [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs]
 [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs]
 [<ffffffff812063a5>] ? scsi_dma_map+0x82/0x9d
 [<ffffffff81311c28>] schedule+0x5f/0x61
 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb
 [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123
 [<ffffffff813109fa>] mutex_lock+0x1e/0x32
 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs]
 [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs]
 [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs]
 [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59
 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs]
 [<ffffffff810cd1b4>] evict+0xa2/0x155
 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a
 [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8
 [<ffffffff810bb67f>] prune_super+0xe3/0x158
 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c
 [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4
 [<ffffffff810845c9>] ? get_page_from_freelist+0x477/0x4a8
 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a
 [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e
 [<ffffffffa014c861>] ? vdev_disk_io_start+0x13b/0x156 [zfs]
 [<ffffffff810af134>] new_slab+0x65/0x1ee
 [<ffffffff810af5c7>] __slab_alloc.clone.1+0x15c/0x239
 [<ffffffffa00a2c91>] ? spl_kmem_alloc+0xfc/0x17d [spl]
 [<ffffffffa00f68fa>] ? dbuf_read+0x632/0x832 [zfs]
 [<ffffffffa00a2c91>] ? spl_kmem_alloc+0xfc/0x17d [spl]
 [<ffffffff810af778>] __kmalloc+0xd4/0x127
 [<ffffffffa00a2c91>] spl_kmem_alloc+0xfc/0x17d [spl]
 [<ffffffffa010fb38>] dnode_hold_impl+0x24f/0x53e [zfs]
 [<ffffffffa010fe3b>] dnode_hold+0x14/0x16 [zfs]
 [<ffffffffa00ff716>] dmu_bonus_hold+0x24/0x288 [zfs]
 [<ffffffff813109ed>] ? mutex_lock+0x11/0x32
 [<ffffffffa012e343>] sa_buf_hold+0x9/0xb [zfs]
 [<ffffffffa017d296>] zfs_zget+0xc9/0x3b3 [zfs]
 [<ffffffffa0157481>] ? zap_unlockdir+0x8f/0xa3 [zfs]
 [<ffffffffa0161cc9>] zfs_dirent_lock+0x4c6/0x50b [zfs]
 [<ffffffffa016208b>] zfs_dirlook+0x21b/0x284 [zfs]
 [<ffffffffa015c15c>] ? zfs_zaccess+0x2b0/0x346 [zfs]
 [<ffffffffa017915d>] zfs_lookup+0x267/0x2ad [zfs]
 [<ffffffffa018c4a3>] zpl_lookup+0x5b/0xad [zfs]
 [<ffffffff810c1d38>] __lookup_hash+0xb9/0xdd
 [<ffffffff810c1fc5>] do_lookup+0x269/0x2a4
 [<ffffffff810c343a>] path_lookupat+0xe9/0x5c5
 [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0
 [<ffffffff810c3934>] do_path_lookup+0x1e/0x54
 [<ffffffff810c3ba2>] user_path_at_empty+0x50/0x96
 [<ffffffff813109ed>] ? mutex_lock+0x11/0x32
 [<ffffffffa01769f9>] ? zfs_getattr_fast+0x175/0x186 [zfs]
 [<ffffffff810c3bb7>] ? user_path_at_empty+0x65/0x96
 [<ffffffff810bc8c0>] ? cp_new_stat+0xf2/0x108
 [<ffffffff810c3bf4>] user_path_at+0xc/0xe
 [<ffffffff810bca50>] vfs_fstatat+0x3d/0x68
 [<ffffffff810bcb54>] vfs_stat+0x16/0x18
 [<ffffffff810bcb70>] sys_newstat+0x1a/0x38
 [<ffffffff81312f62>] system_call_fastpath+0x16/0x1b

usage.pl        D 0000000000000001     0  9251   8462 0x00000000
 ffff88010a7baca8 0000000000000082 ffff88010a7bab98 ffff88010a7ba010
 ffff88040da75940 0000000000010900 ffff88010a7bbfd8 0000000000004000
 ffff88010a7bbfd8 0000000000010900 ffff880365a33300 ffff88040da75940
Call Trace:
 [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs]
 [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs]
 [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs]
 [<ffffffff81311c28>] schedule+0x5f/0x61
 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb
 [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123
 [<ffffffff813109fa>] mutex_lock+0x1e/0x32
 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs]
 [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs]
 [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs]
 [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59
 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs]
 [<ffffffff810cd1b4>] evict+0xa2/0x155
 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a
 [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8
 [<ffffffff810bb67f>] prune_super+0xe3/0x158
 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c
 [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4
 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a
 [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e
 [<ffffffff810af134>] new_slab+0x65/0x1ee
 [<ffffffff810af5c7>] __slab_alloc.clone.1+0x15c/0x239
 [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl]
 [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl]
 [<ffffffff810af818>] kmem_cache_alloc+0x4d/0xa0
 [<ffffffffa00a36cf>] spl_kmem_cache_alloc+0xb1/0x7ef [spl]
 [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0
 [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl]
 [<ffffffffa0185ccb>] zio_create+0x50/0x2ce [zfs]
 [<ffffffffa0186d50>] zio_read+0x97/0x99 [zfs]
 [<ffffffffa00f183a>] ? arc_fini+0x6be/0x6be [zfs]
 [<ffffffffa00efb7f>] arc_read+0x8f5/0x93a [zfs]
 [<ffffffffa00a2b14>] ? spl_kmem_zalloc+0xf1/0x172 [spl]
 [<ffffffffa00f92d2>] dbuf_prefetch+0x2a7/0x2c9 [zfs]
 [<ffffffffa010c2c4>] dmu_zfetch_dofetch+0xe0/0x122 [zfs]
 [<ffffffffa010ca42>] dmu_zfetch+0x73c/0xfb9 [zfs]
 [<ffffffffa00f640c>] dbuf_read+0x144/0x832 [zfs]
 [<ffffffffa00a2b14>] ? spl_kmem_zalloc+0xf1/0x172 [spl]
 [<ffffffff810af3b6>] ? kfree+0x1b/0xad
 [<ffffffffa00a274c>] ? spl_kmem_free+0x35/0x37 [spl]
 [<ffffffffa010face>] dnode_hold_impl+0x1e5/0x53e [zfs]
 [<ffffffffa010fe3b>] dnode_hold+0x14/0x16 [zfs]
 [<ffffffffa00ff716>] dmu_bonus_hold+0x24/0x288 [zfs]
 [<ffffffff813109ed>] ? mutex_lock+0x11/0x32
 [<ffffffffa012e343>] sa_buf_hold+0x9/0xb [zfs]
 [<ffffffffa017d296>] zfs_zget+0xc9/0x3b3 [zfs]
 [<ffffffffa0157481>] ? zap_unlockdir+0x8f/0xa3 [zfs]
 [<ffffffffa0161cc9>] zfs_dirent_lock+0x4c6/0x50b [zfs]
 [<ffffffffa016208b>] zfs_dirlook+0x21b/0x284 [zfs]
 [<ffffffffa015c15c>] ? zfs_zaccess+0x2b0/0x346 [zfs]
 [<ffffffffa017915d>] zfs_lookup+0x267/0x2ad [zfs]
 [<ffffffffa018c4a3>] zpl_lookup+0x5b/0xad [zfs]
 [<ffffffff810c1d38>] __lookup_hash+0xb9/0xdd
 [<ffffffff810c1fc5>] do_lookup+0x269/0x2a4
 [<ffffffff810c343a>] path_lookupat+0xe9/0x5c5
 [<ffffffff810c23b5>] ? getname_flags+0x2b/0x1bc
 [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0
 [<ffffffff810c3934>] do_path_lookup+0x1e/0x54
 [<ffffffff810c3ba2>] user_path_at_empty+0x50/0x96
 [<ffffffff813109ed>] ? mutex_lock+0x11/0x32
 [<ffffffffa01769f9>] ? zfs_getattr_fast+0x175/0x186 [zfs]
 [<ffffffff810bc8c0>] ? cp_new_stat+0xf2/0x108
 [<ffffffff810c3bf4>] user_path_at+0xc/0xe
 [<ffffffff810bca50>] vfs_fstatat+0x3d/0x68
 [<ffffffff810bcb54>] vfs_stat+0x16/0x18
 [<ffffffff810bcb70>] sys_newstat+0x1a/0x38
 [<ffffffff81312f62>] system_call_fastpath+0x16/0x1b
```

Kernel version 3.2.86
SPL: Loaded module v0.6.3-77_g79a0056
ZFS: Loaded module v0.6.3-265_gb2e5a32, ZFS pool version 5000, ZFS filesystem version 5

```
  pool: whoopass1
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not
        support the features. See zpool-features(5) for details.
  scan: scrub canceled on Mon Mar 23 09:37:24 2015
config:

        NAME         STATE     READ WRITE CKSUM
        whoopass1    ONLINE       0     0     0
          sdc        ONLINE       0     0     0
        cache
          SSD-l2arc  ONLINE       0     0     0

errors: No known data errors
```

Events in progress at the time of the hang are primarily a boatload of ...
Ugh... posted this in the wrong place a little while ago. I think I've tracked down another deadlock. This one involves zf_rwlock, acquired in dmu_zfetch(), and z_hold_mtx, acquired in zfs_zget(). One process has this stack: ...
and the other is like this (it's a long one): ...
Deadlock...
I had been zeroing in on dmu_zfetch and friends earlier in my work on this.
@dweeezil could you open a pull request with the fixes you've already put together for the deadlocks. I won't merge anything until we're all agreed they're the right fixes, but it would be nice to have them all in one place.
@behlendorf I'll do that later this evening. I was able to run my stress test with my latest patch for a while this afternoon with no hangs.
@kpande Thanks. I'll follow-up later today. I'm going to post my WIP branch as a pull request shortly so it can get some more attention and will add a couple more comments there. |
I'm going to be moving all my future updates to #3225. I'm still able to create deadlocks, but have to try much much harder now. Follow-up in the referenced pull request. |
The vast majority of these deadlocks have been resolved by the patch in #3225. We'll track any remaining problems in individual issues.
As outlined in the kmem-rework merge, and in follow-up commits such as 4ec15b8, there are (and may continue to be) additional code paths within ZFS which require inhibiting filesystem re-entry during reclaim.

I've identified several conditions under which a system can be deadlocked rather easily. The easiest instance to reproduce involves the interaction between zfs_zinactive() and zfs_zget() due to the manner in which the z_hold_mtx[] mutex array is used. zfs_zinactive() has the following code to try to prevent deadlocks: ...

I've produced a patch which flattens the z_hold_mtx[] array from a 256-element hash to a series of 4 consecutive mutexes (also used in a hash-like manner) in order to get better diagnostics from Linux's lock debugging facilities. Here's an example from one of this class of deadlocks: the first process is holding element 0 while trying to acquire element 2, and the second is holding element 2 while trying to acquire element 0. If the stock ZFS code were being used, the locks would simply show up as "z_hold_mtx[i]". Here's the interesting part of the stack trace from the first process: ...

The stack trace from the second process is almost identical so I'll not post it. This class of deadlock appears to be caused by the hash table use of the mutex array: there are only 256 mutexes for an arbitrary number of objects.
The most obvious solutions to fix this deadlock are to either use spl_fstrans_mark() in zfs_zget() or to use the new MUTEX_FSTRANS flag on the mutexes. Unfortunately, neither works to completely eliminate this class of deadlocks. Adding the fstrans flag to zfs_zget() allows the same deadlock to occur via a different path; the common path becomes sa_buf_hold(): ...

Again, the second stack is pretty much identical.
Moving the whole array of mutexes under the FSTRANS lock mode still allows a deadlock to occur, but I've not yet identified the paths involved.
The reproducing scripts are available in https://github.com/dweeezil/stress-tests. They're trivial scripts which perform a bunch of simple operations in parallel. Given a fully populated test tree, I've generally only needed to run the create.sh script in order to produce a deadlock. With the cut-down pseudo mutex array under MUTEX_FSTRANS, I generally also need to run the remove-random.sh script concurrently in order to produce a deadlock.

One other note: I first encountered these deadlocks on a system with either 4 or 8GiB of RAM (I can't remember which). I've been using mem=4g on my kernel boot line to reproduce this. I'll also note that my test system has no swap space. My goal was to exercise the system with highly parallel operations and with lots of reclaim happening.

These scripts are trivially simple, and with stock code on a system in which reclaim is necessary they will generally cause a deadlock in less than a minute, so I do consider this to be a Big Problem.