ZFS locks system under memory pressure #266
Comments
Interesting stack! Are you able to consistently reproduce this, or has it been a one-time event? On the surface this looks like a pretty unlikely (but clearly possible) deadlock.
I was able to obtain similar behavior again under the same conditions (processes freezing, ssh sessions unresponsive for >60s, kernel traces...), but the exact messages in syslog were different. It's a little longer this time, so here's an edited syslog:

Jun 8 15:37:04 atlantis kernel: [76792.638451] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9220082879129529129

Then there are messages that seem to relate to other processes that were dying:

Jun 8 15:42:29 atlantis kernel: [77066.595577] oom_kill_process: 27 callbacks suppressed
Jun 8 15:42:29 atlantis kernel: [77066.610183] active_anon:3702813 inactive_anon:640 isolated_anon:0

Then several more large blocks following the same pattern with other processes, then:

Jun 8 15:42:29 atlantis kernel: [77067.341948] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9162052222399601381

That's all before I forced a reboot.
The second issue here, the negative objects, is issue #218, which was fixed after 0.6.0-rc4. If you pull the latest master source you shouldn't be able to recreate that issue. There were also a couple of deadlocks fixed, but none that looked like your first reported issue.
The problem here is that prune_icache() tries to evict/delete both the xattr directory inode as well as at least one xattr inode contained in that directory. Here's what happens:

1. A file is created.
2. An xattr is created for that file (behind the scenes an xattr directory and a file in that xattr directory are created).
3. The file is deleted.
4. Both the xattr directory inode and at least one xattr inode from that directory are evicted by prune_icache(); prune_icache() acquires a lock on both inodes before it calls ->evict() on them.

When the xattr directory inode is evicted, zfs_zinactive() attempts to delete the xattr files contained in that directory. While enumerating these files zfs_zget() is called to obtain a reference to the xattr file znode, which tries to lock the xattr inode. However, that very same xattr inode was already locked by prune_icache() further up the call stack, thus leading to a deadlock. This can be reliably reproduced like this:
What this means is that we must not call zfs_zget() from zfs_zinactive(). One way of doing that would be to defer the zfs_purgedir() call until after the inodes are evicted. Also, #273 is most likely a duplicate of this bug.
Gunnar Beutner nicely summarized the root cause of this issue in a comment for Github issue #266:

The problem here is that prune_icache() tries to evict/delete both the xattr directory inode as well as at least one xattr inode contained in that directory. Here's what happens:

1. File is created.
2. xattr is created for that file (behind the scenes a xattr directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr inode from that directory are evicted by prune_icache(); prune_icache() acquires a lock on both inodes before it calls ->evict() on the inodes

When the xattr directory inode is evicted zfs_zinactive attempts to delete the xattr files contained in that directory. While enumerating these files zfs_zget() is called to obtain a reference to the xattr file znode - which tries to lock the xattr inode. However that very same xattr inode was already locked by prune_icache() further up the call stack, thus leading to a deadlock. This can be reliably reproduced like this:

---

This patch proposes to fix the issue by deferring the call to zfs_rmnode() until after the inodes are evicted. This is accomplished by dispatching the zfs_rmnode() call to the existing zfs_iput_taskq. There is no requirement that it must complete before zfs_zinactive() returns. The inode is already unlinked from the namespace and inactive, so this should be safe.

Issue #266
Nice job! I agree, that's exactly the problem. I was able to easily recreate the issue with your test case.

As for the fix, we're definitely going to have to move the zfs_zget() call out of the zfs_inactive() call path to avoid the deadlock. However, it doesn't look like deferring zfs_purgedir() goes far enough; we might hit a similar deadlock farther down in zfs_rmnode() where zfs_zget() is called on the xattr_obj directory. But from my reading of the code it should be safe to defer all of the zfs_rmnode() call. The only caller of zfs_rmnode() is zfs_zinactive(), and it is only done in the unlink case. At this point everything is already unlinked from the namespace and the only remaining references on the inodes are held by ZFS. Additionally, all this code properly adds the unlinked objects to the unlinked set, so they should be correctly handled in the case of a system crash.

Gunnar, what do you think about this proposed fix: 6e34ce7. With this change in place I'm unable to recreate the issue, nor have I seen any unexpected side effects.
After applying the patch I'm seeing the following crash:

[ 104.893196] SPLError: 5275:0:(zfs_znode.c:237:zfs_znode_dmu_fini()) ASSERTION(MUTEX_HELD(ZFS_OBJ_MUTEX(ZTOZSB(zp), zp->z_id)) || zp->z_unlinked || RW_WRITE_HELD(&ZTOZSB(zp)->z_teardown_inactive_lock)) failed

By the time zfs_rmnode() is actually called, the znode_t* we pass to it might've actually been freed by the prune_icache() thread. Instead we would probably have to pass the zsb + obj_num (possibly along with the generation number just to be absolutely sure we're deleting the right inode, although this should be safe because no other process can obtain a reference to the deleted xattr dir) and call zfs_zget() to get a new reference to the xattr dir.
It's never that easy, is it? Deferring this safely looks like it's going to be problematic. What if we try a different approach? The ilookup() in zfs_zget() above is really just blocked on the I_FREEING flag in the inode. If we could somehow safely detect this, we could just skip this object id and avoid the deadlock entirely. The zfs_purgedir() code is already set up to leave the xattr dir in the unlinked set if entries are skipped (because of a lack of free space), so we could leverage that existing machinery. How exactly this could be done cleanly isn't obvious to me, however. I'm open to all good ideas!
There's one other invariant here we need to be careful of. The passed znode/inode must be freed as part of evict(). This cannot be deferred, because after evict() returns the znode/inode may be immediately reused. Other znodes/inodes, such as the xattr directory or the xattrs themselves, can be safely deferred at least until prune_icache() explicitly evicts them.
Here's my latest idea: 4322a5782e384d008e8c64bbbfb478c143d4c652

I've moved the zfs_purgedir() call to zfs_unlinked_drain(). Instead, zfs_rmnode() now checks whether the xattr dir is empty and leaves the xattr dir in the unlinked set if it finds any xattrs.

There's at least one possible locking problem with the new code which I still need to think about: the ZFS superblock I pass to the taskq is not locked/ref-counted. If someone manages to unmount the filesystem before the taskq is scheduled, zfs_unlinked_drain() might use a stale zsb struct - assuming it gets scheduled before the taskq is destroyed.

Also: it would probably be a good idea to have the zfs_unlinked_drain thread wait for a little while before actually removing znodes from the unlinked set, in case someone is simultaneously removing lots of files with xattrs. This might save us some unnecessary context switches.
That seems reasonable and correct to me. I took your patch one step farther and handled the super block reference problem by adding a call to taskq_wait() in zfsvfs_teardown(). This will be called as part of .put_super(), which allows us to block until all of the work items we added to the taskq have been handled. Without this, the feared reference problem was easy to hit as follows:

touch /tank/test
attr -s a -V b /tank/test
rm /tank/test
zfs umount tank

If this looks good to you I'll merge it in to master. See commit c3ff7ef8c0efd66be465ccdee6fa19ae215724d7
Reopening... for now. The commit comment closed it even though it wasn't committed to master. I'll need to keep that in mind in the future.
Actually, see commit b00131d. It's the same as the previous commit; I just updated the commit message.
Gunnar, I'd like to merge commit b00131d into master. So far it appears to be working well, but if you have any remaining concerns please let us know. If not, I'll try and merge it by the end of Tuesday PST.
Should be fine as far as I can tell. All the other possible locking issues I can think of are nicely taken care of by zfs_zget() returning ENOENT when another thread is already holding a reference to (partially) deleted inodes.
Under memory pressure, with no swap configured (24G physical RAM):
Jun 7 14:54:50 atlantis kernel: [838087.621465] INFO: task kswapd0:94 blocked for more than 120 seconds.
Jun 7 14:54:50 atlantis kernel: [838087.621482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 7 14:54:50 atlantis kernel: [838087.621488] kswapd0 D 0000000000000001 0 94 2 0x00000000
Jun 7 14:54:50 atlantis kernel: [838087.621490] ffff8803242dd7b0 0000000000000046 ffff8803242ddfd8 ffff8803242dc000
Jun 7 14:54:50 atlantis kernel: [838087.621493] 0000000000013d00 ffff8803242e03b8 ffff8803242ddfd8 0000000000013d00
Jun 7 14:54:50 atlantis kernel: [838087.621495] ffff8803270a5b80 ffff8803242e0000 ffff8803242dd7c0 ffff88063fff87b8
Jun 7 14:54:50 atlantis kernel: [838087.621497] Call Trace:
Jun 7 14:54:50 atlantis kernel: [838087.621504] [] __wait_on_freeing_inode+0x9d/0xd0
Jun 7 14:54:50 atlantis kernel: [838087.621508] [] ? wake_bit_function+0x0/0x50
Jun 7 14:54:50 atlantis kernel: [838087.621511] [] ? default_spin_lock_flags+0x9/0x10
Jun 7 14:54:50 atlantis kernel: [838087.621513] [] find_inode_fast+0x2b/0x90
Jun 7 14:54:50 atlantis kernel: [838087.621514] [] ifind_fast+0x3c/0xa0
Jun 7 14:54:50 atlantis kernel: [838087.621516] [] ilookup+0x52/0x60
Jun 7 14:54:50 atlantis kernel: [838087.621540] [] zfs_zget+0xbb/0x200 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621554] [] ? txg_list_add+0x5d/0x70 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621567] [] zfs_purgedir+0xc4/0x220 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621570] [] ? __mod_zone_page_state+0x49/0x50
Jun 7 14:54:50 atlantis kernel: [838087.621572] [] ? __free_pages+0x2d/0x40
Jun 7 14:54:50 atlantis kernel: [838087.621575] [] ? __free_slab+0xd5/0x190
Jun 7 14:54:50 atlantis kernel: [838087.621577] [] ? flush_tlb_page+0x48/0xb0
Jun 7 14:54:50 atlantis kernel: [838087.621580] [] ? cpumask_next_and+0x36/0x50
Jun 7 14:54:50 atlantis kernel: [838087.621588] [] ? dbuf_rele_and_unlock+0x1a9/0x260 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621596] [] ? dmu_buf_rele+0x30/0x40 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621606] [] ? dnode_rele+0x80/0x90 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621620] [] zfs_rmnode+0x202/0x310 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621634] [] zfs_zinactive+0xa4/0x120 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621647] [] zfs_inactive+0x5e/0x1b0 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621661] [] zpl_evict_inode+0x2f/0x40 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621663] [] evict+0x24/0xc0
Jun 7 14:54:50 atlantis kernel: [838087.621664] [] dispose_list+0x59/0xf0
Jun 7 14:54:50 atlantis kernel: [838087.621666] [] prune_icache+0x150/0x2f0
Jun 7 14:54:50 atlantis kernel: [838087.621667] [] ? prune_dcache+0x159/0x190
Jun 7 14:54:50 atlantis kernel: [838087.621669] [] shrink_icache_memory+0x55/0x60
Jun 7 14:54:50 atlantis kernel: [838087.621673] [] shrink_slab+0x11c/0x180
Jun 7 14:54:50 atlantis kernel: [838087.621674] [] balance_pgdat+0x2d6/0x6d0
Jun 7 14:54:50 atlantis kernel: [838087.621676] [] kswapd+0x143/0x1b0
Jun 7 14:54:50 atlantis kernel: [838087.621678] [] ? kswapd+0x0/0x1b0
Jun 7 14:54:50 atlantis kernel: [838087.621679] [] kthread+0x96/0xa0
Jun 7 14:54:50 atlantis kernel: [838087.621683] [] kernel_thread_helper+0x4/0x10
Jun 7 14:54:50 atlantis kernel: [838087.621684] [] ? kthread+0x0/0xa0
Jun 7 14:54:50 atlantis kernel: [838087.621686] [] ? kernel_thread_helper+0x0/0x10
Jun 7 14:54:50 atlantis kernel: [838087.621687] INFO: task kswapd1:95 blocked for more than 120 seconds.
...