kernel BUG at /build/buildd/linux-2.6.35/fs/inode.c:1325! #180
I've tried to reproduce this problem using the following patch to disable the kernel's inode/dentry cache - which makes triggering this bug quite a bit easier (usually happens within 1-2 seconds after starting a couple of instances of 'find'):
Here's the debug output I get with my patch:
Looks like zfs_zget tried to igrab() an inode that has been cleaned up shortly before. The second problem does indeed seem to be related, here's a stack trace I got from another test run:
And the matching stack trace where that zp->z_sa_hdl was cleared:
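Schematically, the race looks like this (a hypothetical sketch, not code from the traces above): igrab() refuses to hand out a reference once eviction has begun, and that is exactly the window zfs_zget() hits.

```c
/*
 * Hypothetical sketch of the race: one thread is evicting the inode
 * while another reaches it through the unreferenced znode pointer.
 */
struct inode *ip = ZTOI(zp);	/* inode embedded in the znode */

if (igrab(ip) == NULL) {
	/*
	 * Eviction has started: i_count is zero and I_FREEING (or
	 * I_WILL_FREE) is set, so igrab() returns NULL.  Any further
	 * use of ip is a use-after-clear and ends in the BUG_ON() in
	 * fs/inode.c seen above.
	 */
} else {
	/* A reference is held; eviction is excluded until iput(). */
	iput(ip);
}
```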
I can reliably reproduce this bug without any patches to the Linux kernel. This is with the shrinker branch, or the latest master. It does not matter, the bug is the same. http://pastebin.com/GWytqvE7 is the stack trace for the four consecutive oopses I got. After the first oops, memory use continues to balloon until it gets to about 15K free, and then the machine hard locks. Goddamn, I can't back up this machine because of this stupid bug, and my SSD is starting to fail. I do not know what I am going to do, but in a few days this will be the first time I will ever have experienced data loss as a direct consequence of using ZFS.
Additional information: /sys/module/zfs/…
I tried setting zfs_arc_max to 256 MB. ZFS simply does not respect it. The machine starts with 1.5 GB free, and during an rsync it just eats up all that memory, then freezes. What is the point of these knobs if ZFS does not care to respect them at all?
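For reference, the limit in question is the zfs module parameter zfs_arc_max, normally set in bytes before the module loads:

```
# /etc/modprobe.d/zfs.conf  (268435456 bytes = 256 MB)
options zfs zfs_arc_max=268435456
```

So the complaint here is about enforcement, not about how the value was set.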
As far as I can tell sa_set_userp / sa_get_userdata are hiding an unreferenced pointer to the inode/znode_t. On Solaris zfs_zinactive 'resurrects' a vnode_t when the code detects that some other thread has grabbed a reference to the vnode while it was trying to clear it (http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zfs_znode.c#1297) - this can't be done so easily on Linux as zfs_zinactive gets called by iput_final - at which point we can't just bump up the ref count again.
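Schematically (a hypothetical sketch, not the actual source), the unsafe pattern looks like this:

```c
/*
 * Hypothetical sketch: the SA handle's user data is a bare znode
 * pointer, with no inode reference behind it.
 */
static znode_t *
znode_from_sa(sa_handle_t *hdl)
{
	znode_t *zp = sa_get_userdata(hdl);	/* no hold is taken here */

	/*
	 * If the VFS is concurrently evicting ZTOI(zp), zp is already
	 * stale.  Solaris can resurrect the vnode from zfs_zinactive();
	 * on Linux zfs_zinactive() runs from iput_final(), after i_count
	 * has reached zero, so the count must not be bumped back up.
	 */
	return (zp);
}
```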
Thanks for digging into this, Gunnarbeutner. The debugging output from your patch looks like it's a great start on getting this resolved. I see what you're saying about having the vnode detect this race and be resurrected. Once I get a little time (next week) I'll take a serious look at what's going on here and what can be done. Rudd-O, these tunings will be respected eventually, but until this bug is fixed, enforcing the limit just makes this issue more likely. The metadata branch has had a patch to enforce the metadata limit for some time. As for the general zfs_arc_max limit, it is being enforced, but due to slab fragmentation the ARC may believe it only has 1GiB allocated when really 2GiB is allocated to the slabs. This is being worked on, but it all takes time.
After applying https://github.com/gunnarbeutner/pkg-zfs/commit/1b09ecd2089d7c94ece0f3fb1743a88394f348f9 I can't reproduce this problem anymore - even when running several dozen "find"s at once. (Edit: Just realized there was a missing iput() in line 880.)
I can confirm gunnarbeutner's commit, cherry-picked and applied on top of Brian's work, resolved the problem for me. I am rsyncing and running find; the only problem now is correct memory reporting (cached ZFS data appears as used memory, not as cached, and kswapd works overtime finding new pages as a result). But the system is now STABLE.
I spoke too soon. The kernel BUG is fixed, for realz. HOWEVER, the increasing memory use still locks up the machine. Apparently it won't lock the machine up if I just leave it running; it will just hover around ~20K free. But if I start an app during those dire, critically low memory situations, the machine will just lock up.

Apr 14 15:07:29 karen kernel: z_wr_int/5: page allocation failure. order:0, mode:0x20
Rudd-O: Make sure you're using 1b09ecd2089d7c94ece0f3fb1743a88394f348f9 rather than ab84ad20ad8f726a056acc63f1a457a31b482aae - there's a missing iput() in the "old" patch which leads to leaked inode references.
And I've removed another superfluous iput() (in the if (ip == NULL) branch; this wouldn't cause any crashes though): a540221fd9d55fbb4f7f8b52b2106d107d237b58
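The reference discipline both fixes converge on, as an illustrative fragment (the helper and its error handling are hypothetical, not the committed code):

```c
/*
 * Illustrative fragment: every successful igrab() must be paired with
 * exactly one iput(), error paths included.
 */
static int
zget_hold_example(znode_t *zp, int error, struct inode **ipp)
{
	struct inode *ip = igrab(ZTOI(zp));

	if (ip == NULL)
		return (ENOENT);	/* no hold taken, nothing to iput() */

	if (error != 0) {
		iput(ip);	/* the older ab84ad2 patch missed this and
				 * leaked a reference on the error path */
		return (error);
	}

	*ipp = ip;	/* success: the caller inherits the hold */
	return (0);
}
```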
Gunnar: cherry-picked a54. Going to test real soon. |
I cannot reproduce the "kernel BUG at ... inode.c" with the a540221 patch. But during the tests ZFS ate a lot of memory. I tried to turn swap off mid-run, got several "SLUB can not allocate memory" errors, and turned swap back on later. After some time it seemed to become deadlocked, because it stopped eating processor. And when I tried to ^C the running ls, du and rsync, I got this in the log (part of it):

```
Apr 21 12:32:31 plum-old kernel: arc_reclaim D ffff880024709d10 0 2274 2 0x00000000
Apr 21 12:32:31 plum-old kernel: ffff880024709cf0 0000000000000046 0000000000000000 ffff880024709c10
Apr 21 12:32:31 plum-old kernel: 0000000000000000 0000000000000001 ffff88003f493180 0000000000013180
Apr 21 12:32:31 plum-old kernel: 0000000000013180 ffff88003ca56d00 0000000000013180 ffff880024709fd8
Apr 21 12:32:31 plum-old kernel: Call Trace:
Apr 21 12:32:31 plum-old kernel: [] ? lock_timer_base.clone.23+0x36/0x70
Apr 21 12:32:31 plum-old kernel: [] __mutex_lock_slowpath+0x10a/0x290
Apr 21 12:32:31 plum-old kernel: [] ? thread_generic_wrapper+0x0/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] mutex_lock+0x11/0x30
Apr 21 12:32:31 plum-old kernel: [] arc_evict+0x67/0x600 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? spl_slab_reclaim+0x3e/0x280 [spl]
Apr 21 12:32:31 plum-old kernel: [] ? thread_generic_wrapper+0x0/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] arc_adjust+0x19e/0x1e0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] arc_reclaim_thread+0x72/0x130 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? arc_reclaim_thread+0x0/0x130 [zfs]
Apr 21 12:32:31 plum-old kernel: [] thread_generic_wrapper+0x73/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] kthread+0x96/0xa0
Apr 21 12:32:31 plum-old kernel: [] kernel_thread_helper+0x4/0x10
Apr 21 12:32:31 plum-old kernel: [] ? kthread+0x0/0xa0
Apr 21 12:32:31 plum-old kernel: [] ? kernel_thread_helper+0x0/0x10
Apr 21 12:32:31 plum-old kernel: txg_quiesce D 0000000107006dac 0 2584 2 0x00000000
Apr 21 12:32:31 plum-old kernel: ffff88000c731d90 0000000000000046 0000000000000003 ffff88000c731cb0
Apr 21 12:32:31 plum-old kernel: 0000000000000000 0000000000000000 0000000000013180 0000000000013180
Apr 21 12:32:31 plum-old kernel: 0000000000013180 ffff8800203e4420 0000000000013180 ffff88000c731fd8
Apr 21 12:32:31 plum-old kernel: Call Trace:
Apr 21 12:32:31 plum-old kernel: [] ? check_preempt_curr+0x84/0xa0
Apr 21 12:32:31 plum-old kernel: [] ? try_to_wake_up+0x1a8/0x300
Apr 21 12:32:31 plum-old kernel: [] ? __mutex_lock_slowpath+0x1e7/0x290
Apr 21 12:32:31 plum-old kernel: [] cv_wait_common+0x7b/0xe0 [spl]
Apr 21 12:32:31 plum-old kernel: [] ? autoremove_wake_function+0x0/0x40
Apr 21 12:32:31 plum-old kernel: [] __cv_wait+0xe/0x10 [spl]
Apr 21 12:32:31 plum-old kernel: [] txg_quiesce_thread+0x1eb/0x2b0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? txg_quiesce_thread+0x0/0x2b0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? thread_generic_wrapper+0x0/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] thread_generic_wrapper+0x73/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] kthread+0x96/0xa0
Apr 21 12:32:31 plum-old kernel: [] kernel_thread_helper+0x4/0x10
Apr 21 12:32:31 plum-old kernel: [] ? kthread+0x0/0xa0
Apr 21 12:32:31 plum-old kernel: [] ? kernel_thread_helper+0x0/0x10
Apr 21 12:32:31 plum-old kernel: cp D ffff880030803898 0 4129 3703 0x00000004
Apr 21 12:32:31 plum-old kernel: ffff880030803878 0000000000000082 ffff88003b6b9000 ffff880030803798
Apr 21 12:32:31 plum-old kernel: ffff88003b8a0000 0000000000000246 ffff8800308037a8 0000000000013180
Apr 21 12:32:31 plum-old kernel: 0000000000013180 ffff88003bb63680 0000000000013180 ffff880030803fd8
Apr 21 12:32:31 plum-old kernel: Call Trace:
Apr 21 12:32:31 plum-old kernel: [] ? cpumask_next_and+0x36/0x50
Apr 21 12:32:31 plum-old kernel: [] ? select_idle_sibling+0x95/0x160
Apr 21 12:32:31 plum-old kernel: [] ? native_sched_clock+0x15/0x70
Apr 21 12:32:31 plum-old kernel: [] __mutex_lock_slowpath+0x10a/0x290
Apr 21 12:32:31 plum-old kernel: [] ? zio_data_buf_free+0x0/0x30 [zfs]
Apr 21 12:32:31 plum-old kernel: [] mutex_lock+0x11/0x30
Apr 21 12:32:31 plum-old kernel: [] arc_change_state.clone.5+0x2a0/0x2f0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] arc_evict+0x50f/0x600 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? default_wake_function+0xd/0x10
Apr 21 12:32:31 plum-old kernel: [] arc_get_data_buf.clone.13+0x236/0x460 [zfs]
Apr 21 12:32:31 plum-old kernel: [] arc_buf_alloc+0xb8/0xf0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] dbuf_read+0x3cd/0x840 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? zio_destroy+0xa9/0xe0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? dsl_pool_sync_context+0x23/0x30 [zfs]
Apr 21 12:32:31 plum-old kernel: [] dmu_buf_will_dirty+0x71/0xd0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] dmu_write_uio_dnode+0x70/0x150 [zfs]
Apr 21 12:32:31 plum-old kernel: [] dmu_write_uio_dbuf+0x44/0x60 [zfs]
Apr 21 12:32:31 plum-old kernel: [] zfs_write+0xbb8/0xbe0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? __lock_page_killable+0x62/0x70
Apr 21 12:32:31 plum-old kernel: [] ? do_sync_read+0xd2/0x110
Apr 21 12:32:31 plum-old kernel: [] zpl_write_common+0x4d/0x80 [zfs]
Apr 21 12:32:31 plum-old kernel: [] zpl_write+0x63/0x90 [zfs]
Apr 21 12:32:31 plum-old kernel: [] vfs_write+0xc6/0x190
Apr 21 12:32:31 plum-old kernel: [] sys_write+0x4c/0x80
Apr 21 12:32:31 plum-old kernel: [] system_call_fastpath+0x16/0x1b
```
Nice fix Gunnar, a540221fd9d55fbb4f7f8b52b2106d107d237b58! This looks very good; after some careful review and testing I've pulled it into the master branch. This should help a lot, thanks for taking the time to fix this.
#180 occurred because of a race between inode eviction and zfs_zget(). 36df284 tried to address it by making a call to the VFS to learn whether an inode is being evicted. If it was being evicted the operation was retried after dropping and reacquiring the relevant resources. Unfortunately, this introduced another deadlock.

```
INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
Tainted: P O 3.13.6 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u24:6 D ffff88107fcd2e80 0 891 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
Call Trace:
[<ffffffff813dd1d4>] schedule+0x24/0x70
[<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
[<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
[<ffffffff811608c5>] ilookup+0x65/0xd0
[<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
[<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
[<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
[<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
[<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
[<ffffffff811071ec>] do_writepages+0x1c/0x40
[<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
[<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
[<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
[<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
[<ffffffff8106072c>] process_one_work+0x16c/0x490
[<ffffffff810613f3>] worker_thread+0x113/0x390
[<ffffffff81066edf>] kthread+0xdf/0x100
```

This patch implements the original fix in a slightly different manner in order to avoid both deadlocks. Instead of relying on a call to ilookup() which can block in __wait_on_freeing_inode() the return value from igrab() is used. This gives us the information that ilookup() provided without the risk of a deadlock.

Alternately, this race could be closed by registering an sops->drop_inode() callback. The callback would need to detect the active SA hold thereby informing the VFS that this inode should not be evicted.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #180
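The core of the merged change, roughly (a sketch of the approach; the locking and cleanup calls are approximations, not the exact committed code):

```c
/* Sketch of the igrab()-based retry inside zfs_zget(). */
again:
	/* ... locate znode zp for the requested object via its SA handle ... */

	if (igrab(ZTOI(zp)) == NULL) {
		/*
		 * The inode is mid-eviction.  Unlike ilookup(), igrab()
		 * returns immediately instead of sleeping in
		 * __wait_on_freeing_inode(), so we can drop our locks,
		 * let eviction finish, and retry the lookup.
		 */
		mutex_exit(&zp->z_lock);
		sa_buf_rele(db, NULL);
		goto again;
	}

	*zpp = zp;	/* success: the caller now owns an inode hold */
```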
upgrade issue from 0.7 to 0.8 due to strict check on degrade_io_seq key (openzfs#180)
Signed-off-by: Vishnu Itta <[email protected]>
The `active` column in `zpool iostat -oq` is usually 0, even when there are actually many active operations. The problem is that the code displaying the counts treats them as cumulative counters and prints the difference from the previous sample; `active` is an instantaneous gauge, so the difference between two consecutive snapshots is almost always ~0.
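The distinction, as an illustrative fragment (the struct and field names are hypothetical, not the actual iostat code):

```c
#include <stdint.h>

/* Hypothetical snapshot of queue statistics at one sample point. */
struct q_stats {
	uint64_t total_reads;	/* cumulative counter */
	uint64_t active;	/* instantaneous gauge */
};

static void
report(const struct q_stats *prev, const struct q_stats *cur)
{
	/* Counters accumulate, so the per-interval value is a difference. */
	uint64_t reads_delta = cur->total_reads - prev->total_reads;

	/*
	 * Gauges are instantaneous: printing cur->active - prev->active
	 * (as the buggy code effectively does) yields ~0, because two
	 * close-together snapshots of "active" are nearly equal.
	 */
	uint64_t active_now = cur->active;

	(void)reads_delta;
	(void)active_now;
}
```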
Observed on the shrinker branch due to increased pressure on metadata. I believe this bug exists in previous versions of the code but occurred more infrequently. It can be reproduced fairly easily on low-memory systems (2GiB) by locally rsync'ing two large directory trees.
It is caused by a missing inode reference somewhere. The kernel bug is indicating that we are trying to free an inode which has already been cleared.
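For context, the assertion at fs/inode.c:1325 in 2.6.35 is, as far as I can tell, the I_CLEAR check at the top of iput(); this is quoted from memory of that kernel, so treat the exact line as approximate:

```c
void iput(struct inode *inode)
{
	if (inode) {
		BUG_ON(inode->i_state & I_CLEAR);	/* the BUG being hit */

		if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
			iput_final(inode);
	}
}
```

That is, something is calling iput() on an inode whose clear_inode() has already completed.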
This second stack trace may be related. It shows an inode being released mid-directory-lookup. Once again, this could happen if shrink_inode_cache() was running concurrently and cleared this in-use inode because it didn't have the needed reference.