kernel BUG at /build/buildd/linux-2.6.35/fs/inode.c:1325! #180

Closed

behlendorf opened this issue Mar 31, 2011 · 14 comments

@behlendorf
Contributor

Observed on the shrinker branch due to increased pressure on metadata. I believe this bug exists in previous versions of the code but occurred less frequently. It can be reproduced fairly easily on low-memory systems (2 GiB) by locally rsync'ing two large directory trees:

rsync -a /zagut/usr1/ /zagut/usr2

It is caused by a missing inode reference somewhere. The kernel BUG indicates that we are trying to free an inode which has already been cleared.

kernel BUG at /build/buildd/linux-2.6.35/fs/inode.c:1325!
Pid: 1485, comm: arc_reclaim Tainted: P            2.6.35-28-generic #49-Ubuntu X7SPA-H/X7SPA-H
RIP: 0010:[]  [] iput+0x69/0x70
Process arc_reclaim (pid: 1485, threadinfo ffff88007a246000, task ffff88007c9544a0)
Stack:
 ffff88007a247cc0 ffff8800585613c0 ffff88007a247ce0 ffffffff81166700
<0> ffff8800585613c0 ffff880069c6f000 ffff88007a247d00 ffffffff81166883
<0> ffff8800585613c0 ffff880069c6f000 ffff88007a247d20 ffffffff811669ba
Call Trace:
 [] dentry_iput+0x90/0x100
 [] d_kill+0x53/0x80
 [] prune_one_dentry+0xea/0x110
 [] __shrink_dcache_sb+0x260/0x2c0
 [] prune_dcache+0x11f/0x1d0
 [] shrink_dcache_memory+0x51/0x60
 [] arc_kmem_reap_now+0x61/0x100 [zfs] 
 [] arc_reclaim_thread+0x13a/0x280 [zfs]
 [] thread_generic_wrapper+0x81/0xe0 [spl]
 [] kthread+0x96/0xa0
 [] kernel_thread_helper+0x4/0x10

This second stack trace may be related. It shows an inode being released mid directory lookup. Once again, this could happen if shrink_inode_cache() was running concurrently and cleared this in-use inode because it didn't hold the needed reference.

SPLError: 20485:0:(sa.c:1428:sa_lookup()) ASSERTION(hdl) failed
SPLError: 20485:0:(sa.c:1428:sa_lookup()) SPL PANIC
SPL: Showing stack for process 20485
Pid: 20485, comm: rsync Tainted: P            2.6.35-28-generic #49-Ubuntu
Call Trace:
spl_debug_bug+0x81/0xe0 [spl]
sa_lookup+0x9f/0xb0 [zfs]
zfs_inode_update+0x4a/0x1b0 [zfs]
zfs_readdir+0x2db/0x4a0 [zfs]
zpl_readdir+0x5c/0xe0 [zfs]
vfs_readdir+0xc0/0xe0
sys_getdents+0x89/0xf0
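
To make the suspected failure mode concrete, here is a minimal userspace sketch of a missing-reference race (plain C with pthreads; not ZFS or kernel code, and all names are invented). One thread uses the object without first taking its own reference while the owner drops the last one; the assert in obj_get() plays the role of the BUG_ON above and usually trips within a few iterations:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct obj {
        atomic_int refcount;
        atomic_int cleared;     /* stands in for the VFS I_CLEAR state */
};

/* Mirrors __iget()/igrab(): only legal while the object is still live. */
static void obj_get(struct obj *o)
{
        assert(!atomic_load(&o->cleared));      /* the BUG_ON analogue */
        atomic_fetch_add(&o->refcount, 1);
}

/* Mirrors iput(): the final put clears the object. */
static void obj_put(struct obj *o)
{
        assert(!atomic_load(&o->cleared));
        if (atomic_fetch_sub(&o->refcount, 1) == 1)
                atomic_store(&o->cleared, 1);   /* evict; free() in real life */
}

/* The bug: this thread uses the object without holding a reference. */
static void *broken_user(void *arg)
{
        struct obj *o = arg;

        obj_get(o);
        obj_put(o);
        return NULL;
}

int main(void)
{
        for (int i = 0; i < 10000; i++) {
                struct obj o = { 0 };
                pthread_t t;

                atomic_store(&o.refcount, 1);   /* owner's reference */
                pthread_create(&t, NULL, broken_user, &o);
                obj_put(&o);            /* owner drops the last reference */
                pthread_join(&t, NULL);
        }
        puts("survived the race this run");
        return 0;
}

The fix for this class of bug is always the same: take the reference before the pointer is used or published.
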
@gunnarbeutner
Contributor

I've tried to reproduce this problem using the following patch to disable the kernel's inode/dentry cache - which makes triggering this bug quite a bit easier (usually happens within 1-2 seconds after starting a couple of instances of 'find'):

--- fs/inode.c.orig     2011-04-10 20:31:05.000000000 +0200
+++ fs/inode.c  2011-04-10 18:56:18.000000000 +0200
@@ -107,6 +107,8 @@

 static struct kmem_cache *inode_cachep __read_mostly;

+static DEFINE_SPINLOCK(gnb_debug_lock);
+
 static int get_nr_inodes(void)
 {
        int i;
@@ -164,6 +166,11 @@
        static const struct file_operations empty_fops;
        struct address_space *const mapping = &inode->i_data;

+       spin_lock(&gnb_debug_lock);
+       printk(KERN_WARNING "inode_init_always(0x%p)\n", inode);
+       dump_stack();
+       spin_unlock(&gnb_debug_lock);
+
        inode->i_sb = sb;
        inode->i_blkbits = sb->s_blocksize_bits;
        inode->i_flags = 0;
@@ -343,6 +350,11 @@
  */
 void __iget(struct inode *inode)
 {
+       spin_lock(&gnb_debug_lock);
+       printk(KERN_WARNING "iget(0x%p)\n", inode);
+       dump_stack();
+       spin_unlock(&gnb_debug_lock);
+
        atomic_inc(&inode->i_count);
 }

@@ -351,6 +363,11 @@
  */
 void ihold(struct inode *inode)
 {
+       spin_lock(&gnb_debug_lock);
+       printk(KERN_WARNING "ihold(0x%p)\n", inode);
+       dump_stack();
+       spin_unlock(&gnb_debug_lock);
+
        WARN_ON(atomic_inc_return(&inode->i_count) < 2);
 }
 EXPORT_SYMBOL(ihold);
@@ -1385,14 +1402,14 @@
                drop = generic_drop_inode(inode);

        if (!drop) {
-               if (sb->s_flags & MS_ACTIVE) {
+               /*if (sb->s_flags & MS_ACTIVE) {
                        inode->i_state |= I_REFERENCED;
                        if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
                                inode_lru_list_add(inode);
                        }
                        spin_unlock(&inode_lock);
                        return;
-               }
+               }*/
                WARN_ON(inode->i_state & I_NEW);
                inode->i_state |= I_WILL_FREE;
                spin_unlock(&inode_lock);
@@ -1436,6 +1453,11 @@
        if (inode) {
                BUG_ON(inode->i_state & I_CLEAR);

+               spin_lock(&gnb_debug_lock);
+               printk(KERN_WARNING "iput(0x%p)\n", inode);
+               dump_stack();
+               spin_unlock(&gnb_debug_lock);
+
                if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
                        iput_final(inode);
        }
--- fs/dcache.c.orig    2011-04-10 20:30:20.000000000 +0200
+++ fs/dcache.c 2011-04-10 13:00:15.000000000 +0200
@@ -450,7 +450,7 @@
        }

        /* Unreachable? Get rid of it */
-       if (d_unhashed(dentry))
+       /*if (d_unhashed(dentry))*/
                goto kill_it;

        /* Otherwise leave it cached and ensure it's on the LRU */

Here's the debug output I get with my patch:

[  296.127585] inode_init_always(0xffff88031997dd80)
[  296.127588] Pid: 1967, comm: find Tainted: P            2.6.38-8-server #41
[  296.127591] Call Trace:
[  296.127595]  [<ffffffff8117e0e4>] ? inode_init_always+0x34/0x200
[  296.127599]  [<ffffffff8117e2ea>] ? alloc_inode+0x3a/0xa0
[  296.127602]  [<ffffffff8117e377>] ? new_inode+0x27/0x90
[  296.127636]  [<ffffffffa03d744a>] ? zfs_znode_alloc+0x4a/0x5c0 [zfs]
[  296.127669]  [<ffffffffa03e731e>] ? zio_wait+0x21e/0x3e0 [zfs]
[  296.127687]  [<ffffffffa031dd71>] ? dbuf_read+0x3d1/0xd50 [zfs]
[  296.127717]  [<ffffffffa0372922>] ? refcount_remove_many+0x182/0x250 [zfs]
[  296.127747]  [<ffffffffa0372a06>] ? refcount_remove+0x16/0x20 [zfs]
[  296.127771]  [<ffffffffa03428d8>] ? dnode_rele+0x58/0xe0 [zfs]
[  296.127775]  [<ffffffff81038c79>] ? default_spin_lock_flags+0x9/0x10
[  296.127795]  [<ffffffffa032aa6b>] ? dmu_object_info_from_dnode+0x17b/0x230 [zfs]
[  296.127826]  [<ffffffffa03da908>] ? zfs_zget+0x108/0x280 [zfs]
[  296.127852]  [<ffffffffa03af0e1>] ? zap_lookup_norm+0xd1/0x1b0 [zfs]
[  296.127858]  [<ffffffffa025a0cb>] ? kmem_alloc_debug+0x27b/0x3a0 [spl]
[  296.127884]  [<ffffffffa03b583c>] ? zfs_dirent_lock+0x56c/0x720 [zfs]
[  296.127910]  [<ffffffffa03b1d21>] ? zfs_zaccess_aces_check+0x1a1/0x2f0 [zfs]
[  296.127936]  [<ffffffffa03b5a64>] ? zfs_dirlook+0x74/0x2c0 [zfs]
[  296.127962]  [<ffffffffa03b3aa3>] ? zfs_zaccess+0xa3/0x290 [zfs]
[  296.127986]  [<ffffffffa0372e7d>] ? rrw_enter+0xad/0x350 [zfs]
[  296.128012]  [<ffffffffa03d2972>] ? zfs_lookup+0x2e2/0x330 [zfs]
[  296.128037]  [<ffffffffa03ef5c8>] ? zpl_lookup+0x58/0x120 [zfs]
[  296.128041]  [<ffffffff8116f505>] ? d_alloc_and_lookup+0x45/0x90
[  296.128043]  [<ffffffff8117c7c5>] ? d_lookup+0x35/0x60
[  296.128046]  [<ffffffff81171732>] ? do_lookup+0x182/0x2e0
[  296.128049]  [<ffffffff811817fe>] ? vfsmount_lock_local_unlock+0x1e/0x30
[  296.128052]  [<ffffffff811720a6>] ? link_path_walk+0x656/0xc40
[  296.128075]  [<ffffffffa03731bb>] ? rrw_exit+0x9b/0x290 [zfs]
[  296.128091]  [<ffffffffa0329ab0>] ? dmu_object_size_from_db+0x70/0xa0 [zfs]
[  296.128094]  [<ffffffff811817ce>] ? vfsmount_lock_local_lock+0x1e/0x30
[  296.128097]  [<ffffffff8116fbf5>] ? path_init_rcu+0xa5/0x200
[  296.128100]  [<ffffffff8117298b>] ? do_path_lookup+0x5b/0x160
[  296.128103]  [<ffffffff81172cf7>] ? user_path_at+0x57/0xa0
[  296.128128]  [<ffffffffa03ef7a8>] ? zpl_getattr+0x118/0x1b0 [zfs]
[  296.128131]  [<ffffffff811817fe>] ? vfsmount_lock_local_unlock+0x1e/0x30
[  296.128134]  [<ffffffff81169a89>] ? vfs_fstatat+0x39/0x70
[  296.128137]  [<ffffffff811666f3>] ? __fput+0x143/0x1f0
[  296.128140]  [<ffffffff81169dba>] ? sys_newfstatat+0x1a/0x40
[  296.128142]  [<ffffffff811667c5>] ? fput+0x25/0x30
[  296.128145]  [<ffffffff811630e0>] ? filp_close+0x60/0x90
[  296.128147]  [<ffffffff811638e7>] ? sys_close+0xb7/0x120
[  296.128150]  [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b

[  296.128638] iput(0xffff88031997dd80)
[  296.128642] Pid: 1967, comm: find Tainted: P            2.6.38-8-server #41
[  296.128644] Call Trace:
[  296.128648]  [<ffffffff8117e61e>] ? iput+0x3e/0x80
[  296.128652]  [<ffffffff8117ae40>] ? d_kill+0x100/0x140
[  296.128655]  [<ffffffff8117af42>] ? dput+0xc2/0x100
[  296.128659]  [<ffffffff8116f3ea>] ? path_put+0x1a/0x30
[  296.128663]  [<ffffffff81169ab4>] ? vfs_fstatat+0x64/0x70
[  296.128666]  [<ffffffff81169dba>] ? sys_newfstatat+0x1a/0x40
[  296.128671]  [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b

[  296.128674] iget(0xffff88031997dd80)
[  296.128677] Pid: 1973, comm: find Tainted: P            2.6.38-8-server #41
[  296.128679] Call Trace:
[  296.128683]  [<ffffffff8117ec43>] ? __iget+0x33/0x50
[  296.128687]  [<ffffffff8117f9f8>] ? igrab+0x48/0x50
[  296.128719]  [<ffffffffa03daa04>] ? zfs_zget+0x204/0x280 [zfs]
[  296.128750]  [<ffffffffa03af0e1>] ? zap_lookup_norm+0xd1/0x1b0 [zfs]
[  296.128782]  [<ffffffffa03b583c>] ? zfs_dirent_lock+0x56c/0x720 [zfs]
[  296.128813]  [<ffffffffa03b1d21>] ? zfs_zaccess_aces_check+0x1a1/0x2f0 [zfs]
[  296.128845]  [<ffffffffa03b5a64>] ? zfs_dirlook+0x74/0x2c0 [zfs]
[  296.128876]  [<ffffffffa03b3aa3>] ? zfs_zaccess+0xa3/0x290 [zfs]
[  296.128916]  [<ffffffffa0372e7d>] ? rrw_enter+0xad/0x350 [zfs]
[  296.128947]  [<ffffffffa03d2972>] ? zfs_lookup+0x2e2/0x330 [zfs]
[  296.128993]  [<ffffffffa03ef5c8>] ? zpl_lookup+0x58/0x120 [zfs]
[  296.128998]  [<ffffffff8116f505>] ? d_alloc_and_lookup+0x45/0x90
[  296.129001]  [<ffffffff8117c7c5>] ? d_lookup+0x35/0x60
[  296.129005]  [<ffffffff81171732>] ? do_lookup+0x182/0x2e0
[  296.129009]  [<ffffffff811817fe>] ? vfsmount_lock_local_unlock+0x1e/0x30
[  296.129014]  [<ffffffff811720a6>] ? link_path_walk+0x656/0xc40
[  296.129042]  [<ffffffffa03731bb>] ? rrw_exit+0x9b/0x290 [zfs]
[  296.129062]  [<ffffffffa0329ab0>] ? dmu_object_size_from_db+0x70/0xa0 [zfs]
[  296.129067]  [<ffffffff811817ce>] ? vfsmount_lock_local_lock+0x1e/0x30
[  296.129070]  [<ffffffff8116fbf5>] ? path_init_rcu+0xa5/0x200
[  296.129074]  [<ffffffff8117298b>] ? do_path_lookup+0x5b/0x160
[  296.129078]  [<ffffffff81172cf7>] ? user_path_at+0x57/0xa0
[  296.129110]  [<ffffffffa03ef7a8>] ? zpl_getattr+0x118/0x1b0 [zfs]
[  296.129114]  [<ffffffff811817fe>] ? vfsmount_lock_local_unlock+0x1e/0x30
[  296.129119]  [<ffffffff81169a89>] ? vfs_fstatat+0x39/0x70
[  296.129122]  [<ffffffff811666f3>] ? __fput+0x143/0x1f0
[  296.129126]  [<ffffffff81169dba>] ? sys_newfstatat+0x1a/0x40
[  296.129129]  [<ffffffff811667c5>] ? fput+0x25/0x30
[  296.129133]  [<ffffffff811630e0>] ? filp_close+0x60/0x90
[  296.129137]  [<ffffffff811638e7>] ? sys_close+0xb7/0x120
[  296.129140]  [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b

[  296.129149] iput(0xffff88031997dd80)
[  296.129152] Pid: 1973, comm: find Tainted: P            2.6.38-8-server #41
[  296.129155] Call Trace:
[  296.129158] ------------[ cut here ]------------
[  296.129161] kernel BUG at /root/linux-2.6.38/fs/inode.c:1454!
[  296.129163] invalid opcode: 0000 [#1] SMP
[  296.129164] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
[  296.129166] CPU 3
[  296.129167] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate nouveau ttm drm_kms_helper drm i7core_edac edac_core i2c_algo_bit video lp parport multipath linear aacraid 3w_9xxx 3w_xxxx raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 usbhid hid raid0 sata_nv sata_sil sata_via ahci r8169 libahci
[  296.129184]
[  296.129186] Pid: 1969, comm: find Tainted: P            2.6.38-8-server #41 MSI MS-7522/MSI X58 Pro-E (MS-7522)
[  296.129189] RIP: 0010:[<ffffffff8117e652>]  [<ffffffff8117e652>] iput+0x72/0x80
[  296.129194] RSP: 0018:ffff88031a59be18  EFLAGS: 00010202
[  296.129195] RAX: 0000000000000000 RBX: ffff8803192ce178 RCX: ffff8803192ce210
[  296.129197] RDX: ffff8803192ce210 RSI: 0000000000000001 RDI: ffff8803192ce178
[  296.129198] RBP: ffff88031a59be28 R08: dead000000200200 R09: 0000000000000000
[  296.129200] R10: ffff88033f810400 R11: 0000000000000000 R12: ffff8803192ce178
[  296.129202] R13: ffff880321ca0a80 R14: ffff8803192ce178 R15: 0000000002365cc0
[  296.129204] FS:  00007faee67d1720(0000) GS:ffff8800bf460000(0000) knlGS:0000000000000000
[  296.129206] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  296.129207] CR2: 00000000023730a8 CR3: 0000000322bf6000 CR4: 00000000000006e0
[  296.129209] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  296.129211] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  296.129213] Process find (pid: 1969, threadinfo ffff88031a59a000, task ffff8803196816e0)
[  296.129214] Stack:
[  296.129215]  0000000200000000 ffff880324605b40 ffff88031a59be58 ffffffff8117ae40
[  296.129217]  ffff88031a59be68 ffff880324605b40 ffff880324605b9c ffff880321ca0a80
[  296.129220]  ffff88031a59be88 ffffffff8117af42 ffff88031a59bec8 0000000000000100
[  296.129222] Call Trace:
[  296.129224]  [<ffffffff8117ae40>] d_kill+0x100/0x140
[  296.129227]  [<ffffffff8117af42>] dput+0xc2/0x100
[  296.129229]  [<ffffffff8116f3ea>] path_put+0x1a/0x30
[  296.129232]  [<ffffffff81169ab4>] vfs_fstatat+0x64/0x70
[  296.129234]  [<ffffffff81169dba>] sys_newfstatat+0x1a/0x40
[  296.129237]  [<ffffffff8100bfc2>] system_call_fastpath+0x16/0x1b
[  296.129239] Code: a5 eb ff 66 90 48 8d bb b0 00 00 00 48 c7 c6 40 f0 cd 81 e8 71 ba 15 00 85 c0 74 08 48 89 df e8 95 fd ff ff 48 83 c4 08 5b c9 c3 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 31 c0 b9 15 00 00
[  296.129254] RIP  [<ffffffff8117e652>] iput+0x72/0x80
[  296.129257]  RSP <ffff88031a59be18>

Looks like zfs_zget tried to igrab() an inode that had been cleaned up shortly before.
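
For reference, here is the contract that matters, modeled in userspace (igrab_like() mimics the kernel's igrab(); the flag values are illustrative, not the kernel's): igrab() refuses to hand out a reference once an inode is being freed and returns NULL, so a lookup path racing with eviction has to handle NULL instead of assuming the grab succeeded:

#include <stdio.h>

#define I_WILL_FREE     0x1     /* illustrative values */
#define I_FREEING       0x2

struct inode_like {
        int i_count;
        unsigned long i_state;
};

/* Same shape as the kernel's igrab(): NULL if the inode is dying. */
static struct inode_like *igrab_like(struct inode_like *inode)
{
        if (inode->i_state & (I_FREEING | I_WILL_FREE))
                return NULL;            /* caller gets no reference */
        inode->i_count++;
        return inode;
}

int main(void)
{
        struct inode_like live  = { .i_count = 1, .i_state = 0 };
        struct inode_like dying = { .i_count = 0, .i_state = I_FREEING };

        /* A zfs_zget()-style caller must treat NULL as "retry or fail". */
        printf("live inode:  %s\n", igrab_like(&live)  ? "got reference" : "NULL");
        printf("dying inode: %s\n", igrab_like(&dying) ? "got reference" : "NULL");
        return 0;
}
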

The second problem does indeed seem to be related; here's a stack trace I got from another test run:

[  178.542093] SPL: Showing stack for process 1806
[  178.542099] Pid: 1806, comm: find Tainted: P            2.6.38-8-server #41
[  178.542101] Call Trace:
[  178.542114]  [<ffffffffa0261617>] ? spl_debug_dumpstack+0x27/0x40 [spl]
[  178.542121]  [<ffffffffa02629c1>] ? spl_debug_bug+0x81/0xe0 [spl]
[  178.542159]  [<ffffffffa037945f>] ? sa_lookup+0x9f/0xb0 [zfs]
[  178.542193]  [<ffffffffa03b7ade>] ? zfs_dirlook+0xee/0x2c0 [zfs]
[  178.542228]  [<ffffffffa03b5aa3>] ? zfs_zaccess+0xa3/0x290 [zfs]
[  178.542263]  [<ffffffffa03da221>] ? zfs_inode_update+0x51/0x1c0 [zfs]
[  178.542294]  [<ffffffffa0374e7d>] ? rrw_enter+0xad/0x350 [zfs]
[  178.542329]  [<ffffffffa03d4995>] ? zfs_lookup+0x305/0x330 [zfs]
[  178.542363]  [<ffffffffa03f1598>] ? zpl_lookup+0x58/0x120 [zfs]
[  178.542368]  [<ffffffff8116f505>] ? d_alloc_and_lookup+0x45/0x90
[  178.542372]  [<ffffffff8117c875>] ? d_lookup+0x35/0x60
[  178.542376]  [<ffffffff81171732>] ? do_lookup+0x182/0x2e0
[  178.542380]  [<ffffffff811817ce>] ? vfsmount_lock_local_unlock+0x1e/0x30
[  178.542384]  [<ffffffff811720a6>] ? link_path_walk+0x656/0xc40
[  178.542388]  [<ffffffff8118179e>] ? vfsmount_lock_local_lock+0x1e/0x30
[  178.542392]  [<ffffffff8116fbf5>] ? path_init_rcu+0xa5/0x200
[  178.542396]  [<ffffffff8117298b>] ? do_path_lookup+0x5b/0x160
[  178.542400]  [<ffffffff81172cf7>] ? user_path_at+0x57/0xa0
[  178.542404]  [<ffffffff8119e09d>] ? fsnotify+0x1cd/0x2e0
[  178.542409]  [<ffffffff8138937b>] ? put_ldisc+0x5b/0xc0
[  178.542414]  [<ffffffff81169a89>] ? vfs_fstatat+0x39/0x70
[  178.542417]  [<ffffffff81169dba>] ? sys_newfstatat+0x1a/0x40
[  178.542421]  [<ffffffff81165343>] ? vfs_write+0x123/0x180
[  178.542424]  [<ffffffff81165601>] ? sys_write+0x51/0x90
[  178.542429]  [<ffffffff8100bfc2>] ? system_call_fastpath+0x16/0x1b
[  178.542625] SPL: Dumping log to /tmp/spl-log.1302399011.1806

And the matching stack trace where that zp->z_sa_hdl was cleared:

[  178.542666] SPL: Showing stack for process 1178
[  178.542670] Pid: 1178, comm: arc_reclaim Tainted: P            2.6.38-8-server #41
[  178.542673] Call Trace:
[  178.542682]  [<ffffffffa0261617>] ? spl_debug_dumpstack+0x27/0x40 [spl]
[  178.542720]  [<ffffffffa03d9fc7>] ? zfs_znode_dmu_fini+0x97/0x150 [zfs]
[  178.542754]  [<ffffffffa03dd050>] ? zfs_zinactive+0xe0/0x160 [zfs]
[  178.542789]  [<ffffffffa03d618e>] ? zfs_inactive+0x6e/0x360 [zfs]
[  178.542823]  [<ffffffffa03f25cf>] ? zpl_evict_inode+0x2f/0x40 [zfs]
[  178.542827]  [<ffffffff8117deb4>] ? evict+0x24/0xc0
[  178.542831]  [<ffffffff8117e717>] ? iput_final+0x177/0x200
[  178.542834]  [<ffffffff8117e7db>] ? iput+0x3b/0x50
[  178.542838]  [<ffffffff8117ae40>] ? d_kill+0x100/0x140
[  178.542841]  [<ffffffff8117affd>] ? shrink_dentry_list+0x17d/0x1f0
[  178.542844]  [<ffffffff8117b21e>] ? __shrink_dcache_sb+0x1ae/0x1e0
[  178.542848]  [<ffffffff8117b34b>] ? prune_dcache+0xfb/0x190
[  178.542852]  [<ffffffff8117b435>] ? shrink_dcache_memory+0x55/0x60
[  178.542869]  [<ffffffffa03150f0>] ? arc_kmem_reap_now+0x60/0x100 [zfs]
[  178.542886]  [<ffffffffa03153a2>] ? arc_reclaim_thread+0x152/0x290 [zfs]
[  178.542904]  [<ffffffffa0315250>] ? arc_reclaim_thread+0x0/0x290 [zfs]
[  178.542912]  [<ffffffffa026ab61>] ? thread_generic_wrapper+0x81/0xe0 [spl]
[  178.542921]  [<ffffffffa026aae0>] ? thread_generic_wrapper+0x0/0xe0 [spl]
[  178.542925]  [<ffffffff810871f6>] ? kthread+0x96/0xa0
[  178.542929]  [<ffffffff8100cde4>] ? kernel_thread_helper+0x4/0x10
[  178.542933]  [<ffffffff81087160>] ? kthread+0x0/0xa0
[  178.542936]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10

@Rudd-O
Contributor

Rudd-O commented Apr 11, 2011

I can reliably reproduce this bug without any patches to the Linux kernel. This happens with both the shrinker branch and the latest master; the bug is the same either way.

http://pastebin.com/GWytqvE7 is the stack trace for the four consecutive oopses I got. After the first oops, memory use continues to balloon until it gets to about 15K free, and then the machine hard-locks.

Goddamn, I can't back up this machine because of this stupid bug, and my SSD is starting to fail. I do not know what I am going to do, but in a few days this will be the first time I will ever have experienced data loss as a direct consequence of using ZFS.

@Rudd-O
Contributor

Rudd-O commented Apr 11, 2011

Additional information:

/sys/module/zfs/parameters α:
for a in * ; do echo -n $a " " ; cat $a ; done
arc_reduce_dnlc_percent 3
spa_config_path /etc/zfs/zpool.cache
zevent_cols 80
zevent_console 0
zevent_len_max 64
zfs_arc_max 0
zfs_arc_meta_limit 0
zfs_arc_min 0
zfs_prefetch_disable 0
zfs_vdev_aggregation_limit 131072
zfs_vdev_max_pending 10
zfs_vdev_min_pending 4
zfs_vdev_scheduler noop
zio_bulk_flags 0
zio_delay_max 30000

@Rudd-O
Contributor

Rudd-O commented Apr 11, 2011

I tried setting zfs_arc_max to 256 MB. ZFS simply does not respect it. The machine starts with 1.5 GB free, and during rsync it just eats up all that memory, then freezes. What is the point of these knobs if ZFS simply does not care at all to respect them?

@gunnarbeutner
Contributor

As far as I can tell, sa_set_userp / sa_get_userdata are hiding an unreferenced pointer to the inode/znode_t. On Solaris, zfs_zinactive 'resurrects' a vnode_t when the code detects that some other thread has grabbed a reference to the vnode while it was trying to clear it (http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zfs_znode.c#1297). This can't be done so easily on Linux, as zfs_zinactive gets called by iput_final, at which point we can't just bump the ref count back up again.
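
Here is a minimal userspace mock of that design difference (invented names; this is not the Solaris or Linux source):

#include <stdbool.h>
#include <stdio.h>

struct node {
        int holds;
        bool dead;
};

/*
 * Solaris-style: the inactive callback runs while the object can still be
 * grabbed, so it can notice a racing hold and back out ("resurrect").
 */
static void inactive_with_resurrect(struct node *n)
{
        if (n->holds > 0)
                return;         /* someone grabbed it; leave it alive */
        n->dead = true;
}

/*
 * Linux-style: by the time ->evict_inode() (and thus zfs_zinactive) runs,
 * iput_final() has already committed to freeing the inode. There is no
 * resurrection path, so the racing thread has to be turned away earlier,
 * e.g. by igrab() returning NULL.
 */
static void evict_no_resurrect(struct node *n)
{
        n->dead = true;
}

int main(void)
{
        struct node a = { .holds = 1 };  /* a hold raced in */
        struct node b = { .holds = 1 };

        inactive_with_resurrect(&a);
        evict_no_resurrect(&b);
        printf("solaris-style survives: %s, linux-style survives: %s\n",
               a.dead ? "no" : "yes", b.dead ? "no" : "yes");
        return 0;
}
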

@behlendorf
Contributor Author

Thanks for digging into this, gunnarbeutner. The debugging output from your patch looks like a great start on getting this resolved. I see what you're saying about having the vnode detect this race and be resurrected. Once I get a little time (next week) I'll take a serious look at what's going on here and what can be done.

Rudd-O, these tunings will be respected eventually, but until this bug is fixed, enforcing the limit just makes this issue more likely. The metadata branch has had a patch to enforce the metadata limit for some time. As for the general zfs_arc_max limit, it is being enforced, but due to slab fragmentation the ARC may believe it only has 1 GiB allocated when really 2 GiB is allocated to the slabs. This is being worked on, but it all takes time.
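
To illustrate how the accounting can be off by a factor of two, here is a toy calculation (all numbers are invented for illustration; SPL's real slab geometry differs). A slab can only be returned to the kernel when every object in it is free, so a cache that is 50% live by object count can still pin 100% of its pages:

#include <stdio.h>

int main(void)
{
        const long long obj_size  = 512;                /* bytes per buffer */
        const long long slab_size = 128 * 1024;         /* bytes per slab */
        const long long slabs     = 16384;
        const long long objs_per_slab = slab_size / obj_size;

        /* Suppose frees left every slab half full: no slab is empty, so
         * none can be handed back to the page allocator. */
        const long long live_objs = slabs * objs_per_slab / 2;

        printf("ARC thinks it holds:   %lld MiB\n",
               live_objs * obj_size >> 20);             /* 1024 MiB */
        printf("pages pinned in slabs: %lld MiB\n",
               slabs * slab_size >> 20);                /* 2048 MiB */
        return 0;
}
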

@gunnarbeutner
Contributor

After applying https://github.com/gunnarbeutner/pkg-zfs/commit/1b09ecd2089d7c94ece0f3fb1743a88394f348f9 I can't reproduce this problem anymore - even when running several dozen "find"s at once.

(Edit: Just realized there was a missing iput() in line 880.)

@Rudd-O
Contributor

Rudd-O commented Apr 14, 2011

I can confirm that gunnarbeutner's commit, cherry-picked and applied on top of Brian's work, resolved the problem for me. I am rsyncing and running find; the only problem now is memory reporting (cached data in ZFS appears as used memory, not as cached, and kswapd works overtime finding new pages because of this). But the system is now STABLE.

@Rudd-O
Contributor

Rudd-O commented Apr 14, 2011

I spoke too soon. The kernel BUG is fixed, for realz. HOWEVER, the increasing memory use still locks up the machine. Apparently it won't lock up if I just leave it running; it will just hover around ~20K free. But if I start an app during those dire, critically low memory situations, the machine will just lock up.

Apr 14 15:07:29 karen kernel: z_wr_int/5: page allocation failure. order:0, mode:0x20
Apr 14 15:07:29 karen kernel: Pid: 1073, comm: z_wr_int/5 Tainted: P 2.6.38karen.dragonfear #1
Apr 14 15:07:29 karen kernel: Call Trace:
Apr 14 15:07:30 karen kernel: [] ? __alloc_pages_nodemask+0x677/0x6f6
Apr 14 15:07:30 karen kernel: [] ? resched_task+0x43/0x6d
Apr 14 15:07:30 karen kernel: [] ? alloc_pages_current+0xac/0xc5
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x16/0x1c
Apr 14 15:07:30 karen kernel: [] ? alloc_slab_page+0x17/0x45
Apr 14 15:07:30 karen kernel: [] ? new_slab+0x68/0x1a2
Apr 14 15:07:30 karen kernel: [] ? __slab_alloc+0x21f/0x2fd
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? kmem_cache_alloc+0x56/0xde
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? scsi_host_alloc_command+0x1c/0x67
Apr 14 15:07:30 karen kernel: [] ? __scsi_get_command+0x19/0x94
Apr 14 15:07:30 karen kernel: [] ? scsi_get_command+0x38/0xa0
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x16/0x1c
Apr 14 15:07:30 karen kernel: [] ? scsi_setup_fs_cmnd+0x69/0xb9
Apr 14 15:07:30 karen kernel: [] ? sd_prep_fn+0x282/0xb91
Apr 14 15:07:30 karen kernel: [] ? mempool_alloc+0x67/0x119
Apr 14 15:07:30 karen kernel: [] ? blk_peek_request+0xe0/0x1b8
Apr 14 15:07:30 karen kernel: [] ? scsi_request_fn+0x7a/0x412
Apr 14 15:07:30 karen kernel: [] ? __generic_unplug_device+0x32/0x36
Apr 14 15:07:30 karen kernel: [] ? elv_insert+0x172/0x17b
Apr 14 15:07:30 karen kernel: [] ? __elv_add_request+0x6e/0x75
Apr 14 15:07:30 karen kernel: [] ? __make_request+0x353/0x3ce
Apr 14 15:07:30 karen kernel: [] ? try_to_freeze+0xf/0x26
Apr 14 15:07:30 karen kernel: [] ? generic_make_request+0x285/0x2fe
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x16/0x1c
Apr 14 15:07:30 karen kernel: [] ? submit_bio+0xdb/0xfa
Apr 14 15:07:30 karen kernel: [] ? bio_alloc_bioset+0x71/0xbf
Apr 14 15:07:30 karen kernel: [] ? __vdev_disk_physio+0x27f/0x2fc [zfs]
Apr 14 15:07:30 karen kernel: [] ? vdev_disk_io_start+0xcf/0xe7 [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_vdev_io_start+0x217/0x22a [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_nowait+0xe4/0x10e [zfs]
Apr 14 15:07:30 karen kernel: [] ? wake_up_process+0x10/0x12
Apr 14 15:07:30 karen kernel: [] ? vdev_queue_io_done+0x76/0xad [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_vdev_io_done+0x6c/0x157 [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_execute+0xc1/0xeb [zfs]
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? taskq_thread+0x2db/0x31b [spl]
Apr 14 15:07:30 karen kernel: [] ? default_wake_function+0x0/0xf
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? taskq_thread+0x0/0x31b [spl]
Apr 14 15:07:30 karen kernel: [] ? kthread+0x7d/0x85
Apr 14 15:07:30 karen kernel: [] ? kernel_thread_helper+0x4/0x10
Apr 14 15:07:30 karen kernel: [] ? kthread+0x0/0x85
Apr 14 15:07:30 karen kernel: [] ? kernel_thread_helper+0x0/0x10
Apr 14 15:07:30 karen kernel: Mem-Info:
Apr 14 15:07:30 karen kernel: Node 0 DMA per-cpu:
Apr 14 15:07:30 karen kernel: CPU 0: hi: 0, btch: 1 usd: 0
Apr 14 15:07:30 karen kernel: CPU 1: hi: 0, btch: 1 usd: 0
Apr 14 15:07:30 karen kernel: Node 0 DMA32 per-cpu:
Apr 14 15:07:30 karen kernel: CPU 0: hi: 186, btch: 31 usd: 174
Apr 14 15:07:30 karen kernel: CPU 1: hi: 186, btch: 31 usd: 14
Apr 14 15:07:30 karen kernel: active_anon:89437 inactive_anon:92346 isolated_anon:0
Apr 14 15:07:30 karen kernel: active_file:8904 inactive_file:52945 isolated_file:0
Apr 14 15:07:30 karen kernel: unevictable:1045 dirty:20 writeback:0 unstable:0
Apr 14 15:07:30 karen kernel: free:476 slab_reclaimable:12803 slab_unreclaimable:45413
Apr 14 15:07:30 karen kernel: mapped:25626 shmem:99295 pagetables:7389 bounce:0
Apr 14 15:07:30 karen kernel: Node 0 DMA free:1872kB min:36kB low:44kB high:52kB active_anon:72kB inactive_anon:1016kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:1088kB slab_reclaimable:160kB slab_unreclaimable:344kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 14 15:07:30 karen kernel: lowmem_reserve[]: 0 2884 2884 2884
Apr 14 15:07:30 karen kernel: Node 0 DMA32 free:32kB min:6852kB low:8564kB high:10276kB active_anon:357676kB inactive_anon:368368kB active_file:35616kB inactive_file:211780kB unevictable:4180kB isolated(anon):0kB isolated(file):0kB present:2953592kB mlocked:4180kB dirty:80kB writeback:0kB mapped:102504kB shmem:396092kB slab_reclaimable:51052kB slab_unreclaimable:181308kB kernel_stack:4048kB pagetables:29556kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 14 15:07:30 karen kernel: lowmem_reserve[]: 0 0 0 0
Apr 14 15:07:30 karen kernel: Node 0 DMA: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1872kB
Apr 14 15:07:30 karen kernel: Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Apr 14 15:07:30 karen kernel: 161830 total pagecache pages
Apr 14 15:07:30 karen kernel: 1 pages in swap cache
Apr 14 15:07:30 karen kernel: Swap cache stats: add 1, delete 0, find 0/0
Apr 14 15:07:30 karen kernel: Free swap = 4194640kB
Apr 14 15:07:30 karen kernel: Total swap = 4194644kB
Apr 14 15:07:30 karen kernel: 752880 pages RAM
Apr 14 15:07:30 karen kernel: 13522 pages reserved
Apr 14 15:07:30 karen kernel: 131626 pages shared
Apr 14 15:07:30 karen kernel: 615661 pages non-shared
Apr 14 15:07:30 karen kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Apr 14 15:07:30 karen kernel: cache: kmalloc-128, object size: 128, buffer size: 128, default order: 0, min order: 0
Apr 14 15:07:30 karen kernel: node 0: slabs: 1150, objs: 36800, free: 0
Apr 14 15:07:30 karen kernel: z_wr_int/5: page allocation failure. order:0, mode:0x20
Apr 14 15:07:30 karen kernel: Pid: 1073, comm: z_wr_int/5 Tainted: P 2.6.38karen.dragonfear #1
Apr 14 15:07:30 karen kernel: Call Trace:
Apr 14 15:07:30 karen kernel: [] ? __alloc_pages_nodemask+0x677/0x6f6
Apr 14 15:07:30 karen kernel: [] ? resched_task+0x43/0x6d
Apr 14 15:07:30 karen kernel: [] ? alloc_pages_current+0xac/0xc5
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x16/0x1c
Apr 14 15:07:30 karen kernel: [] ? alloc_slab_page+0x17/0x45
Apr 14 15:07:30 karen kernel: [] ? new_slab+0x68/0x1a2
Apr 14 15:07:30 karen kernel: [] ? __slab_alloc+0x21f/0x2fd
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? kmem_cache_alloc+0x56/0xde
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? scsi_host_alloc_command+0x1c/0x67
Apr 14 15:07:30 karen kernel: [] ? __scsi_get_command+0x19/0x94
Apr 14 15:07:30 karen kernel: [] ? scsi_get_command+0x38/0xa0
Apr 14 15:07:30 karen kernel: [] ? scsi_setup_fs_cmnd+0x69/0xb9
Apr 14 15:07:30 karen kernel: [] ? sd_prep_fn+0x282/0xb91
Apr 14 15:07:30 karen kernel: [] ? __ata_scsi_queuecmd+0x189/0x1db
Apr 14 15:07:30 karen kernel: [] ? blk_peek_request+0xe0/0x1b8
Apr 14 15:07:30 karen kernel: [] ? scsi_request_fn+0x7a/0x412
Apr 14 15:07:30 karen kernel: [] ? __generic_unplug_device+0x32/0x36
Apr 14 15:07:30 karen kernel: [] ? elv_insert+0x172/0x17b
Apr 14 15:07:30 karen kernel: [] ? __elv_add_request+0x6e/0x75
Apr 14 15:07:30 karen kernel: [] ? __make_request+0x353/0x3ce
Apr 14 15:07:30 karen kernel: [] ? try_to_freeze+0xf/0x26
Apr 14 15:07:30 karen kernel: [] ? generic_make_request+0x285/0x2fe
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x16/0x1c
Apr 14 15:07:30 karen kernel: [] ? submit_bio+0xdb/0xfa
Apr 14 15:07:30 karen kernel: [] ? bio_alloc_bioset+0x71/0xbf
Apr 14 15:07:30 karen kernel: [] ? __vdev_disk_physio+0x27f/0x2fc [zfs]
Apr 14 15:07:30 karen kernel: [] ? vdev_disk_io_start+0xcf/0xe7 [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_vdev_io_start+0x217/0x22a [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_nowait+0xe4/0x10e [zfs]
Apr 14 15:07:30 karen kernel: [] ? wake_up_process+0x10/0x12
Apr 14 15:07:30 karen kernel: [] ? vdev_queue_io_done+0x76/0xad [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_vdev_io_done+0x6c/0x157 [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_execute+0xc1/0xeb [zfs]
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? taskq_thread+0x2db/0x31b [spl]
Apr 14 15:07:30 karen kernel: [] ? default_wake_function+0x0/0xf
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? taskq_thread+0x0/0x31b [spl]
Apr 14 15:07:30 karen kernel: [] ? kthread+0x7d/0x85
Apr 14 15:07:30 karen kernel: [] ? kernel_thread_helper+0x4/0x10
Apr 14 15:07:30 karen kernel: [] ? kthread+0x0/0x85
Apr 14 15:07:30 karen kernel: [] ? kernel_thread_helper+0x0/0x10
Apr 14 15:07:30 karen kernel: Mem-Info:
Apr 14 15:07:30 karen kernel: Node 0 DMA per-cpu:
Apr 14 15:07:30 karen kernel: CPU 0: hi: 0, btch: 1 usd: 0
Apr 14 15:07:30 karen kernel: CPU 1: hi: 0, btch: 1 usd: 0
Apr 14 15:07:30 karen kernel: Node 0 DMA32 per-cpu:
Apr 14 15:07:30 karen kernel: CPU 0: hi: 186, btch: 31 usd: 161
Apr 14 15:07:30 karen kernel: CPU 1: hi: 186, btch: 31 usd: 14
Apr 14 15:07:30 karen kernel: active_anon:89437 inactive_anon:92346 isolated_anon:0
Apr 14 15:07:30 karen kernel: active_file:8904 inactive_file:52888 isolated_file:0
Apr 14 15:07:30 karen kernel: unevictable:1045 dirty:20 writeback:0 unstable:0
Apr 14 15:07:30 karen kernel: free:538 slab_reclaimable:12803 slab_unreclaimable:45413
Apr 14 15:07:30 karen kernel: mapped:25626 shmem:99295 pagetables:7389 bounce:0
Apr 14 15:07:30 karen kernel: Node 0 DMA free:1872kB min:36kB low:44kB high:52kB active_anon:72kB inactive_anon:1016kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:1088kB slab_reclaimable:160kB slab_unreclaimable:344kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 14 15:07:30 karen kernel: lowmem_reserve[]: 0 2884 2884 2884
Apr 14 15:07:30 karen kernel: Node 0 DMA32 free:280kB min:6852kB low:8564kB high:10276kB active_anon:357676kB inactive_anon:368368kB active_file:35616kB inactive_file:211552kB unevictable:4180kB isolated(anon):0kB isolated(file):0kB present:2953592kB mlocked:4180kB dirty:80kB writeback:0kB mapped:102504kB shmem:396092kB slab_reclaimable:51052kB slab_unreclaimable:181308kB kernel_stack:4048kB pagetables:29556kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 14 15:07:30 karen kernel: lowmem_reserve[]: 0 0 0 0
Apr 14 15:07:30 karen kernel: Node 0 DMA: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1872kB
Apr 14 15:07:30 karen kernel: Node 0 DMA32: 8*4kB 7*8kB 4*16kB 1*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 248kB
Apr 14 15:07:30 karen kernel: 161793 total pagecache pages
Apr 14 15:07:30 karen kernel: 1 pages in swap cache
Apr 14 15:07:30 karen kernel: Swap cache stats: add 1, delete 0, find 0/0
Apr 14 15:07:30 karen kernel: Free swap = 4194640kB
Apr 14 15:07:30 karen kernel: Total swap = 4194644kB
Apr 14 15:07:30 karen kernel: 752880 pages RAM
Apr 14 15:07:30 karen kernel: 13522 pages reserved
Apr 14 15:07:30 karen kernel: 131626 pages shared
Apr 14 15:07:30 karen kernel: 615629 pages non-shared
Apr 14 15:07:30 karen kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Apr 14 15:07:30 karen kernel: cache: kmalloc-128, object size: 128, buffer size: 128, default order: 0, min order: 0
Apr 14 15:07:30 karen kernel: node 0: slabs: 1150, objs: 36800, free: 0
Apr 14 15:07:30 karen kernel: z_wr_int/5: page allocation failure. order:0, mode:0x20
Apr 14 15:07:30 karen kernel: Pid: 1073, comm: z_wr_int/5 Tainted: P 2.6.38karen.dragonfear #1
Apr 14 15:07:30 karen kernel: Call Trace:
Apr 14 15:07:30 karen kernel: [] ? __alloc_pages_nodemask+0x677/0x6f6
Apr 14 15:07:30 karen kernel: [] ? alloc_pages_current+0xac/0xc5
Apr 14 15:07:30 karen kernel: [] ? alloc_slab_page+0x17/0x45
Apr 14 15:07:30 karen kernel: [] ? new_slab+0x68/0x1a2
Apr 14 15:07:30 karen kernel: [] ? __slab_alloc+0x21f/0x2fd
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? kmem_cache_alloc+0x56/0xde
Apr 14 15:07:30 karen kernel: [] ? resched_task+0x37/0x6d
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? scsi_host_alloc_command+0x1c/0x67
Apr 14 15:07:30 karen kernel: [] ? timerqueue_add+0x89/0xa8
Apr 14 15:07:30 karen kernel: [] ? __scsi_get_command+0x19/0x94
Apr 14 15:07:30 karen kernel: [] ? scsi_get_command+0x38/0xa0
Apr 14 15:07:30 karen kernel: [] ? tick_dev_program_event+0x31/0xef
Apr 14 15:07:30 karen kernel: [] ? scsi_setup_fs_cmnd+0x69/0xb9
Apr 14 15:07:30 karen kernel: [] ? sd_prep_fn+0x282/0xb91
Apr 14 15:07:30 karen kernel: [] ? irq_exit+0x48/0x78
Apr 14 15:07:30 karen kernel: [] ? blk_peek_request+0xe0/0x1b8
Apr 14 15:07:30 karen kernel: [] ? scsi_request_fn+0x7a/0x412
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x16/0x1c
Apr 14 15:07:30 karen kernel: [] ? __blk_run_queue+0x45/0x75
Apr 14 15:07:30 karen kernel: [] ? blk_run_queue+0x23/0x37
Apr 14 15:07:30 karen kernel: [] ? scsi_run_queue+0x2e6/0x323
Apr 14 15:07:30 karen kernel: [] ? scsi_next_command+0x36/0x46
Apr 14 15:07:30 karen kernel: [] ? scsi_io_completion+0x3ec/0x43b
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? scsi_finish_command+0xe0/0xe9
Apr 14 15:07:30 karen kernel: [] ? scsi_softirq_done+0x104/0x10d
Apr 14 15:07:30 karen kernel: [] ? blk_done_softirq+0x6e/0x82
Apr 14 15:07:30 karen kernel: [] ? __do_softirq+0xc1/0x181
Apr 14 15:07:30 karen kernel: [] ? tick_program_event+0x1a/0x1c
Apr 14 15:07:30 karen kernel: [] ? call_softirq+0x1c/0x30
Apr 14 15:07:30 karen kernel: [] ? do_softirq+0x41/0x7c
Apr 14 15:07:30 karen kernel: [] ? irq_exit+0x36/0x78
Apr 14 15:07:30 karen kernel: [] ? smp_apic_timer_interrupt+0x79/0x87
Apr 14 15:07:30 karen kernel: [] ? apic_timer_interrupt+0x13/0x20
Apr 14 15:07:30 karen kernel: [] ? spin_unlock_irq+0xe/0x10
Apr 14 15:07:30 karen kernel: [] ? scsi_request_fn+0x358/0x412
Apr 14 15:07:30 karen kernel: [] ? __generic_unplug_device+0x32/0x36
Apr 14 15:07:30 karen kernel: [] ? elv_insert+0x172/0x17b
Apr 14 15:07:30 karen kernel: [] ? __elv_add_request+0x6e/0x75
Apr 14 15:07:30 karen kernel: [] ? __make_request+0x353/0x3ce
Apr 14 15:07:30 karen kernel: [] ? try_to_freeze+0xf/0x26
Apr 14 15:07:30 karen kernel: [] ? generic_make_request+0x285/0x2fe
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x16/0x1c
Apr 14 15:07:30 karen kernel: [] ? submit_bio+0xdb/0xfa
Apr 14 15:07:30 karen kernel: [] ? bio_alloc_bioset+0x71/0xbf
Apr 14 15:07:30 karen kernel: [] ? __vdev_disk_physio+0x27f/0x2fc [zfs]
Apr 14 15:07:30 karen kernel: [] ? vdev_disk_io_start+0xcf/0xe7 [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_vdev_io_start+0x217/0x22a [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_nowait+0xe4/0x10e [zfs]
Apr 14 15:07:30 karen kernel: [] ? wake_up_process+0x10/0x12
Apr 14 15:07:30 karen kernel: [] ? vdev_queue_io_done+0x76/0xad [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_vdev_io_done+0x6c/0x157 [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_execute+0xc1/0xeb [zfs]
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? taskq_thread+0x2db/0x31b [spl]
Apr 14 15:07:30 karen kernel: [] ? default_wake_function+0x0/0xf
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? taskq_thread+0x0/0x31b [spl]
Apr 14 15:07:30 karen kernel: [] ? kthread+0x7d/0x85
Apr 14 15:07:30 karen kernel: [] ? kernel_thread_helper+0x4/0x10
Apr 14 15:07:30 karen kernel: [] ? kthread+0x0/0x85
Apr 14 15:07:30 karen kernel: [] ? kernel_thread_helper+0x0/0x10
Apr 14 15:07:30 karen kernel: Mem-Info:
Apr 14 15:07:30 karen kernel: Node 0 DMA per-cpu:
Apr 14 15:07:30 karen kernel: CPU 0: hi: 0, btch: 1 usd: 0
Apr 14 15:07:30 karen kernel: CPU 1: hi: 0, btch: 1 usd: 0
Apr 14 15:07:30 karen kernel: Node 0 DMA32 per-cpu:
Apr 14 15:07:30 karen kernel: CPU 0: hi: 186, btch: 31 usd: 0
Apr 14 15:07:30 karen kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 14 15:07:30 karen kernel: active_anon:89437 inactive_anon:92346 isolated_anon:0
Apr 14 15:07:30 karen kernel: active_file:8904 inactive_file:52856 isolated_file:0
Apr 14 15:07:30 karen kernel: unevictable:1045 dirty:20 writeback:0 unstable:0
Apr 14 15:07:30 karen kernel: free:731 slab_reclaimable:12803 slab_unreclaimable:45413
Apr 14 15:07:30 karen kernel: mapped:25626 shmem:99295 pagetables:7389 bounce:0
Apr 14 15:07:30 karen kernel: Node 0 DMA free:1872kB min:36kB low:44kB high:52kB active_anon:72kB inactive_anon:1016kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:1088kB slab_reclaimable:160kB slab_unreclaimable:344kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 14 15:07:30 karen kernel: lowmem_reserve[]: 0 2884 2884 2884
Apr 14 15:07:30 karen kernel: Node 0 DMA32 free:1052kB min:6852kB low:8564kB high:10276kB active_anon:357676kB inactive_anon:368368kB active_file:35616kB inactive_file:211424kB unevictable:4180kB isolated(anon):0kB isolated(file):0kB present:2953592kB mlocked:4180kB dirty:80kB writeback:0kB mapped:102504kB shmem:396092kB slab_reclaimable:51052kB slab_unreclaimable:181308kB kernel_stack:4048kB pagetables:29556kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 14 15:07:30 karen kernel: lowmem_reserve[]: 0 0 0 0
Apr 14 15:07:30 karen kernel: Node 0 DMA: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1872kB
Apr 14 15:07:30 karen kernel: Node 0 DMA32: 27*4kB 27*8kB 9*16kB 5*32kB 3*64kB 2*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1076kB
Apr 14 15:07:30 karen kernel: 161756 total pagecache pages
Apr 14 15:07:30 karen kernel: 1 pages in swap cache
Apr 14 15:07:30 karen kernel: Swap cache stats: add 1, delete 0, find 0/0
Apr 14 15:07:30 karen kernel: Free swap = 4194640kB
Apr 14 15:07:30 karen kernel: Total swap = 4194644kB
Apr 14 15:07:30 karen kernel: 752880 pages RAM
Apr 14 15:07:30 karen kernel: 13522 pages reserved
Apr 14 15:07:30 karen kernel: 131626 pages shared
Apr 14 15:07:30 karen kernel: 615629 pages non-shared
Apr 14 15:07:30 karen kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Apr 14 15:07:30 karen kernel: cache: kmalloc-128, object size: 128, buffer size: 128, default order: 0, min order: 0
Apr 14 15:07:30 karen kernel: node 0: slabs: 1150, objs: 36800, free: 0
Apr 14 15:07:30 karen kernel: z_wr_int/5: page allocation failure. order:0, mode:0x20
Apr 14 15:07:30 karen kernel: Pid: 1073, comm: z_wr_int/5 Tainted: P 2.6.38karen.dragonfear #1
Apr 14 15:07:30 karen kernel: Call Trace:
Apr 14 15:07:30 karen kernel: [] ? __alloc_pages_nodemask+0x677/0x6f6
Apr 14 15:07:30 karen kernel: [] ? alloc_pages_current+0xac/0xc5
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x18/0x1e
Apr 14 15:07:30 karen kernel: [] ? alloc_slab_page+0x17/0x45
Apr 14 15:07:30 karen kernel: [] ? new_slab+0x68/0x1a2
Apr 14 15:07:30 karen kernel: [] ? __slab_alloc+0x21f/0x2fd
Apr 14 15:07:30 karen kernel: [] ? do_intel_finish_page_flip+0x65/0x1f7 [i915]
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? kmem_cache_alloc+0x56/0xde
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? scsi_pool_alloc_command+0x41/0x66
Apr 14 15:07:30 karen kernel: [] ? scsi_host_alloc_command+0x1c/0x67
Apr 14 15:07:30 karen kernel: [] ? __scsi_get_command+0x19/0x94
Apr 14 15:07:30 karen kernel: [] ? scsi_get_command+0x38/0xa0
Apr 14 15:07:30 karen kernel: [] ? scsi_setup_fs_cmnd+0x69/0xb9
Apr 14 15:07:30 karen kernel: [] ? sd_prep_fn+0x282/0xb91
Apr 14 15:07:30 karen kernel: [] ? __ata_scsi_queuecmd+0x189/0x1db
Apr 14 15:07:30 karen kernel: [] ? blk_peek_request+0xe0/0x1b8
Apr 14 15:07:30 karen kernel: [] ? scsi_request_fn+0x7a/0x412
Apr 14 15:07:30 karen kernel: [] ? __blk_run_queue+0x45/0x75
Apr 14 15:07:30 karen kernel: [] ? blk_run_queue+0x23/0x37
Apr 14 15:07:30 karen kernel: [] ? scsi_run_queue+0x2e6/0x323
Apr 14 15:07:30 karen kernel: [] ? scsi_next_command+0x36/0x46
Apr 14 15:07:30 karen kernel: [] ? scsi_io_completion+0x3ec/0x43b
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? scsi_finish_command+0xe0/0xe9
Apr 14 15:07:30 karen kernel: [] ? scsi_softirq_done+0x104/0x10d
Apr 14 15:07:30 karen kernel: [] ? blk_done_softirq+0x6e/0x82
Apr 14 15:07:30 karen kernel: [] ? __do_softirq+0xc1/0x181
Apr 14 15:07:30 karen kernel: [] ? tick_program_event+0x1a/0x1c
Apr 14 15:07:30 karen kernel: [] ? call_softirq+0x1c/0x30
Apr 14 15:07:30 karen kernel: [] ? do_softirq+0x41/0x7c
Apr 14 15:07:30 karen kernel: [] ? irq_exit+0x36/0x78
Apr 14 15:07:30 karen kernel: [] ? smp_apic_timer_interrupt+0x79/0x87
Apr 14 15:07:30 karen kernel: [] ? apic_timer_interrupt+0x13/0x20
Apr 14 15:07:30 karen kernel: [] ? spin_unlock_irq+0xe/0x10
Apr 14 15:07:30 karen kernel: [] ? scsi_request_fn+0x358/0x412
Apr 14 15:07:30 karen kernel: [] ? __generic_unplug_device+0x32/0x36
Apr 14 15:07:30 karen kernel: [] ? elv_insert+0x172/0x17b
Apr 14 15:07:30 karen kernel: [] ? __elv_add_request+0x6e/0x75
Apr 14 15:07:30 karen kernel: [] ? __make_request+0x353/0x3ce
Apr 14 15:07:30 karen kernel: [] ? try_to_freeze+0xf/0x26
Apr 14 15:07:30 karen kernel: [] ? generic_make_request+0x285/0x2fe
Apr 14 15:07:30 karen kernel: [] ? arch_local_irq_save+0x16/0x1c
Apr 14 15:07:30 karen kernel: [] ? submit_bio+0xdb/0xfa
Apr 14 15:07:30 karen kernel: [] ? bio_alloc_bioset+0x71/0xbf
Apr 14 15:07:30 karen kernel: [] ? __vdev_disk_physio+0x27f/0x2fc [zfs]
Apr 14 15:07:30 karen kernel: [] ? vdev_disk_io_start+0xcf/0xe7 [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_vdev_io_start+0x217/0x22a [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_nowait+0xe4/0x10e [zfs]
Apr 14 15:07:30 karen kernel: [] ? wake_up_process+0x10/0x12
Apr 14 15:07:30 karen kernel: [] ? vdev_queue_io_done+0x76/0xad [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_vdev_io_done+0x6c/0x157 [zfs]
Apr 14 15:07:30 karen kernel: [] ? zio_execute+0xc1/0xeb [zfs]
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? taskq_thread+0x2db/0x31b [spl]
Apr 14 15:07:30 karen kernel: [] ? default_wake_function+0x0/0xf
Apr 14 15:07:30 karen kernel: [] ? _raw_spin_unlock_irqrestore+0x12/0x14
Apr 14 15:07:30 karen kernel: [] ? taskq_thread+0x0/0x31b [spl]
Apr 14 15:07:30 karen kernel: [] ? kthread+0x7d/0x85
Apr 14 15:07:30 karen kernel: [] ? kernel_thread_helper+0x4/0x10
Apr 14 15:07:30 karen kernel: [] ? kthread+0x0/0x85
Apr 14 15:07:30 karen kernel: [] ? kernel_thread_helper+0x0/0x10
Apr 14 15:07:30 karen kernel: Mem-Info:
Apr 14 15:07:30 karen kernel: Node 0 DMA per-cpu:
Apr 14 15:07:30 karen kernel: CPU 0: hi: 0, btch: 1 usd: 0
Apr 14 15:07:30 karen kernel: CPU 1: hi: 0, btch: 1 usd: 0
Apr 14 15:07:30 karen kernel: Node 0 DMA32 per-cpu:
Apr 14 15:07:30 karen kernel: CPU 0: hi: 186, btch: 31 usd: 2
Apr 14 15:07:30 karen kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 14 15:07:30 karen kernel: active_anon:89437 inactive_anon:92346 isolated_anon:0
Apr 14 15:07:30 karen kernel: active_file:8904 inactive_file:52856 isolated_file:0
Apr 14 15:07:30 karen kernel: unevictable:1045 dirty:20 writeback:0 unstable:0
Apr 14 15:07:30 karen kernel: free:731 slab_reclaimable:12803 slab_unreclaimable:45413
Apr 14 15:07:30 karen kernel: mapped:25626 shmem:99295 pagetables:7389 bounce:0
Apr 14 15:07:30 karen kernel: Node 0 DMA free:1872kB min:36kB low:44kB high:52kB active_anon:72kB inactive_anon:1016kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15684kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:1088kB slab_reclaimable:160kB slab_unreclaimable:344kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 14 15:07:30 karen kernel: lowmem_reserve[]: 0 2884 2884 2884
Apr 14 15:07:30 karen kernel: Node 0 DMA32 free:1052kB min:6852kB low:8564kB high:10276kB active_anon:357676kB inactive_anon:368368kB active_file:35616kB inactive_file:211424kB unevictable:4180kB isolated(anon):0kB isolated(file):0kB present:2953592kB mlocked:4180kB dirty:80kB writeback:0kB mapped:102504kB shmem:396092kB slab_reclaimable:51052kB slab_unreclaimable:181308kB kernel_stack:4048kB pagetables:29556kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 14 15:07:30 karen kernel: lowmem_reserve[]: 0 0 0 0
Apr 14 15:07:30 karen kernel: Node 0 DMA: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1872kB
Apr 14 15:07:30 karen kernel: Node 0 DMA32: 27*4kB 27*8kB 9*16kB 5*32kB 3*64kB 2*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1076kB
Apr 14 15:07:30 karen kernel: 161756 total pagecache pages
Apr 14 15:07:30 karen kernel: 1 pages in swap cache
Apr 14 15:07:30 karen kernel: Swap cache stats: add 1, delete 0, find 0/0
Apr 14 15:07:30 karen kernel: Free swap = 4194640kB
Apr 14 15:07:30 karen kernel: Total swap = 4194644kB
Apr 14 15:07:30 karen kernel: 752880 pages RAM
Apr 14 15:07:30 karen kernel: 13522 pages reserved
Apr 14 15:07:30 karen kernel: 131626 pages shared
Apr 14 15:07:30 karen kernel: 615627 pages non-shared
Apr 14 15:07:30 karen kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x20)
Apr 14 15:07:30 karen kernel: cache: kmalloc-128, object size: 128, buffer size: 128, default order: 0, min order: 0
Apr 14 15:07:30 karen kernel: node 0: slabs: 1150, objs: 36800, free: 0
Apr 14 15:08:08 karen kernel: CE: hpet increased min_delta_ns to 20113 nsec

@gunnarbeutner
Contributor

Rudd-O: Make sure you're using 1b09ecd2089d7c94ece0f3fb1743a88394f348f9 rather than ab84ad20ad8f726a056acc63f1a457a31b482aae - there's a missing iput() in the "old" patch which leads to leaked inode references.
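
The bug class here is simply an unbalanced reference on an error path; a minimal sketch (invented names, not the actual patch):

#include <stdio.h>

struct inode_like { int i_count; };

static void iget_like(struct inode_like *ip) { ip->i_count++; }
static void iput_like(struct inode_like *ip) { ip->i_count--; }

static int lookup_like(struct inode_like *ip, int fail)
{
        iget_like(ip);
        if (fail) {
                iput_like(ip);  /* the put the broken patch was missing */
                return -1;
        }
        /* On success the caller owns one reference and must put it later. */
        return 0;
}

int main(void)
{
        struct inode_like ip = { 0 };

        (void)lookup_like(&ip, 1);      /* take the error path */
        printf("refcount after failed lookup: %d (must be 0)\n", ip.i_count);
        return 0;
}

Without that put, every failed lookup leaks one reference, the refcount never returns to zero, and the inode can never be evicted.
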

@gunnarbeutner
Contributor

And I've removed another superfluous iput() (in the if (ip == NULL) branch; this wouldn't cause any crashes though): a540221fd9d55fbb4f7f8b52b2106d107d237b58

@Rudd-O
Contributor

Rudd-O commented Apr 15, 2011

Gunnar: cherry-picked a54. Going to test real soon.

@user318

user318 commented Apr 21, 2011

I cannot reproduce the "kernel BUG at ... inode.c" with the a540221 patch. But during the tests ZFS ate a lot of memory; I tried turning swap off mid-test, got several "SLUB: Unable to allocate memory" errors, and turned swap back on later. After some time the system seemed to become deadlocked, because it stopped eating CPU. When I tried to ^C the running ls, du, and rsync, I got this in the log (part of it):

Apr 21 12:32:31 plum-old kernel: arc_reclaim   D ffff880024709d10     0  2274      2 0x00000000
Apr 21 12:32:31 plum-old kernel: ffff880024709cf0 0000000000000046 0000000000000000 ffff880024709c10
Apr 21 12:32:31 plum-old kernel: 0000000000000000 0000000000000001 ffff88003f493180 0000000000013180
Apr 21 12:32:31 plum-old kernel: 0000000000013180 ffff88003ca56d00 0000000000013180 ffff880024709fd8
Apr 21 12:32:31 plum-old kernel: Call Trace:
Apr 21 12:32:31 plum-old kernel: [] ? lock_timer_base.clone.23+0x36/0x70
Apr 21 12:32:31 plum-old kernel: [] __mutex_lock_slowpath+0x10a/0x290
Apr 21 12:32:31 plum-old kernel: [] ? thread_generic_wrapper+0x0/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] mutex_lock+0x11/0x30
Apr 21 12:32:31 plum-old kernel: [] arc_evict+0x67/0x600 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? spl_slab_reclaim+0x3e/0x280 [spl]
Apr 21 12:32:31 plum-old kernel: [] ? thread_generic_wrapper+0x0/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] arc_adjust+0x19e/0x1e0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] arc_reclaim_thread+0x72/0x130 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? arc_reclaim_thread+0x0/0x130 [zfs]
Apr 21 12:32:31 plum-old kernel: [] thread_generic_wrapper+0x73/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] kthread+0x96/0xa0
Apr 21 12:32:31 plum-old kernel: [] kernel_thread_helper+0x4/0x10
Apr 21 12:32:31 plum-old kernel: [] ? kthread+0x0/0xa0
Apr 21 12:32:31 plum-old kernel: [] ? kernel_thread_helper+0x0/0x10
Apr 21 12:32:31 plum-old kernel: txg_quiesce   D 0000000107006dac     0  2584      2 0x00000000
Apr 21 12:32:31 plum-old kernel: ffff88000c731d90 0000000000000046 0000000000000003 ffff88000c731cb0
Apr 21 12:32:31 plum-old kernel: 0000000000000000 0000000000000000 0000000000013180 0000000000013180
Apr 21 12:32:31 plum-old kernel: 0000000000013180 ffff8800203e4420 0000000000013180 ffff88000c731fd8
Apr 21 12:32:31 plum-old kernel: Call Trace:
Apr 21 12:32:31 plum-old kernel: [] ? check_preempt_curr+0x84/0xa0
Apr 21 12:32:31 plum-old kernel: [] ? try_to_wake_up+0x1a8/0x300
Apr 21 12:32:31 plum-old kernel: [] ? __mutex_lock_slowpath+0x1e7/0x290
Apr 21 12:32:31 plum-old kernel: [] cv_wait_common+0x7b/0xe0 [spl]
Apr 21 12:32:31 plum-old kernel: [] ? autoremove_wake_function+0x0/0x40
Apr 21 12:32:31 plum-old kernel: [] __cv_wait+0xe/0x10 [spl]
Apr 21 12:32:31 plum-old kernel: [] txg_quiesce_thread+0x1eb/0x2b0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? txg_quiesce_thread+0x0/0x2b0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? thread_generic_wrapper+0x0/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] thread_generic_wrapper+0x73/0x90 [spl]
Apr 21 12:32:31 plum-old kernel: [] kthread+0x96/0xa0
Apr 21 12:32:31 plum-old kernel: [] kernel_thread_helper+0x4/0x10
Apr 21 12:32:31 plum-old kernel: [] ? kthread+0x0/0xa0
Apr 21 12:32:31 plum-old kernel: [] ? kernel_thread_helper+0x0/0x10
Apr 21 12:32:31 plum-old kernel: cp            D ffff880030803898     0  4129   3703 0x00000004
Apr 21 12:32:31 plum-old kernel: ffff880030803878 0000000000000082 ffff88003b6b9000 ffff880030803798
Apr 21 12:32:31 plum-old kernel: ffff88003b8a0000 0000000000000246 ffff8800308037a8 0000000000013180
Apr 21 12:32:31 plum-old kernel: 0000000000013180 ffff88003bb63680 0000000000013180 ffff880030803fd8
Apr 21 12:32:31 plum-old kernel: Call Trace:
Apr 21 12:32:31 plum-old kernel: [] ? cpumask_next_and+0x36/0x50
Apr 21 12:32:31 plum-old kernel: [] ? select_idle_sibling+0x95/0x160
Apr 21 12:32:31 plum-old kernel: [] ? native_sched_clock+0x15/0x70
Apr 21 12:32:31 plum-old kernel: [] __mutex_lock_slowpath+0x10a/0x290
Apr 21 12:32:31 plum-old kernel: [] ? zio_data_buf_free+0x0/0x30 [zfs]
Apr 21 12:32:31 plum-old kernel: [] mutex_lock+0x11/0x30
Apr 21 12:32:31 plum-old kernel: [] arc_change_state.clone.5+0x2a0/0x2f0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] arc_evict+0x50f/0x600 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? default_wake_function+0xd/0x10
Apr 21 12:32:31 plum-old kernel: [] arc_get_data_buf.clone.13+0x236/0x460 [zfs]
Apr 21 12:32:31 plum-old kernel: [] arc_buf_alloc+0xb8/0xf0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] dbuf_read+0x3cd/0x840 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? zio_destroy+0xa9/0xe0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? dsl_pool_sync_context+0x23/0x30 [zfs]
Apr 21 12:32:31 plum-old kernel: [] dmu_buf_will_dirty+0x71/0xd0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] dmu_write_uio_dnode+0x70/0x150 [zfs]
Apr 21 12:32:31 plum-old kernel: [] dmu_write_uio_dbuf+0x44/0x60 [zfs]
Apr 21 12:32:31 plum-old kernel: [] zfs_write+0xbb8/0xbe0 [zfs]
Apr 21 12:32:31 plum-old kernel: [] ? __lock_page_killable+0x62/0x70
Apr 21 12:32:31 plum-old kernel: [] ? do_sync_read+0xd2/0x110
Apr 21 12:32:31 plum-old kernel: [] zpl_write_common+0x4d/0x80 [zfs]
Apr 21 12:32:31 plum-old kernel: [] zpl_write+0x63/0x90 [zfs]
Apr 21 12:32:31 plum-old kernel: [] vfs_write+0xc6/0x190
Apr 21 12:32:31 plum-old kernel: [] sys_write+0x4c/0x80
Apr 21 12:32:31 plum-old kernel: [] system_call_fastpath+0x16/0x1b

@behlendorf
Contributor Author

Nice fix, Gunnar (a540221fd9d55fbb4f7f8b52b2106d107d237b58)! This looks very good; after some careful review and testing I've pulled it into the master branch. This should help a lot, thanks for taking the time to fix this.

ryao added a commit to ryao/zfs that referenced this issue Mar 25, 2014
openzfs#180 occurred because of a race between inode eviction and
zfs_zget(). openzfs/zfs@36df284
tried to address it by making an upcall to the VFS to learn whether an
inode is being evicted and spin until it leaves eviction. This is a hack
around the fact that we cannot ensure "SA" does immediate eviction by
hooking into generic_drop_inode(), which is GPL exported and cannot be
wrapped. Unfortunately, the act of calling ilookup to avoid this race
during writeback creates a deadlock:

[  602.268492] INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
[  602.268496]       Tainted: P           O 3.13.6 #1
[  602.268498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  602.268500] kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
[  602.268511] Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
[  602.268522]  ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
[  602.268526]  ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
[  602.268530]  ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
[  602.268534] Call Trace:
[  602.268541]  [<ffffffff813dd1d4>] schedule+0x24/0x70
[  602.268546]  [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
[  602.268552]  [<ffffffff810821c0>] ? autoremove_wake_function+0x40/0x40
[  602.268555]  [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
[  602.268559]  [<ffffffff811608c5>] ilookup+0x65/0xd0
[  602.268590]  [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
[  602.268594]  [<ffffffff813e013b>] ? __mutex_lock_slowpath+0x21b/0x360
[  602.268613]  [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
[  602.268631]  [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
[  602.268634]  [<ffffffff813dfe79>] ? mutex_unlock+0x9/0x10
[  602.268651]  [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
[  602.268669]  [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
[  602.268674]  [<ffffffff811071ec>] do_writepages+0x1c/0x40
[  602.268677]  [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
[  602.268680]  [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
[  602.268684]  [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
[  602.268689]  [<ffffffff81062cc1>] ? set_worker_desc+0x71/0x80
[  602.268692]  [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
[  602.268696]  [<ffffffff813e12b4>] ? _raw_spin_unlock_irq+0x14/0x40
[  602.268700]  [<ffffffff8106fd19>] ? finish_task_switch+0x59/0x130
[  602.268703]  [<ffffffff8106072c>] process_one_work+0x16c/0x490
[  602.268706]  [<ffffffff810613f3>] worker_thread+0x113/0x390
[  602.268710]  [<ffffffff810612e0>] ? manage_workers.isra.27+0x2a0/0x2a0
[  602.268713]  [<ffffffff81066edf>] kthread+0xdf/0x100
[  602.268717]  [<ffffffff8107877e>] ? arch_vtime_task_switch+0x8e/0xa0
[  602.268720]  [<ffffffff81066e00>] ? kthread_create_on_node+0x190/0x190
[  602.268723]  [<ffffffff813e71fc>] ret_from_fork+0x7c/0xb0
[  602.268730]  [<ffffffff81066e00>] ? kthread_create_on_node+0x190/0x190

The return value from igrab() gives us the information that ilookup()
provided without the risk of a deadlock. Ideally, we should ask upstream
to export generic_drop_inode() so that we can wrap it to properly handle
this situation, but until then, let's rely on the return value of
igrab() so that we do not deadlock here.

In addition, zfs_zget() should exit with a hold on the inode, but that
was never done in the Linux port when the inode had already been
constructed. This can lead to undefined behavior, such as inodes being
evicted while they still have users. The function is modified so that a
successful exit holds the inode, as the code expects.

Signed-off-by: Richard Yao <[email protected]>

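To make the retry concrete, here is a minimal sketch of the igrab()-based check as it might sit in zfs_zget() once an existing znode has been found. This is illustrative only, not the committed code: ZTOI(), z_lock, and sa_buf_rele() follow the ZFS on Linux sources, but the surrounding lookup, locking, and error handling are heavily simplified.

/*
 * Sketch only -- not the committed zfs_zget() change.  igrab() takes
 * a reference on the inode, or returns NULL if the inode is already
 * being freed (I_FREEING/I_WILL_FREE).  Unlike ilookup(), it never
 * blocks in __wait_on_freeing_inode(), so writeback cannot deadlock
 * waiting on the very inode it is flushing.
 */
again:
	/* ... locate the znode zp for obj_num, taking zp->z_lock ... */
	if (igrab(ZTOI(zp)) == NULL) {
		/* The inode is mid-eviction; drop locks and retry. */
		mutex_exit(&zp->z_lock);
		sa_buf_rele(db, NULL);
		goto again;
	}
	mutex_exit(&zp->z_lock);
	sa_buf_rele(db, NULL);
	*zpp = zp;	/* a successful exit now holds the inode */
	return (0);
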
behlendorf pushed a commit that referenced this issue Apr 4, 2014
#180 occurred because of a race between inode eviction and
zfs_zget(). 36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted.  If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources.  Unfortunately, this introduced another deadlock.

  INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
        Tainted: P           O 3.13.6 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
  Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
   ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
   ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
   ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
  Call Trace:
   [<ffffffff813dd1d4>] schedule+0x24/0x70
   [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
   [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
   [<ffffffff811608c5>] ilookup+0x65/0xd0
   [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
   [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
   [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
   [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
   [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
   [<ffffffff811071ec>] do_writepages+0x1c/0x40
   [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
   [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
   [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
   [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
   [<ffffffff8106072c>] process_one_work+0x16c/0x490
   [<ffffffff810613f3>] worker_thread+0x113/0x390
   [<ffffffff81066edf>] kthread+0xdf/0x100

This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks.  Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used.  This gives us the information that ilookup() provided without
the risk of a deadlock.

Alternately, this race could be closed by registering an sops->drop_inode()
callback.  The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.

Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #180
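For completeness, the sops->drop_inode() alternative mentioned in that last paragraph would look roughly like the following. This is hypothetical code, not anything from the repository; the default behavior is open-coded because, as the earlier commit message notes, generic_drop_inode() is GPL-exported and cannot be called from the module.

/*
 * Hypothetical sketch of the alternative fix: a ->drop_inode()
 * callback that tells the VFS not to evict an inode while ZFS still
 * holds its SA handle.  Not code from the repository.
 */
static int
zpl_drop_inode(struct inode *ip)
{
	znode_t *zp = ITOZ(ip);

	/* An active SA hold means ZFS still needs this inode. */
	if (zp->z_sa_hdl != NULL)
		return (0);

	/*
	 * Otherwise mimic generic_drop_inode(), open-coded here since
	 * the real symbol is GPL-exported and unavailable to the module.
	 */
	return (!ip->i_nlink || inode_unhashed(ip));
}

/* Registered via the existing super_operations table:   */
/*	.drop_inode	= zpl_drop_inode,                 */
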
prateekpandey14 pushed a commit to prateekpandey14/zfs that referenced this issue Mar 13, 2019
upgrade issue from 0.7 to 0.8 due to strict check on degrade_io_seq key (openzfs#180)

Signed-off-by: Vishnu Itta <[email protected]>
sdimitro pushed a commit to sdimitro/zfs that referenced this issue Feb 14, 2022
The `active` column in `zpool iostat -oq` is usually 0, even when there
are actually many active operations.  The problem is that the code that
displays the counts is trying to render them as a difference from the
previous value.
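That is, `active` is a point-in-time gauge rather than a cumulative counter, so diffing two successive samples collapses to roughly zero. A toy illustration in C (all names invented for this sketch):

/*
 * Toy illustration (all names invented).  'active' is sampled as a
 * point-in-time gauge; under a steady workload successive samples are
 * nearly equal, so the delta the display code computed stays near 0
 * even with many operations in flight.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t prev_active = 128;	/* sample at time t     */
	uint64_t cur_active = 130;	/* sample at time t + 1 */

	printf("buggy:   %llu\n",
	    (unsigned long long)(cur_active - prev_active));	/* ~0 */
	printf("correct: %llu\n",
	    (unsigned long long)cur_active);	/* the actual gauge */
	return (0);
}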