
PANIC at dbuf.c:105:dbuf_dest() #3443

Closed
nedbass opened this issue May 26, 2015 · 10 comments

nedbass commented May 26, 2015

Reproduced on master at 4 commits past 0.6.4 (7fad629), and again at 77 commits past 0.6.4 (65037d9). A process running unlink() destroys a dbuf while the txg_sync thread still holds db->db_mtx.

The reproducer runs 64 concurrent copies of xattrtest (https://github.com/nedbass/xattrtest/tree/x). The ASSERT hits reliably during the unlink phase. I used this test script:

#!/bin/bash

# Run 64 concurrent xattrtest instances for the given phase, then sum
# the per-instance operations/second reported in their logs.
doit()
{
        local test=$1
        local dir=$2
        for ((i = 0; i < 64; i++)) ; do
                mkdir -p $dir/$i
                ( ./xattrtest -o $test -f 1024 -R -s 512 -p $dir/$i -v -d > $dir/$i/xattrtest.log ) &
        done
        wait
        awk '/\/second$/ {print $4}' $dir/*/xattrtest.log | awk '{sum += $1} END {print sum}'
}

d=`mktemp -d /mnt/fish/XXXXXXXX`
echo -n " create: "
( doit 1 $d )
echo -n " setxattr: "
( doit 2 $d )
echo -n " getxattr: "
( doit 3 $d )
echo -n " unlink: "
( doit 4 $d )
echo

2015-05-26 10:21:44 ZFS: Loaded module v0.6.4-4_g7fad629 (DEBUG mode), ZFS pool version 5000, ZFS filesystem version 5
2015-05-26 10:22:25 SPL: using hostid 0xa8c08c40
2015-05-26 10:26:45 VERIFY3(((*(volatile typeof((&db->db_mtx)->m_owner) *)&((&db->db_mtx)->m_owner))) == ((void *)0)) failed (ffff88202691e040 == (null))
2015-05-26 10:26:45 PANIC at dbuf.c:105:dbuf_dest()
Showing stack for process 60942
crash> bt 60942
PID: 60942  TASK: ffff881015300040  CPU: 28  COMMAND: "xattrtest"
 #0 [ffff88101364d7e8] schedule at ffffffff8152d740
 #1 [ffff88101364d8c0] spl_panic at ffffffffa08fd23d [spl]
 #2 [ffff88101364da60] dbuf_dest at ffffffffa0a6a3c8 [zfs]
 #3 [ffff88101364da80] spl_kmem_cache_free at ffffffffa08f9e06 [spl]
 #4 [ffff88101364dac0] dbuf_destroy at ffffffffa0a6a7b9 [zfs]
 #5 [ffff88101364db00] dbuf_evict at ffffffffa0a6d62a [zfs]
 #6 [ffff88101364db20] dnode_evict_bonus at ffffffffa0a9d2a7 [zfs]
 #7 [ffff88101364db50] dbuf_rele_and_unlock at ffffffffa0a6d0bb [zfs]
 #8 [ffff88101364dc00] dbuf_rele at ffffffffa0a6d2d5 [zfs]
 #9 [ffff88101364dc20] dmu_buf_rele at ffffffffa0a6e57e [zfs]
#10 [ffff88101364dc30] sa_handle_destroy at ffffffffa0aca4ec [zfs]
#11 [ffff88101364dc50] zfs_znode_dmu_fini at ffffffffa0b4a91f [zfs]
#12 [ffff88101364dc90] zfs_znode_delete at ffffffffa0b4ad3e [zfs]
#13 [ffff88101364dce0] zfs_rmnode at ffffffffa0b21358 [zfs]
#14 [ffff88101364dd20] zfs_zinactive at ffffffffa0b4abb4 [zfs]
#15 [ffff88101364dd60] zfs_inactive at ffffffffa0b4161f [zfs]
#16 [ffff88101364ddb0] zpl_clear_inode at ffffffffa0b66f82 [zfs]
#17 [ffff88101364ddd0] clear_inode at ffffffff811ad8e2
#18 [ffff88101364ddf0] zpl_inode_delete at ffffffffa0b66c80 [zfs]
#19 [ffff88101364de10] generic_delete_inode at ffffffff811adfce
#20 [ffff88101364de40] zpl_drop_inode at ffffffffa0b66c9e [zfs]
#21 [ffff88101364de50] iput at ffffffff811acf62
#22 [ffff88101364de70] do_unlinkat at ffffffff811a1b39
#23 [ffff88101364df70] sys_unlink at ffffffff811a1c06
#24 [ffff88101364df80] system_call_fastpath at ffffffff8100b0b2
    RIP: 00002aaaaada9657  RSP: 00007fffffffd988  RFLAGS: 00010246
    RAX: 0000000000000057  RBX: ffffffff8100b0b2  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 00000000004032fc  RDI: 0000000000616010
    RBP: 00007fffffffe2c0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: ffffffff811a1c06
    R13: ffff88101364df78  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b

Here we see txg_sync is the mutex owner:

crash> task ffff88202691e040 | head -1
PID: 59618  TASK: ffff88202691e040  CPU: 25  COMMAND: "txg_sync"
[root@zwicky-lcw-mds1:xattrtest]# uname -a
Linux zwicky-lcw-mds1 2.6.32-504.12.2.2chaos.ch5.3.x86_64 #1 SMP Fri Mar 20 13:20:55 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux

nedbass commented May 26, 2015

This duplicates #3249. I'll leave this open since there's more information here.


nedbass commented May 27, 2015

Bisected the bug to this commit:

4c7b7ee Illumos 5630 - stale bonus buffer in recycled dnode_t leads to data corruption

Not surprising since that's where dnode_evict_bonus() was added.

@behlendorf

@nedbass I suspect the additional dnode_evict_bonus() call here is just exposing another existing bug, one which would probably be harmless and go unnoticed on other platforms, because I don't believe they ASSERT when destroying a mutex if the current task is the holder.

I haven't dug too deeply into this, but I'm concerned about the following hunk from dbuf_clear(). It's not at all obvious to me that we're guaranteed arc_clear_callback() has dropped the lock and set dbuf_gone appropriately. I'd be very tempted to add an ASSERT(!MUTEX_HELD(&db->db_mtx)); after this block. This must be true, and I suspect it's not.

        if (db->db_buf)
                dbuf_gone = arc_clear_callback(db->db_buf);

        if (!dbuf_gone)
                mutex_exit(&db->db_mtx);


nedbass commented May 27, 2015

I added the ASSERT as you suggested but didn't hit it before hitting the one from this issue.


nedbass commented May 27, 2015

I added some debugging to the SPL mutex implementation to get a stack trace from the thread holding the mutex being destroyed.

diff --git a/include/sys/mutex.h b/include/sys/mutex.h
index 9b297e9..fec1d68 100644
--- a/include/sys/mutex.h
+++ b/include/sys/mutex.h
@@ -29,6 +29,10 @@
 #include <linux/mutex.h>
 #include <linux/compiler_compat.h>

+
+#define SPL_MUTEX_MAGIC 0xABABABABABABABAB
+#define SPL_MUTEX_POISON ~(SPL_MUTEX_MAGIC)
+
 typedef enum {
        MUTEX_DEFAULT   = 0,
        MUTEX_SPIN      = 1,
@@ -39,6 +43,7 @@ typedef struct {
        struct mutex            m_mutex;
        spinlock_t              m_lock; /* used for serializing mutex_exit */
        kthread_t               *m_owner;
+       uint64_t                m_magic;
 } kmutex_t;

 #define        MUTEX(mp)               (&((mp)->m_mutex))
@@ -75,18 +80,20 @@ spl_mutex_clear_owner(kmutex_t *mp)
        __mutex_init(MUTEX(mp), (name) ? (#name) : (#mp), &__key); \
        spin_lock_init(&(mp)->m_lock);                          \
        spl_mutex_clear_owner(mp);                              \
+       (mp)->m_magic = SPL_MUTEX_MAGIC;                        \
 }

 #undef mutex_destroy
 #define        mutex_destroy(mp)                                       \
 {                                                              \
-       VERIFY3P(mutex_owner(mp), ==, NULL);                    \
+       (mp)->m_magic = SPL_MUTEX_POISON;                       \
 }

 #define        mutex_tryenter(mp)                                      \
 ({                                                             \
        int _rc_;                                               \
                                                                \
+       VERIFY3P((mp)->m_magic, ==, SPL_MUTEX_MAGIC);           \
        if ((_rc_ = mutex_trylock(MUTEX(mp))) == 1)             \
                spl_mutex_set_owner(mp);                        \
                                                                \
@@ -97,6 +104,7 @@ spl_mutex_clear_owner(kmutex_t *mp)
 #define        mutex_enter_nested(mp, subclass)                        \
 {                                                              \
        ASSERT3P(mutex_owner(mp), !=, current);                 \
+       VERIFY3P((mp)->m_magic, ==, SPL_MUTEX_MAGIC);           \
        mutex_lock_nested(MUTEX(mp), (subclass));               \
        spl_mutex_set_owner(mp);                                \
 }
@@ -104,6 +112,7 @@ spl_mutex_clear_owner(kmutex_t *mp)
 #define        mutex_enter_nested(mp, subclass)                        \
 {                                                              \
        ASSERT3P(mutex_owner(mp), !=, current);                 \
+       VERIFY3P((mp)->m_magic, ==, SPL_MUTEX_MAGIC);           \
        mutex_lock(MUTEX(mp));                                  \
        spl_mutex_set_owner(mp);                                \
 }
@@ -133,6 +142,7 @@ spl_mutex_clear_owner(kmutex_t *mp)
 #define        mutex_exit(mp)                                          \
 {                                                              \
        spin_lock(&(mp)->m_lock);                               \
+       VERIFY3P((mp)->m_magic, ==, SPL_MUTEX_MAGIC);           \
        spl_mutex_clear_owner(mp);                              \
        mutex_unlock(MUTEX(mp));                                \
        spin_unlock(&(mp)->m_lock);                             \

Here is the stack trace I got from txg_sync.

VERIFY3((&db->db_mtx)->m_magic == 0xABABABABABABABAB) failed (5454545454545454 == abababababababab)
PANIC at dmu_objset.c:1390:dmu_objset_userquota_get_ids()
Showing stack for process 21160
crash> bt 21160
PID: 21160  TASK: ffff88201b6e9500  CPU: 20  COMMAND: "txg_sync"
 #0 [ffff88201932b700] schedule at ffffffff8152d740
 #1 [ffff88201932b7d8] spl_panic at ffffffffa0792b0d [spl]
 #2 [ffff88201932b978] dmu_objset_userquota_get_ids at ffffffffa0aa3310 [zfs]
 #3 [ffff88201932b9e8] dnode_sync at ffffffffa0abf9fe [zfs]
 #4 [ffff88201932bad8] dmu_objset_sync_dnodes at ffffffffa0aa2a77 [zfs]
 #5 [ffff88201932bb28] dmu_objset_sync at ffffffffa0aa2da6 [zfs]
 #6 [ffff88201932bc18] dsl_dataset_sync at ffffffffa0ac2be2 [zfs]
 #7 [ffff88201932bc48] dsl_pool_sync at ffffffffa0ad8a11 [zfs]
 #8 [ffff88201932bcc8] spa_sync at ffffffffa0af83ab [zfs]
 #9 [ffff88201932bda8] txg_sync_thread at ffffffffa0b16f85 [zfs]
#10 [ffff88201932beb8] thread_generic_wrapper at ffffffffa078eb31 [spl]
#11 [ffff88201932bee8] kthread at ffffffff8109ff4e
#12 [ffff88201932bf48] kernel_thread at ffffffff8100c24a

@behlendorf

That stack is tremendously helpful. There's definitely a race in dmu_objset_userquota_get_ids() where dn->dn_bonus is checked and trusted outside the dn->dn_struct_rwlock lock. One way to tackle this would be the following, but I don't much care for this fix, since I suspect that in the common case dn->dn_struct_rwlock isn't held but a reference is, which makes that case safe. So consider this a starting point for further discussion.

diff --git a/module/zfs/dmu_objset.c b/module/zfs/dmu_objset.c
index ae4e1dd..0f455cd 100644
--- a/module/zfs/dmu_objset.c
+++ b/module/zfs/dmu_objset.c
@@ -1325,7 +1325,7 @@ dmu_objset_userquota_get_ids(dnode_t *dn, boolean_t before
        if (before && dn->dn_bonuslen != 0)
                data = DN_BONUS(dn->dn_phys);
        else if (!before && dn->dn_bonuslen != 0) {
-               if (dn->dn_bonus) {
+               if (RW_WRITE_HELD(&dn->dn_struct_rwlock) && dn->dn_bonus) {
                        db = dn->dn_bonus;
                        mutex_enter(&db->db_mtx);
                        data = dmu_objset_userquota_find_data(db, tx);

As an aside, I don't see why this issue wouldn't be in illumos as well.

nedbass added a commit to nedbass/zfs that referenced this issue May 28, 2015
The function dmu_objset_userquota_get_ids() checks and uses dn->dn_bonus
outside of dn_struct_rwlock. If the dnode is being freed then the bonus
dbuf may be in the process of getting evicted. In this case there is a
race that may cause dmu_objset_userquota_get_ids() to access the dbuf
after it has been destroyed. To prevent this, ensure that we are either
holding dn_struct_rwlock or a reference to the bonus dbuf when calling
dmu_objset_userquota_get_ids(). Rename dmu_objset_userquota_get_ids()
with an _impl suffix and add a wrapper function that takes a reference
on the bonus dbuf (if needed) before calling it. This was done to keep
the code changes simple.

Secondly, make a small change to dbuf_try_add_ref(). It checks
db->db_blkid which may not be safe since it doesn't yet have a hold on
the dbuf. Use the blkid argument instead.

Signed-off-by: Ned Bass <[email protected]>
Issue openzfs#3443
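The _impl/wrapper split the commit describes can be sketched in userspace (hypothetical toy names; the real change is in dmu_objset.c): the _impl function assumes the caller has pinned the buffer via either the lock or a hold, and the wrapper supplies that guarantee by taking a hold when the lock isn't held.

```c
#include <stdbool.h>

/* Toy model: a refcounted buffer that may only be inspected while the
 * caller holds either the owning rwlock or a reference. */
typedef struct {
	int  db_holds;
	bool db_valid;
} bonus_buf_t;

static void bonus_buf_hold(bonus_buf_t *db) { db->db_holds++; }
static void bonus_buf_rele(bonus_buf_t *db) { db->db_holds--; }

/* _impl: the caller must have pinned the buffer one way or the other;
 * dereferencing db is safe only because of that guarantee. */
static bool get_ids_impl(bonus_buf_t *db, bool rwlock_held)
{
	return (rwlock_held || db->db_holds > 0) && db->db_valid;
}

/* Wrapper: take a hold when the lock isn't already held, mirroring
 * the _impl/wrapper split in the commit above. */
static bool get_ids(bonus_buf_t *db, bool rwlock_held)
{
	bool ret;

	if (!rwlock_held)
		bonus_buf_hold(db);
	ret = get_ids_impl(db, rwlock_held);
	if (!rwlock_held)
		bonus_buf_rele(db);
	return ret;
}
```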
nedbass added a commit that referenced this issue Jun 5, 2015
- Don't check db->db_blkid directly, but use the blkid argument
  instead. Checking db->db_blkid may be unsafe since we don't yet
  have a hold on the dbuf, so its validity is unknown.

- Call mutex_exit() on found_db, not db, since it's not certain that
  they point to the same dbuf, and the mutex was taken on found_db.

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3443
@nedbass nedbass closed this as completed in 5f8e1e8 Jun 5, 2015
@behlendorf behlendorf added this to the 0.7.0 milestone Sep 25, 2015
@behlendorf

Re-opening issue. The proposed fix for this has been reverted because it introduced #3789.

@behlendorf behlendorf reopened this Sep 25, 2015
behlendorf added a commit that referenced this issue Sep 25, 2015
This reverts commit 5f8e1e8.  It
was determined that this patch introduced the quota regression
described in #3789.

Signed-off-by: Tim Chase <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #3443
Issue #3789

nedbass commented Sep 29, 2015

https://reviews.csiden.org/r/245/ may be relevant to this bug


nedbass commented Oct 1, 2015

@scsiguy I haven't yet tried your patch, but I'm no longer able to reproduce this bug under ZoL 0.6.5. I could test it under the older version where this bug does occur, but I don't think that would be particularly useful. First I want to find out which change "fixed" this bug in our tree.

nedbass pushed a commit to nedbass/zfs that referenced this issue Oct 2, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
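The db_pending_evict scheme above can be modeled in a few lines of userspace C (hypothetical toy names, not the ported code): explicit eviction destroys an idle buffer inline, otherwise sets the flag, and the last reference release performs the deferred destroy, as dbuf_rele_and_unlock() does for a marked dbuf.

```c
#include <stdbool.h>

/* Toy model of deferred eviction via a per-buffer flag. */
typedef struct {
	int  refcount;
	bool pending_evict;	/* destroy when the last hold drains */
	bool destroyed;
} toy_dbuf_t;

static void toy_destroy(toy_dbuf_t *db) { db->destroyed = true; }

static void toy_hold(toy_dbuf_t *db) { db->refcount++; }

/* The last release of a buffer marked for eviction carries out the
 * deferred destroy. */
static void toy_rele(toy_dbuf_t *db)
{
	if (--db->refcount == 0 && db->pending_evict)
		toy_destroy(db);
}

/* Explicit eviction: destroy now if idle, otherwise defer. */
static void toy_evict(toy_dbuf_t *db)
{
	if (db->refcount == 0)
		toy_destroy(db);
	else
		db->pending_evict = true;
}
```

Because the flag is only set from the explicit eviction paths (dnode_evict_dbufs()/dnode_evict_bonus() in the real code), there is no window where an in-flight reference can see the buffer disappear prematurely, which is exactly the gap in the dn_type/dn_free_txg test described above.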
ryao pushed a commit to ClusterHQ/zfs that referenced this issue Oct 6, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
ryao pushed a commit to ClusterHQ/zfs that referenced this issue Oct 9, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
ryao pushed a commit to ClusterHQ/zfs that referenced this issue Oct 9, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
ryao pushed a commit to ClusterHQ/zfs that referenced this issue Oct 9, 2015
ryao pushed a commit to ClusterHQ/zfs that referenced this issue Oct 9, 2015
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Oct 13, 2015
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <[email protected]>
Reviewed by: Xin LI <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>

References:
  https://www.illumos.org/issues/6267
  illumos/illumos-gate@d205810

Signed-off-by: Brian Behlendorf <[email protected]>
Ported-by: Ned Bass [email protected]
Issue openzfs#3865
Issue openzfs#3443
nedbass pushed a commit that referenced this issue Oct 14, 2015
@behlendorf behlendorf modified the milestones: 0.6.5.3, 0.7.0 Oct 15, 2015
@behlendorf
Contributor

Upstream fix f9f5394 applied. Closing.
