Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix avl_is_empty(&dn->dn_dbufs) assertion fails in dnode_sync_free #3865

Closed
wants to merge 3 commits into from

Conversation

ryao
Copy link
Contributor

@ryao ryao commented Oct 1, 2015

The fix for Illumos 5056 (0c66c32) non-deterministically breaks zfs destroy in ClusterHQ's testing. The ClusterHQ issue is here:

https://clusterhq.atlassian.net/browse/ZFS-37

The Illumos 5056 fix was a deadlock fix. From past experience, when a fix introduces a regression, the right thing to do is to revert it in favor of a different solution that avoids the original problem. This appears to be the right thing to do here because:

  1. The Illumos 5056 patch added complexity for little in terms of documented gain. The only documented gain beyond the deadlock fix were stack space improvements that we did not particularly need.
  2. The original deadlock that appears to be fixable by a much simpler patch.
  3. The Illumos 5056 fix has been problematic in that it caused multiple other regressions that were fixed in Illumos by 0c66c32, 6186e29 and possibly other commits.

ryao added 3 commits October 1, 2015 11:59
This reverts commit f467b05.

This is because it is a change to
0c66c32, which is being reverted.

Signed-off-by: Richard Yao <[email protected]>
This reverts commit 0c66c32.

It caused a runtime failure:

https://clusterhq.atlassian.net/browse/ZFS-37

The original deadlock must be resolved differently.

All Spectralogic copyright notices introduced in the reverted commit
have been retained due to either additional changes that depend on them
that we presently have or additional changes that depend on them that we
yet to have.

Signed-off-by: Richard Yao <[email protected]>
Illumos 5056 was intended to fix this deadlock, but it addressed it in a
manner that added significant complexity without much in terms of
documented gains aside from fixing the deadlock. It also introduced a
runtime failure when assertions were enabled. It has been reverted in
favor of this simpler solution where we protect user callbacks with a
new db_user_mtx that allows us to drop the db_mtx during the callback.

Signed-off-by: Richard Yao <[email protected]>
@ryao ryao force-pushed the clusterhq-zfs-37 branch from aefcdd6 to 09ed063 Compare October 1, 2015 17:34
@ryao
Copy link
Contributor Author

ryao commented Oct 1, 2015

It looks like Justin Gibbs wrote a fix for this issue. I am closing my pull request in favor of it:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202607#c22

@ryao ryao closed this Oct 1, 2015
@dweeezil
Copy link
Contributor

dweeezil commented Oct 1, 2015

I had meant to reference #3789 because he mentioned it there, too.

@scsiguy
Copy link
Contributor

scsiguy commented Oct 1, 2015

This history behind 5056 is a bit more complex than the illumos issue would lead one to believe. The patch was developed to fix several lock-order reversals and blocking loops in ZFS. Issue 5056 just happened to be filed around the time the patch was ready for review and so served as a good proxy for the work.

Did you have any additional analysis to show that the drop and reacquire of the dbuf mutex in your patch is safe? I'm guessing it is not.

@nedbass
Copy link
Contributor

nedbass commented Oct 2, 2015

The test case in #3443 turns out to be a fairly reliable reproducer of this bug. https://reviews.csiden.org/r/245/ seems to fix it. I left the test case running in a loop overnight and had zero occurrences.

nedbass pushed a commit to nedbass/zfs that referenced this pull request Oct 2, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
@scsiguy
Copy link
Contributor

scsiguy commented Oct 5, 2015

@nedbass can you provide review feedback at https://reviews.csiden.org/r/245/? I need a few reviewers before I can RTI it.

ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Oct 6, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Oct 9, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Oct 9, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Oct 9, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
ryao pushed a commit to ClusterHQ/zfs that referenced this pull request Oct 9, 2015
The bonus buffer associated with a dnode is expected to remain resident
until: the dnode is evicted via dnode_buf_pageout(), the dnode is
freed in dnode_sync_free(), or the objset containing the dnode is
evicted via dmu_objset_evict(). However, since bonus buffers (and DMU
buffers in general) can have draining references when these events occur,
dbuf_rele_and_unlock() has logic to ensure that once these late references
are released affected buffers will be evicted.

dbuf_rele_and_unlock() currently checks for a dbuf for an evicting
objset via the os->os_evicting flag, and detects buffers for a freed
dnode by testing dn->dn_type and dn->dn_free_txg fields. Unfortunately,
the free'd dnode test can fire prematurely - anytime after the dnode is
scheduled to be freed via dnode_free() until the free is committed in
dnode_sync_free(). If all references to the bonus buffer are dropped
within this window, the bonus buffer will be evicted and code in
dnode_sync() that relies on the bonus buffer will fail.

Additionally, the "free'd dnode test" isn't applied to normal buffers
(buffers that are not the bonus buffer) and there is no mechanism to
guarantee eviction in the dnode_buf_pageout() case (the dnode is not
being freed nor is the objset being evicted).

Replace the two existing deferred eviction mechanisms with a per-dbuf
flag, db_pending_evict. This is set when explicit eviction is requested
via either dnode_evict_dbufs() or dnode_evict_bonus(). These actions
only occur after it is safe for dnode buffers to be evicted (e.g. the
bonus buffer will not be referenced again).

uts/common/fs/zfs/sys/dbuf.h:
	Add comments for boolean fields in dmu_buf_impl_t.

	Add the db_pending_evict field.

uts/common/fs/zfs/sys/dbuf.h:
uts/common/fs/zfs/dbuf.c:
	Rename db_immediate_evict to db_user_immediate_evict to avoid
	confusion between dbuf user state eviction and deferred eviction
	of a dbuf.

uts/common/fs/zfs/dbuf.c:
	Consistently use TRUE/FALSE for boolean fields in
	dmu_buf_impl_t.

	Simplify pending eviction logic to use the new db_pending_evict
	flag in all cases.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/sys/dmu_objset.h:
	Remove objset_t's os_evicting field. This same functionality
	is now provided by db_pending_evict.

uts/common/fs/zfs/dnode_sync.c:
	In dnode_evict_dbufs() and dnode_evict_bonus(), mark dbufs
	with draining references (dbufs that can't be evicted inline)
	as pending_evict.

	In dnode_sync_free(), remove ASSERT() that a dnode being free'd
	has no active dbufs. This is usually the case, but is not
	guaranteed due to draining references. (e.g. The deadlist for a
	deleted dataset may still be open if another thread referenced
	the dataset just before it was freed and the dsl_dataset_t hasn't
	been released or is still being evicted).

openzfs#3865
openzfs#3443
Ported-by: Ned Bass <[email protected]>
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Oct 13, 2015
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <[email protected]>
Reviewed by: Xin LI <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>

References:
  https://www.illumos.org/issues/6267
  illumos/illumos-gate@d205810

Signed-off-by: Brian Behlendorf <[email protected]>
Ported-by: Ned Bass [email protected]
Issue openzfs#3865
Issue openzfs#3443
nedbass pushed a commit that referenced this pull request Oct 14, 2015
6267 dn_bonus evicted too early
Reviewed by: Richard Yao <[email protected]>
Reviewed by: Xin LI <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Richard Lowe <[email protected]>

References:
  https://www.illumos.org/issues/6267
  illumos/illumos-gate@d205810

Signed-off-by: Brian Behlendorf <[email protected]>
Ported-by: Ned Bass <[email protected]>
Issue #3865
Issue #3443
@behlendorf
Copy link
Contributor

Merged as:

bc4501f Illumos 6267 - dn_bonus evicted too early

@behlendorf behlendorf added this to the 0.6.5.3 milestone Oct 15, 2015
MorpheusTeam pushed a commit to Xyratex/lustre-stable that referenced this pull request Dec 3, 2015
Bug Fixes

* Fix CPU hotplug openzfs/spl#482
* Disable dynamic taskqs by default to avoid deadlock
  openzfs/spl#484
* Don't import all visible pools in zfs-import init script
  openzfs/zfs#3777
* Fix use-after-free in vdev_disk_physio_completion
  openzfs/zfs#3920
* Fix avl_is_empty(&dn->dn_dbufs) assertion openzfs/zfs#3865

Signed-off-by: Nathaniel Clark <[email protected]>
Change-Id: I36347630be2506bee4ff0a05f1b236ba2ba7a0ae
Reviewed-on: http://review.whamcloud.com/16877
Reviewed-by: Alex Zhuravlev <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Tested-by: Jenkins
Tested-by: Maloo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants