deadlock between mm_sem and tx assign in zfs_write() and page fault #7939
Conversation
Force-pushed from 7372bd7 to 629bcae
@wgqimut thanks for doing the leg work on this lock inversion. I agree with your analysis of the issue, but let me suggest an alternate solution rather than disabling page faulting in `zfs_write()`. Normally in a system call context we'd want to drop the offending locks and then try everything again. That's not an option here, and a common caller is `zpl_dirty_inode()`, which must always succeed. What do you think about this proposed change instead, behlendorf/zfs@f7a900f? It works as expected for me locally with your reproducer (thanks for that!), but I'd really like to know how it fares in your environment.
@behlendorf Thanks for your quick reply and review. Actually, we considered a similar approach to your proposed fix at the very beginning. Our concern is that bypassing the throttle check for `zfs_dirty_inode` has a potential risk of building a huge txg with too many dirty pages in highly frequent mmap access scenarios. On the other hand, I don't see an obvious performance penalty from disabling page faults during `copy_from_user` in the `zfs_write()` context. There is a prefault before the txg assign for every write iteration inside `zfs_write()`; the prefault exists to avoid page faults in `uiomove()`, so most pages should have been faulted in before `uiomove()` runs. The txg reassign and prefault retry therefore ONLY happen when a prefaulted page is evicted. Copying data from user space with page faults disabled is not something we invented; other filesystems do the same in their write() paths, for example btrfs, ext4, fuse, and even `generic_perform_write()`. Our approach tries to fix the issue without breaking the zfs txg write throttle.
In addition, my test program cannot reproduce the original hang effectively and consistently; sometimes the hang triggers in minutes, and sometimes it takes weeks to reproduce in my tests. So far, my fixed version has been tested for about 2 weeks, so far so good. Do you have any good ideas for the testing?
That would be my major concern as well, in addition to potentially introducing the possibility for an inode update to be dropped. Regarding any performance penalty I have to say that was more of a gut feeling on my part. If you have performance data that says otherwise, I'm happy to withdraw that concern. Given the above I think the best solution here may be to use parts of both PRs. What do you think about something like this:
```c
	waited = B_FALSE;
top:
	error = dmu_tx_assign(tx,
	    waited ? (TXG_NOTHROTTLE | TXG_WAIT) : TXG_NOWAIT);
	if (error) {
		if (error == ERESTART && waited == B_FALSE) {
			waited = B_TRUE;
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);
		ZFS_EXIT(zfsvfs);
		return (error);
	}
```
As for making the test failure more likely you might try decreasing the |
Thanks for the comments. I accept your suggestion and am working on the refinement. After the modification and testing, I will send you the test results.
I have done my modification and testing, and there is something to explain. 1. About the performance numbers:
| | old zfs write | new zfs write | old zfs rewrite | new zfs rewrite |
|---|---|---|---|---|
| | 747 | 774 | 731 | 772 |
| | 803 | 837 | 779 | 749 |
| | 769 | 795 | 765 | 758 |
| | 786 | 762 | 741 | 747 |
| | 797 | 776 | 726 | 738 |
| | 775 | 749 | 714 | 726 |
| | 763 | 782 | 694 | 728 |
| | 765 | 743 | 719 | 717 |
| | 755 | 742 | 717 | 721 |
| | 768 | 744 | 742 | 705 |
| Average | 773 | 770 | 733 | 736 |
The performance issue does not exist.
Excellent, thanks for confirming that.
module/zcommon/zfs_uio.c
Outdated
```c
	if (fault_disable) {
		pagefault_disable();
		if (__copy_from_user_inatomic(p,
		    iov->iov_base+skip, cnt)) {
```
Switching to `__copy_from_user_inatomic` will disable the `access_ok()` checks. Can you point me to where these checks are still being done for the write?
You are right, I should have checked `access_ok()` before calling `__copy_from_user_inatomic`. I found that `copy_from_user` is never called with page faults disabled by other kernel code, while `__copy_from_user_inatomic` can be called with page faults disabled.
module/zfs/zfs_vnops.c
Outdated
```c
	tx_bytes = uio->uio_resid;
	error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl),
```
> what we can do is commit this tx, or unassign this tx.

It sounds like you investigated committing the empty tx and decided against it. Can you explain why? It looks like it would be relatively straightforward to slightly rework the existing `if (tx_bytes == 0) {}` case to `continue` on `EFAULT`.
I just thought it looks weird and not straightforward. If you think committing an empty txg is a better option, I'm okay with that. I will make the change.
Personally I find committing the empty txg to be clearer. Although, I can see the argument for adding a reassign function to handle cases like this. Let's get @ahrens' thoughts on this.
I'd prefer to commit the empty tx and then go back to where the "Start a transaction." comment is. From your previous conversation, it sounds like dmu_write_uio_dbuf() will fail infrequently, so performance is not a concern here.
module/zcommon/zfs_uio.c
Outdated
```c
@@ -79,8 +81,19 @@ uiomove_iov(void *p, size_t n, enum uio_rw rw, struct uio *uio)
		if (copy_to_user(iov->iov_base+skip, p, cnt))
			return (EFAULT);
	} else {
		if (copy_from_user(p, iov->iov_base+skip, cnt))
			return (EFAULT);
		if (fault_disable) {
```
Perhaps we could do this with a flag in the `uio_t` instead to avoid changing all these interfaces.
module/zcommon/zfs_uio.c
Outdated
```c
			pagefault_enable();
		} else {
			if (copy_from_user(p,
			    iov->iov_base+skip, cnt))
```
cstyle: add spaces around the `+` operator
Force-pushed from 498b443 to 4530de4
module/zfs/dmu_tx.c
Outdated
```c
@@ -1040,6 +1040,7 @@ dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how)
		return (0);
	}
```
nit: extra white space.
module/zfs/zfs_vnops.c
Outdated
```c
	if (error == EFAULT) {
		uio_prefaultpages(MIN(n, max_blksz), uio);
		dmu_tx_commit(tx);
		goto top;
```
I think it would be more correct to `continue` here and go all the way back to the top of the while loop to re-run the quota checks. It's possible we may now be over quota.
Force-pushed from cc157bc to 1dcdf88
Thanks the updated PR is looking almost ready to integrate. When addressing this round of review feedback, please go ahead and rebase the PR on the latest code in the master branch. That should resolve the kmemleak failure reported by the Coverage builder.
module/zcommon/zfs_uio.c
Outdated
```c
				return (EFAULT);
		if (uio->uio_fault_disable) {
			if (!access_ok(VERIFY_READ,
			    (iov->iov_base+skip), cnt)) {
```
nit: please add a space before and after the `+` for the style checker. You can run `make checkstyle` locally to run the same checks.
module/zcommon/zfs_uio.c
Outdated
```c
			pagefault_disable();
			if (__copy_from_user_inatomic(p,
			    (iov->iov_base+skip), cnt)) {
```
nit: same white space around `+`.
module/zcommon/zfs_uio.c
Outdated
```c
			pagefault_enable();
		} else {
			if (copy_from_user(p,
			    (iov->iov_base+skip), cnt))
```
nit: same white space around `+`.
module/zfs/zfs_vnops.c
Outdated
```c
	error = dmu_tx_assign(tx, TXG_WAIT);
	boolean_t waited = B_FALSE;
	error = dmu_tx_assign(tx,
	    waited ? (TXG_NOTHROTTLE | TXG_WAIT) : TXG_NOWAIT);
```
I think it would be more idiomatic, as well as easier to understand, if we do the same thing as other similar code, i.e.:
```c
	error = dmu_tx_assign(tx, (waited ? TXG_NOTHROTTLE : 0) | TXG_NOWAIT);
	if (error) {
		if (error == ERESTART) {
			waited = B_TRUE;
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);
		ZFS_EXIT(zfsvfs);
		return (error);
	}
```
If you search for NOTHROTTLE in this file, you'll see 8 other cases that do it this way.
There's an extra complication which isn't obvious when reading this code in isolation, which is why it was written a little differently. Despite the fact that this function allows an error to be returned, it's called from `zpl_dirty_inode()`, which is a Linux VFS callback function (`.dirty_inode`) that must always succeed:

```c
void (*dirty_inode) (struct inode *, int flags);
```

This is why the existing code used `TXG_WAIT`, since it is handled slightly differently in `dmu_tx_try_assign()`:
```c
	if (spa_suspended(spa)) {
		...
		if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_CONTINUE &&
		    !(txg_how & TXG_WAIT))
			return (SET_ERROR(EIO));
		return (SET_ERROR(ERESTART));
	}
```
If we don't use `TXG_WAIT` it's possible the time update could be dropped and there's no way to report it. Perhaps we should instead add a comment explaining why this one is different.
In that case, please add a comment explaining why we need to use TXG_WAIT. And I think we would need to use TXG_WAIT every time, not just if the first NOWAIT call fails.
I'm sorry, I still don't understand how this works. The first time through, we call `dmu_tx_assign(TXG_NOWAIT)`. If the pool is suspended, it will return EIO (per the code @behlendorf quoted above). `zfs_dirty_inode()` will then `goto out` and return EIO. According to the comment you added, this would be incorrect behavior, because it "must always succeed".
You are right, we should always assign with `TXG_NOTHROTTLE | TXG_WAIT`; then `dmu_tx_assign` can only return `ERESTART`, and we try again and again until we assign successfully.
```c
	if (tx->tx_dir != NULL && asize != 0) {
		int err = dsl_dir_tempreserve_space(tx->tx_dir, memory,
		    asize, tx->tx_netfree, &tx->tx_tempreserve_cookie, tx);
		if (err != 0)
			return (err);
	}
```
`dmu_tx_try_assign` may return an error like `EDQUOT` or `ENOSPC` at this point, which will make the `dirty_inode` op fail. If I always retry no matter what error `dmu_tx_assign` returns, this may lead to an infinite loop. Do you have any suggestions?
If the required semantic is "must always succeed", then your only options are to retry (potentially leading to an infinite loop), or panic (halt operation).
Note that if you get ENOSPC or EDQUOT (rather than ERESTART), then retrying will only work if some other process happens to free up space. We could consider using dmu_tx_mark_netfree() to greatly reduce the chances of ENOSPC/EDQUOT, by allowing it to use half the slop space, like unlink(). But it's still possible to get ENOSPC, so we'll have to decide how to handle it. See also the comments above dmu_tx_mark_netfree() and dsl_synctask.h:ZFS_SPACE_CHECK_RESERVED.
My previous comment was not accurate. We have no way to guarantee that dirty_inode always succeeds; EDQUOT/ENOSPC cannot be entirely avoided. After looking at the implementations of other filesystems, I think the best effort is to retry once and then give up. What do you think?
> required semantic is "must always succeed"...

The required semantics are that the inode must be marked dirty so it will be written later. The code happens to take this opportunity to assign a transaction for the newly dirtied inode. However, the VFS does provide a second `.write_inode` callback where the transaction assignment could be moved to; it does allow for failures and properly handles them. From `Documentation/filesystems/vfs.txt` the required semantics are:

```c
void (*dirty_inode) (struct inode *, int flags);
int (*write_inode) (struct inode *, int);

dirty_inode: this method is called by the VFS to mark an inode dirty.

write_inode: this method is called when the VFS needs to write an
	inode to disc. The second parameter indicates whether the write
	should be synchronous or not, not all filesystems check this flag.
```

Splitting this up is something which has been on the list to investigate, but it hasn't been critical since the existing code works well in practice. Getting this right and tested on all the supported kernels might also be tricky.

For the purposes of this PR, why don't we leave this code unchanged and continue to use `TXG_WAIT`. This won't change the behavior and resolves the deadlock at hand. Then in a follow-up PR, if @wgqimut is available, he can investigate implementing the `.write_inode` callback. For frequently dirtied inodes there's potentially a significant performance win to be had.
module/zfs/zfs_vnops.c
Outdated
```c
	tx = dmu_tx_create(zfsvfs->z_os);

	dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);
	zfs_sa_upgrade_txholds(tx, zp);

	error = dmu_tx_assign(tx, TXG_WAIT);
	boolean_t waited = B_FALSE;
```
Please move this to the beginning of the function
Note that this is not just a stylistic suggestion; if we `goto top`, we want `waited` to be `TRUE`.
Signed-off-by: Grady Wong <[email protected]>
Codecov Report
```
@@            Coverage Diff            @@
##           master    #7939    +/-   ##
==========================================
- Coverage   78.64%    78.5%   -0.14%
==========================================
  Files         377      377
  Lines      114333   114232     -101
==========================================
- Hits        89912    89679     -233
- Misses      24421    24553     +132
```

Continue to review full report at Codecov.
Signed-off-by: Grady Wong <[email protected]>
Signed-off-by: Grady Wong <[email protected]>
Thanks for your perseverance in getting all these last little issues addressed.
I'm marking this Accepted, but we need to check the test failures to make sure they are not related.

These are all known existing issues. This is good to go.
The bug time sequence: 1. thread #1, `zfs_write` assigns a txg "n". 2. In the same process, thread #2 takes an mmap page fault (which means `mm_sem` is held); `zfs_dirty_inode` fails to open a txg and waits for the previous txg "n" to complete. 3. thread #1 calls `uiomove` to write; however, a page fault occurs in `uiomove`, which needs `mm_sem`, but `mm_sem` is held by thread #2, so it gets stuck and can't complete, and txg "n" will not complete. So thread #1 and thread #2 are deadlocked. Reviewed-by: Chunwei Chen <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Grady Wong <[email protected]> Closes openzfs#7939
The bug time sequence:

1. Thread #1: `zfs_write` assigns a txg "n".
2. In the same process, thread #2 takes an mmap page fault (which means `mm_sem` is held); `zfs_dirty_inode` fails to open a txg and waits for the previous txg "n" to complete.
3. Thread #1 calls `uiomove` to write; however, a page fault occurs in `uiomove`, which needs `mm_sem`. But `mm_sem` is held by thread #2, so thread #1 gets stuck and can't complete, and txg "n" will not complete.

So thread #1 and thread #2 deadlock.
Signed-off-by: Grady Wong [email protected]
Motivation and Context
#7512
Description
How Has This Been Tested?
Testing was done based on zfs-0.7.9:
Regression testing: fstress, ztest
Unit testing program: limit system memory to 1.5GB and try to simulate a race between file write and mmap access in one process. One thread keeps writing 1 byte of each page of file A; the other thread keeps modifying 10 bytes of one mmap'd page of file B. This method increases the page fault rate inside `uiomove()`. The unit test program is attached below.

I didn't find an easy way to reproduce the problem consistently, but I no longer hit the problem in my tests so far with the fix applied. If you have any good ideas for reproducing the problem, please shed some light.
Types of changes
Checklist:
Signed-off-by.